Scala002: Selecting Multiple Columns from a DataFrame

非衣所思 2022-08-04


Intro

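  I am cleaning data with Scala and need to union two datasets. To guard against the two DataFrames having mismatched fields, I first select the columns they have in common. That is the background. Version info: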

  • scala:2.11.12
  • spark:2.4.4

Building the data

import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
val df = Seq(
("A1", 25, 1,0.64,0.36),
("A1", 26, 1,0.34,0.66),
("B1", 27, 0,0.55,0.45),
("C1", 30, 0,0.14,0.86)
).toDF("id", "age", "label","pro0","pro1")
df.printSchema()
df.show()

root
|-- id: string (nullable = true)
|-- age: integer (nullable = false)
|-- label: integer (nullable = false)
|-- pro0: double (nullable = false)
|-- pro1: double (nullable = false)

+---+---+-----+----+----+
| id|age|label|pro0|pro1|
+---+---+-----+----+----+
| A1| 25|    1|0.64|0.36|
| A1| 26|    1|0.34|0.66|
| B1| 27|    0|0.55|0.45|
| C1| 30|    0|0.14|0.86|
+---+---+-----+----+----+






df: org.apache.spark.sql.DataFrame = [id: string, age: int ... 3 more fields]

Selecting multiple columns

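The premise here is that we already have the collection of column names. Given that, we can pull out the corresponding columns directly with select.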

df.columns.slice(0,3)

res1: Array[String] = Array(id, age, label)

df.select(df.columns.slice(0,3).map(name => col(name)): _*).show()

+---+---+-----+
| id|age|label|
+---+---+-----+
| A1| 25|    1|
| A1| 26|    1|
| B1| 27|    0|
| C1| 30|    0|
+---+---+-----+
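
The same selection can be written in two ways when the names live in a collection. The sketch below is only an illustration: the `wanted` list is a hypothetical stand-in for wherever the names come from, the sample rows are a shortened copy of the data above, and the `SparkSession` line is needed only outside spark-shell, where `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Only needed in a standalone script; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("select-columns").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A1", 25, 1, 0.64, 0.36),
  ("B1", 27, 0, 0.55, 0.45)
).toDF("id", "age", "label", "pro0", "pro1")

// Hypothetical list of wanted column names (for example, read from a config).
val wanted: Seq[String] = Seq("id", "age", "label")

// Variant 1: turn each name into a Column, then expand the Seq as varargs.
val viaColumns = df.select(wanted.map(col): _*)

// Variant 2: the select(String, String*) overload takes the first name separately.
val viaStrings = df.select(wanted.head, wanted.tail: _*)
```

Variant 2 calls `wanted.head`, so it fails on an empty list, whereas Variant 1 accepts one; that is one reason to prefer the `Column` form when the list is built dynamically.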

Of course, you can also write df.select("id","age","label"), but that approach does not work well when there are many column names or when they are not fixed.
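
For the union scenario described in the Intro, the column list itself has to be computed. The following is a minimal sketch under stated assumptions: `dfA`, `dfB`, and the helper `pick` are hypothetical names invented for the example, and the matching is a plain case-sensitive comparison of column names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Only needed in a standalone script; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("union-common-columns").getOrCreate()
import spark.implicits._

// Two hypothetical DataFrames whose columns only partly overlap and differ in order.
val dfA = Seq(("A1", 25, 1, 0.64)).toDF("id", "age", "label", "pro0")
val dfB = Seq(("B1", 0, 27, "x")).toDF("id", "label", "age", "extra")

// Names present in both DataFrames, kept in dfA's order.
val common: Seq[String] = dfA.columns.intersect(dfB.columns).toSeq

// Hypothetical helper: select the given names from a DataFrame.
def pick(df: DataFrame, names: Seq[String]): DataFrame =
  df.select(names.map(col): _*)

// union matches columns by position, so both sides are selected in the same order first.
val merged = pick(dfA, common).union(pick(dfB, common))
merged.show()
```

Because both sides are selected with the same `common` list, the positional `union` lines the columns up correctly. Spark 2.3 and later also offer `unionByName`, which matches by name instead, but it still requires both sides to carry the same set of columns.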

                                2020-03-04, Qixia District, Nanjing

