join
Joins two DataFrames on a given condition. A related method is crossJoin,
which returns the Cartesian product of the two DataFrames.
cond = [df['name'] == df1['name'], df['age'] == df1['age']]
df.join(df1, cond, 'outer').select(df.name, df1.age)
lit
Creates a Column holding a literal (constant) value.
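from pyspark.sql.functions import lit

file = '/mnt/dbwarehouse/raw/user.csv'
# Add two constant-valued columns: the source file path and a boolean flag
df = df.withColumn('ingest_file', lit(file)) \
    .withColumn('converted', lit(True))
display(df)  # display() is a Databricks notebook helper; use df.show() elsewhere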
na.fill & fillna
Replaces null values. df.na.fill() and df.fillna() are aliases of each other.
df.na.fill(50).show() # name is a string column, so it is not filled
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|    80|Alice|
|  5|    50|  Bob|
| 50|    50|  Tom|
| 50|    50| null|
+---+------+-----+
df.na.fill(False).show() # only spy is a boolean column, so only that column is filled
+----+-------+-----+
| age|   name|  spy|
+----+-------+-----+
|  10|  Alice|false|
|   5|    Bob|false|
|null|Mallory| true|
+----+-------+-----+
df.na.fill({'age': 50, 'name': 'unknown'}).show() # specify the fill value per column name
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+