Grouping data
Use the DataFrame groupBy method to create a grouped data object. This grouped data object is called RelationalGroupedDataset in Scala and GroupedData in Python.
df.groupBy("geo.state", "geo.city")
Grouped data methods
Various aggregate methods are available on the grouped data object.
eventCountsDF = df.groupBy("event_name").count()
display(eventCountsDF)
cityPurchaseQuantitiesDF = df.groupBy("geo.state", "geo.city").sum("ecommerce.total_item_quantity")
display(cityPurchaseQuantitiesDF)
Built-in aggregate functions
This is the most common and most flexible approach: it can aggregate multiple values at once and assign aliases to the results, whereas the methods above are too limited for that. Use the grouped data method agg to apply built-in aggregate functions. This allows you to apply other transformations to the resulting columns, such as alias.
from pyspark.sql.functions import avg, approx_count_distinct
stateAggregatesDF = df.groupBy("geo.state").agg(
avg("ecommerce.total_item_quantity").alias("avg_quantity"),
approx_count_distinct("user_id", 0.01).alias("distinct_users"))
display(stateAggregatesDF)
http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.approx_count_distinct.html#pyspark.sql.functions.approx_count_distinct