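Hive's row_number function comes up often, for example to get a user's nth purchase or their first n purchases. How can the same thing be done in Python? An example shows it most directly.
The records below are purchases by two users, a and b; user is the user name and amount is the amount spent. The goal is to group by user, sort by amount in descending order within each group, and add a column holding the sequence number.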
import pandas as pd
df = pd.DataFrame({'user':['a','a','a','b','b'],'amount':[11,11,31,32,42]})
  | user | amount |
0 | a | 11 |
1 | a | 11 |
2 | a | 31 |
3 | b | 32 |
4 | b | 42 |
Method 1
import pandas as pd

def row_number(df, groupbyKey, rankCol, ascending=False):
    def sub_row_number(df, rankCol, ascending=False):
        """
        df: the rows of one group
        rankCol: the column to sort by
        ascending: sort order within the group
        Returns the sorted group with a zero-based 'row' column.
        """
        temp_data = df.sort_values(by=rankCol, ascending=ascending)
        temp_data['row'] = range(len(df))
        return temp_data

    result = df.groupby([groupbyKey]).apply(sub_row_number,
                                            rankCol=rankCol,
                                            ascending=ascending)
    # Replace the index produced by groupby/apply with 0..n-1
    result.index = range(len(result))
    return result

row_number(df, groupbyKey='user', rankCol="amount", ascending=True)
  | user | amount | row |
0 | a | 11 | 0 |
1 | a | 11 | 1 |
2 | a | 31 | 2 |
3 | b | 32 | 0 |
4 | b | 42 | 1 |
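The same result can probably be reached without apply. The sketch below is only one possible variant, not part of the original method: it sorts the whole frame once and numbers the rows of each group with groupby().cumcount(). The function name row_number_cumcount and its parameter names are made up for illustration, and it adds 1 so that numbering starts at 1 as in Hive, whereas the output above starts at 0.

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b'],
                   'amount': [11, 11, 31, 32, 42]})

def row_number_cumcount(df, groupby_key, rank_col, ascending=False):
    # Hypothetical helper: sort by group key first, then by the ranking column
    ordered = df.sort_values(by=[groupby_key, rank_col],
                             ascending=[True, ascending]).reset_index(drop=True)
    # cumcount() numbers the rows of each group from 0; add 1 to start at 1
    ordered['row'] = ordered.groupby(groupby_key).cumcount() + 1
    return ordered

result = row_number_cumcount(df, 'user', 'amount', ascending=True)
print(result)
```

Unlike Method 1, this returns a new sorted frame and leaves the input df untouched.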
Method 2
def row_number(df, groupbyKey, rankCol, ascending=True):
    # method="first" breaks ties by order of appearance, so every row gets a distinct rank
    df["rw"] = df.groupby(groupbyKey)[rankCol].rank(method="first", ascending=ascending)
    return df

row_number(df, groupbyKey='user', rankCol="amount", ascending=False)
  | user | amount | rw |
0 | a | 11 | 2.0 |
1 | a | 11 | 3.0 |
2 | a | 31 | 1.0 |
3 | b | 32 | 2.0 |
4 | b | 42 | 1.0 |
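Once each row has a rank, picking the top n rows or the nth row per user is just a filter on that rank. The following is a minimal sketch under the assumption that purchases are ranked by amount in descending order, as in the call above; the variable names rw, top2 and second are illustrative only. Since rank() returns floats, the sketch casts the result to int.

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b'],
                   'amount': [11, 11, 31, 32, 42]})

# Same ranking as Method 2, cast to int because rank() yields floats
rw = (df.groupby('user')['amount']
        .rank(method='first', ascending=False)
        .astype(int))

top2 = df[rw <= 2]    # the two largest purchases of each user
second = df[rw == 2]  # the second-largest purchase of each user

print(top2)
print(second)
```

The boolean masks align on the original index, so the selected rows keep their original positions and df itself is not modified.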
2018-12-105, Zidong Pioneer Park, Qixia District, Nanjing