Implementing Merge in Pandas

花明 2021-09-28

How to merge DataFrames in Pandas

Pandas merge is the counterpart of a SQL join: it links different tables into one table by matching on a key.

Syntax of merge:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)

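  • left, right: the DataFrames (or named Series) to merge
  • how: the join type, one of 'left', 'right', 'outer', 'inner' (default 'inner')
  • on: the join key; it must exist in both left and right
  • left_on: the key in the left DataFrame or Series
  • right_on: the key in the right DataFrame or Series
  • left_index, right_index: join on the index instead of an ordinary column
  • suffixes: a pair of suffixes; when non-key columns share a name, the suffixes are appended automatically, default ('_x', '_y')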
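The key-related parameters above can be sketched as follows. This is a minimal illustration, assuming two small made-up tables (`students` and `ages`, with the hypothetical column name `student_no`) that are not part of the original article.

```python
import pandas as pd

# Hypothetical tables: the key column has a different name on each side.
students = pd.DataFrame({"sno": [11, 12, 13],
                         "name": ["name_a", "name_b", "name_c"]})
ages = pd.DataFrame({"student_no": [11, 12, 14],
                     "age": [21, 22, 24]})

# left_on / right_on name the key column on each side; both key columns are kept.
by_column = pd.merge(students, ages, left_on="sno", right_on="student_no", how="inner")
print(by_column)

# right_index=True joins the left column "sno" against the index of the right table.
ages_indexed = ages.set_index("student_no")
by_index = pd.merge(students, ages_indexed, left_on="sno", right_index=True, how="inner")
print(by_index)
```

When the key has the same name on both sides, `on="sno"` is the shorter spelling and keeps a single key column.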

Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

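Key points of this chapter

  1. A join example on a movie dataset
  2. Understanding the one-to-one, one-to-many, and many-to-many row-count relationships in merge
  3. Understanding the differences between left join, right join, inner join, and outer join
  4. What to do when non-key columns share a name

I. A join example on a movie dataset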
import pandas as pd
df_ratings = pd.read_csv(
    r"D:\node\nd\Pandas_study\pandas_test\ratings.dat",
    sep="::",
    engine='python',
    names="UserID::MovieID::Rating::Timestamp".split("::")
)

ratings = df_ratings.head()
print(ratings)

df_users = pd.read_csv(
    r"D:\node\nd\Pandas_study\pandas_test\users.dat",
    sep="::",
    engine='python',
    names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)

users = df_users.head()
print(users)
df_movies = pd.read_csv(
    r"D:\node\nd\Pandas_study\pandas_test\movies.dat",
    sep="::",
    engine='python',
    names="MovieID::Title::Genres".split("::")
)

movies = df_movies.head()
print(movies)

1. Join the ratings data with the user data
df_ratings_user = pd.merge(
    df_ratings,df_users,left_on="UserID",right_on="UserID",how = "inner"
)
print(df_ratings_user.head())

2. Join the resulting df_ratings_user table with the movies table
df_ratings_user_movie = pd.merge(
    df_ratings_user,df_movies,left_on="MovieID",right_on="MovieID",how="inner"
)
print(df_ratings_user_movie.head())
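The two-step join above depends on local `.dat` files. As a rough, file-free sketch of the same chain, the tiny in-memory tables below (made-up values, same key column names) show how the row count of the ratings table is preserved when each rating matches exactly one user and one movie.

```python
import pandas as pd

# Made-up stand-ins for the ratings, users, and movies tables.
ratings = pd.DataFrame({"UserID": [1, 1, 2],
                        "MovieID": [10, 20, 10],
                        "Rating": [5, 3, 4]})
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})
movies = pd.DataFrame({"MovieID": [10, 20], "Title": ["Movie A", "Movie B"]})

# Step 1: attach user attributes to every rating.
ratings_users = pd.merge(ratings, users, on="UserID", how="inner")

# Step 2: attach movie attributes to the result of step 1.
full = pd.merge(ratings_users, movies, on="MovieID", how="inner")
print(full)
```

Because the key name is identical on both sides, `on="UserID"` here behaves like passing the same name to both `left_on` and `right_on`.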

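II. Understanding the one-to-one, one-to-many, and many-to-many row-count relationships in merge

The following relationships need to be understood correctly:

  • one-to-one: the join key is unique on both sides
    • e.g. (student number, name) merge (student number, age)
    • Result row count per key: 1*1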


left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
print(left)
right = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'age': ['21', '22', '23', '24']
                    })
print(right)
a = pd.merge(
    left,right,on="sno"
)

print(a)
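If the one-to-one assumption matters, the `validate` parameter from the signature can check it. The sketch below uses a hypothetical `dup_right` table with a repeated key to show the failure case.

```python
import pandas as pd

left = pd.DataFrame({"sno": [11, 12, 13, 14],
                     "name": ["name_a", "name_b", "name_c", "name_d"]})
right = pd.DataFrame({"sno": [11, 12, 13, 14],
                      "age": [21, 22, 23, 24]})

# Keys are unique on both sides, so the check passes and 4 rows come back.
checked = pd.merge(left, right, on="sno", validate="one_to_one")
print(len(checked))

# Hypothetical right table with a repeated key: the check should fail.
dup_right = pd.DataFrame({"sno": [11, 11, 12], "age": [21, 21, 22]})
try:
    pd.merge(left, dup_right, on="sno", validate="one_to_one")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("not one-to-one:", exc)
```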

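  • one-to-many: the key is unique on the left but not on the right
    • e.g. (student number, name) merge (student number, [Chinese grade, math grade, English grade])
    • Result row count per key: 1*N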


left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
print(left)

right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
print(right)
a = pd.merge(
    left,right,on="sno"
)
print(a)
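One way to see the 1*N relationship is to count the result rows per key. The sketch below mirrors the example above with English stand-in grade labels (an assumption for readability); it also shows that student 14, who has no grades, drops out of the default inner join.

```python
import pandas as pd

left = pd.DataFrame({"sno": [11, 12, 13, 14],
                     "name": ["name_a", "name_b", "name_c", "name_d"]})
right = pd.DataFrame({"sno": [11, 11, 11, 12, 12, 13],
                      "grade": ["Chinese 88", "Math 90", "English 75",
                                "Chinese 66", "Math 55", "English 29"]})

merged = pd.merge(left, right, on="sno")

# Each left row is repeated once per matching right row.
rows_per_student = merged.groupby("sno").size().to_dict()
print(rows_per_student)

# Keys with no match on the right are absent from an inner join.
print(14 in merged["sno"].values)
```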

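  • many-to-many: the key is unique on neither side
    • e.g. (student number, [Chinese grade, math grade, English grade]) merge (student number, [basketball, football, table tennis])
    • Result row count per key: M*N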


left = pd.DataFrame({'sno': [11, 11, 12, 12,12],
                      '爱好': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
                    })
print(left)
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
print(right)
a = pd.merge(
    left,right,on="sno"
)
print(a)
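The M*N rule can be checked by multiplying the per-key counts of both sides and summing them. This is a sketch with English stand-in values (the column name `hobby` and the labels are assumptions, not the article's data).

```python
import pandas as pd

left = pd.DataFrame({"sno": [11, 11, 12, 12, 12],
                     "hobby": ["basketball", "badminton", "table tennis",
                               "basketball", "football"]})
right = pd.DataFrame({"sno": [11, 11, 11, 12, 12, 13],
                      "grade": ["Chinese 88", "Math 90", "English 75",
                                "Chinese 66", "Math 55", "English 29"]})

merged = pd.merge(left, right, on="sno")

# For each key: (rows on the left) * (rows on the right); missing keys count as 0.
left_counts = left["sno"].value_counts()
right_counts = right["sno"].value_counts()
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

print(expected_rows, len(merged))
```

Key 11 contributes 2*3 rows and key 12 contributes 3*2 rows; key 13 exists only on the right, so it contributes nothing to an inner join.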

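III. Understanding the differences between left join, right join, inner join, and outer join

3-1 inner join (the default): only keys present on both sides appear in the result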
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
print(left)
right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'C': ['C0', 'C1', 'C4', 'C5'],
                      'D': ['D0', 'D1', 'D4', 'D5']})

print(right)
a = pd.merge(
    left,right,how="inner"
)
print(a)

3-2 left join: every row from the left appears in the result; where the right has no match, its columns are null
b = pd.merge(
    left,right,how="left"
)
print(b)

3-3 right join: every row from the right appears in the result; where the left has no match, its columns are null
c = pd.merge(
    left,right,how="right"
)
print(c)

3-4 outer join: rows from both the left and the right appear in the result; where there is no match, the missing side is null
d = pd.merge(
    left,right,how="outer"
)
print(d)
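To see which side each row of an outer join came from, the `indicator` parameter from the signature adds a `_merge` column. A minimal sketch on the same `key` values as above:

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"key": ["K0", "K1", "K4", "K5"],
                      "C": ["C0", "C1", "C4", "C5"]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
print(merged)

# Count rows by origin, and list the keys found only on the left.
source_counts = merged["_merge"].astype(str).value_counts().to_dict()
left_only_keys = sorted(merged.loc[merged["_merge"] == "left_only", "key"])
print(source_counts, left_only_keys)
```

Rows marked `both` are what an inner join would return; adding the `left_only` rows gives the left join, and adding the `right_only` rows gives the right join.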

IV. What to do when non-key columns share a name
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'A': ['A10', 'A11', 'A12', 'A13'],
                      'D': ['D0', 'D1', 'D4', 'D5']})
print(left)
print(right)
a = pd.merge(
    left,right,on="key"
)
print(a)

b = pd.merge(
    # suffixes sets the suffixes appended to the duplicated column names
    left,right,on="key",suffixes=("_left","_right")
)
print(b)
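The effect on the column names can be checked directly. The sketch below reuses the same two tables and compares the default suffixes with custom ones.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key": ["K0", "K1", "K4", "K5"],
                      "A": ["A10", "A11", "A12", "A13"],
                      "D": ["D0", "D1", "D4", "D5"]})

# Default suffixes: the clashing column "A" becomes "A_x" (left) and "A_y" (right).
default_cols = list(pd.merge(left, right, on="key").columns)

# Custom suffixes: only the clashing column is renamed; "B" and "D" stay as they are.
custom_cols = list(pd.merge(left, right, on="key",
                            suffixes=("_left", "_right")).columns)

print(default_cols)
print(custom_cols)
```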
