How do I output the differences between two Pandas DataFrames side by side?

elvinyang 2022-07-18 74 views


I am trying to highlight exactly what changed between two DataFrames.

Suppose I have two Python Pandas DataFrames:

"StudentRoster Jan-1":
id   Name  score  isEnrolled  Comment
111  Jack  2.17   True        He was late to class
112  Nick  1.11   False       Graduated
113  Zoe   4.12   True

"StudentRoster Jan-2":
id   Name  score  isEnrolled  Comment
111  Jack  2.17   True        He was late to class
112  Nick  1.21   False       Graduated
113  Zoe   4.12   False       On vacation

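My goal is to output an HTML table that:

  1. Identifies the rows that have changed (the values may be int, float, boolean, or string)
  2. Outputs those rows with their unchanged, old, and new values (ideally in an HTML table), so the user can clearly see what changed between the two DataFrames:

"StudentRoster Difference Jan-1 - Jan-2":
id   Name  score                isEnrolled            Comment
112  Nick  was 1.11 | now 1.21  False                 Graduated
113  Zoe   4.12                 was True | now False  was "" | now "On vacation"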

 

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool
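To make the step above reproducible, here is a minimal sketch that rebuilds the two rosters and computes the same row mask. The frame construction is my own assumption about how the data is stored; in particular, Zoe's empty Comment is written as "" here, whereas the original frames may hold None.

```python
import pandas as pd

# Assumed reconstruction of the rosters from the question.
# Zoe's missing Comment is stored as "" (an assumption, not from the source).
df1 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", ""],
})
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})

# True for every row in which at least one cell differs.
ne = (df1 != df2).any(axis=1)

# Positional labels of the rows that changed.
changed_rows = ne[ne].index.tolist()
print(changed_rows)  # [1, 2]
```

Note that != treats NaN as different from NaN, so rows with missing values in the same cell of both frames would also be flagged; filling missing values first is one way around that.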

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

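Here the first index level is the row label and the second is the column that changed. From those locations we can pull out the old and new values: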

# assumes: import numpy as np; import pandas as pd

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
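The from/to table lists the changes, but the question asks for a side-by-side "was ... | now ..." layout rendered as HTML. The following is one possible sketch of that last step, not part of the original answer. It assumes the frames are indexed by id, that Zoe's empty Comment is stored as "", and the helper name fmt is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the rosters, indexed by student id.
old = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", ""],
}).set_index("id")
new = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
}).set_index("id")


def fmt(value):
    """Hypothetical helper: quote strings, show other values as-is."""
    return f'"{value}"' if isinstance(value, str) else str(value)


# Cell-by-cell mask of what changed (both frames share index and columns).
mask = old != new

# Start from the old values and overwrite only the changed cells.
report = old.astype(object)
rows, cols = np.where(mask.to_numpy())
for r, c in zip(rows.tolist(), cols.tolist()):
    report.iat[r, c] = f"was {fmt(old.iat[r, c])} | now {fmt(new.iat[r, c])}"

# Keep only the rows with at least one change, then render as HTML.
report = report.loc[mask.any(axis=1)]
html = report.to_html()
print(report)
```

Unchanged cells keep their original value, so each remaining row shows the same, old, and new values together, which is the layout the question describes.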

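* Note: this relies on df1 and df2 sharing the same index. To avoid that ambiguity, you can restrict the comparison to the labels both frames share, df1.index.intersection(df2.index) (written as df1.index & df2.index in older pandas versions).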
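As a rough illustration of that note, the sketch below compares only the shared labels. The two small frames are hypothetical and exist only to show indexes that partly overlap; comparing them directly with != would fail because their labels differ.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.21, 4.12, 3.50]}, index=[112, 113, 114])

# Labels present in both frames.
shared = df1.index.intersection(df2.index)

# Compare only the shared rows, so both sides are identically labeled.
ne = (df1.loc[shared] != df2.loc[shared]).any(axis=1)
print(ne)
```

Rows that exist in only one frame (111 and 114 here) are left out of the comparison, so they would need to be reported separately as added or removed.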

 
