What the Pandas index is for - CFANZ编程社区

What the Pandas index is for

Data stored in an ordinary column can also be used for lookups, so what is the benefit of using an index?

Summary of what the index is for:

More convenient data lookups;
Better lookup performance;
Automatic data alignment;
Support for more, and more powerful, data structures;
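
As a small illustration of the "automatic data alignment" point, the sketch below adds two Series whose labels only partly overlap. The Series names and values are made up for this example and do not come from ratings.csv.

```python
import pandas as pd

# Two Series whose labels overlap but appear in a different order.
# The values are invented purely for illustration.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Addition matches values by index label, not by position.
# Labels present on only one side ("a" and "d") produce NaN.
total = s1 + s2
print(total)
```

Here "b" and "c" are summed with their counterparts from the other Series even though they sit at different positions, which is what alignment by index means.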

import pandas as pd
from sklearn.utils import shuffle

fpath = r"D:\node\nd\Pandas_study\pandas_test\ratings.csv"
df = pd.read_csv(fpath)
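# inplace: whether to modify df in place instead of returning a new DataFrame.
# drop: whether the userId column is removed once it becomes the index;
# False keeps the column, True removes it.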
df.set_index("userId",inplace = True,drop = False)
print(df)

1. Querying data with the index

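# Both forms select the same rows: the first looks up the index label,
# the second builds a boolean mask over the userId column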
c = df.loc[500].head()
d = df.loc[df["userId"] == 500].head()
print(d)
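
The equivalence of the two lookups can be checked on a tiny frame. This is a minimal sketch: the `ratings` DataFrame below is a hypothetical stand-in for ratings.csv with made-up values.

```python
import pandas as pd

# Hypothetical miniature of ratings.csv; the values are made up.
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "movieId": [10, 20, 10, 30, 40],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    }
)
ratings.set_index("userId", inplace=True, drop=False)

by_label = ratings.loc[3]                      # lookup through the index
by_mask = ratings.loc[ratings["userId"] == 3]  # boolean mask over a column
print(by_label.equals(by_mask))

# Caveat: a label that occurs only once comes back as a Series (one row),
# while wrapping the label in a list always gives a DataFrame.
single_row = ratings.loc[2]
single_frame = ratings.loc[[2]]
print(type(single_row).__name__, type(single_frame).__name__)
```

The index form is shorter to write, and the list form `loc[[label]]` is a convenient way to keep the result type stable when a label may occur only once.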

2. Using the index improves lookup performance

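If the index is unique, Pandas uses a hash table and a lookup is O(1);
If the index is not unique but is sorted, Pandas uses binary search and a lookup is O(log N);
If the index is neither unique nor sorted (completely random order), every lookup has to scan the whole table and is O(N);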
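
The three cases above can be told apart with `Index.is_unique` and `Index.is_monotonic_increasing`. The sketch below builds one synthetic Series per case and times repeated lookups; the sizes, names and random seed are assumptions chosen for illustration, and the measured times will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)

# Case 1: unique labels.
unique_idx = pd.Series(np.arange(n), index=np.arange(n))
# Case 2: duplicated labels, sorted.
labels = np.sort(rng.integers(0, n // 10, size=n))
sorted_dup = pd.Series(np.arange(n), index=labels)
# Case 3: the same duplicated labels in random order.
shuffled_dup = sorted_dup.sample(frac=1, random_state=0)

cases = [
    ("unique", unique_idx),
    ("sorted duplicates", sorted_dup),
    ("shuffled duplicates", shuffled_dup),
]
for name, s in cases:
    key = s.index[0]  # a label that is guaranteed to exist
    seconds = timeit.timeit(lambda: s.loc[key], number=200)
    print(
        f"{name:>20}: unique={s.index.is_unique}, "
        f"increasing={s.index.is_monotonic_increasing}, "
        f"200 lookups took {seconds:.4f}s"
    )
```

On a typical run the shuffled case should be noticeably slower than the other two, but treat the numbers as a rough indication rather than a benchmark.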

2-1 Lookups on an index in completely random order

from sklearn.utils import shuffle
# Shuffle the rows of df randomly
df_shuffle = shuffle(df)
print(df_shuffle.head())
# Output
#         userId  movieId  rating   timestamp
#userId                                     
#181        181      185     4.0   845469472
#157        157     1961     4.0   992479546
#121        121      272     1.0   847656374
#525        525   122886     4.0  1476475870
#45          45     3555     4.0  1057007329

# Is the index monotonically increasing?
print("Index is increasing:", df_shuffle.index.is_monotonic_increasing)
# Index is increasing: False
print("Index is unique:", df_shuffle.index.is_unique)
# Index is unique: False

2-2 Sorting the index

# Shuffle the rows of df randomly
df_shuffle = shuffle(df)
# Sort by the index
a = df_shuffle.sort_index()
print(a.head())
#        userId  movieId  rating  timestamp
#userId                                    
#1            1     2858     5.0  964980868
#1            1     1208     4.0  964983250
#1            1      423     3.0  964982363
#1            1     1927     5.0  964981497
#1            1     1552     4.0  964982620
b = a.index.is_monotonic_increasing
print("Index is increasing:", b)
# Index is increasing: True
print("Index is unique:", a.index.is_unique)
# Index is unique: False
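
The shuffle-then-sort steps can also be reproduced without scikit-learn. The sketch below swaps in `DataFrame.sample(frac=1)` for `sklearn.utils.shuffle` and uses a hypothetical six-row frame (`df_small`, made-up values) instead of ratings.csv.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv; the values are made up.
df_small = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3, 3],
        "movieId": [10, 20, 10, 30, 20, 40],
        "rating": [4.0, 3.0, 5.0, 2.5, 4.5, 1.0],
    }
).set_index("userId", drop=False)

# DataFrame.sample(frac=1) returns every row in random order,
# used here in place of sklearn.utils.shuffle.
shuffled = df_small.sample(frac=1, random_state=42)
restored = shuffled.sort_index()

print(shuffled.index.is_monotonic_increasing)
print(restored.index.is_monotonic_increasing)  # True after sorting
print(restored.index.is_unique)                # False: each userId appears twice

# The same rows come back for a given userId, whatever the row order.
same_rows = set(shuffled.loc[2, "movieId"]) == set(restored.loc[2, "movieId"])
print(same_rows)
```

Sorting changes only how quickly Pandas can locate a label, not which rows a lookup returns, so `sort_index()` is a cheap step to take before many repeated `loc` lookups on a non-unique index.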