[Python Data Analysis (8)] The pandas DataFrame data structure: selecting rows and columns, indexing (slicing)


Note:
A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that all share the same index.
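
The "dict of Series" view can be made concrete with a small sketch. The names `col_a`, `col_b` and `frame` and the values are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values).
index = ['one', 'two', 'three']
col_a = pd.Series([1.0, 2.0, 3.0], index=index)
col_b = pd.Series([10.0, 20.0, 30.0], index=index)

# A dict of Series becomes a DataFrame: keys -> columns, shared index -> rows.
frame = pd.DataFrame({'a': col_a, 'b': col_b})

# Each column pulled back out is a Series carrying the same row index.
print(type(frame['a']))
print(frame['a'].index.equals(frame['b'].index))
```

Selecting a column with `frame['a']` is therefore much like looking up a key in the dict.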

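1. Selecting columns

1.1 df[] is generally used to select columns. It can also select rows, but column selection is the default

import numpy as np
import pandas as pd

# 3x4 DataFrame of random floats in [0, 100) with labelled rows and columns
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)

data1 = df['a']          # a single column name returns a Series
data2 = df[['b','c']]    # a list of column names returns a DataFrame
print(data1)
print(data2)

--> Output: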

               a          b          c          d
one    58.508966  95.955052  21.001119  11.598748
two    39.940444   4.822591  63.117561  24.915640
three  10.141366  42.279737  81.585248  99.513415

one      58.508966
two      39.940444
three    10.141366
Name: a, dtype: float64

               b          c
one    95.955052  21.001119
two     4.822591  63.117561
three  42.279737  81.585248

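1.2 df[] can also select rows when given a slice. This is uncommon; dedicated row-selection methods are covered below

1.3 df[] cannot select a row by a single index label: df['one'] here is looked up among the column names and raises a KeyError

data3 = df[:1]    # a slice acts on rows; returns the first row as a DataFrame
print(data3)
print(type(data3))

--> Output: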

             a          b          c          d
one  58.508966  95.955052  21.001119  11.598748

<class 'pandas.core.frame.DataFrame'>
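
The two rules in 1.2 and 1.3 can be checked side by side. This is a minimal sketch; the frame `demo` and its integer values are assumptions chosen so the result is reproducible.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['one', 'two', 'three'],
                    columns=['a', 'b', 'c', 'd'])

# A single string is looked up among the column names, so a row label fails.
try:
    demo['one']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

# A slice, in contrast, is applied to the rows and returns a DataFrame.
first_row = demo[:1]

print(row_lookup_failed)
print(first_row.shape)
```

Because df[] switches meaning depending on what is passed in, the explicit row selectors are usually easier to read.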

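2. Selecting rows

2.1 df.loc[] - select rows by index label

df.loc[label] selects rows by index label. It works with both a custom index and the default integer index.

2.1.1 First, create the DataFrames

# df1 has string row labels; df2 falls back to the default integer index 0..3
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df2)

--> Output: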

               a          b          c          d
one    32.739293  74.631681  57.738041  64.283459
two    49.329576  96.607287  37.576970  21.803517
three  62.766459  49.264659  71.193031  22.111200
four   48.914713  84.778627  49.706254   7.874963
           a          b          c          d
0  79.514782  45.871142  57.086445  11.709671
1   3.236386  61.162491  18.101219  38.525494
2  46.595874  13.619774  15.503499   0.832061
3  52.592679  18.123406  54.248833  59.938835

2.1.2 Single-label indexing (by string label or by default integer label); returns a Series

data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)

--> Output: (the Series name is the label of the selected row)

a    32.739293
b    74.631681
c    57.738041
d    64.283459
Name: one, dtype: float64
a     3.236386
b    61.162491
c    18.101219
d    38.525494
Name: 1, dtype: float64

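2.1.3 Multi-label indexing (the labels may be given in any order). In old pandas versions a label missing from the index returned a row of NaN; since pandas 1.0, df.loc[] raises a KeyError instead, and reindex() gives the NaN-filled result

# 'five' is not in df1's index: reindex() fills its row with NaN,
# whereas df1.loc[['two','three','five']] raises KeyError in pandas >= 1.0
data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)

--> Output: (depends on the pandas version, see above)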

               a          b          c          d
two    49.329576  96.607287  37.576970  21.803517
three  62.766459  49.264659  71.193031  22.111200
five         NaN        NaN        NaN        NaN
           a          b          c          d
3  52.592679  18.123406  54.248833  59.938835
2  46.595874  13.619774  15.503499   0.832061
1   3.236386  61.162491  18.101219  38.525494
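
Because the handling of missing labels depends on the pandas version, it can help to avoid relying on it. The sketch below shows two version-independent options; the frame `sample` and the list `wanted` are illustrative assumptions.

```python
import pandas as pd

sample = pd.DataFrame({'a': [1.0, 2.0, 3.0]}, index=['one', 'two', 'three'])
wanted = ['two', 'three', 'five']          # 'five' is not in the index

# Option 1: reindex() keeps every requested label and fills the gap with NaN.
filled = sample.reindex(wanted)

# Option 2: keep only the labels that exist, then use .loc as usual.
present = [label for label in wanted if label in sample.index]
existing_only = sample.loc[present]

print(filled)
print(existing_only)
```

Which option fits depends on whether a missing label should show up as an empty row or be dropped.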

2.1.4 Slice indexing; the end label is included

data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]    # labels 1, 2 and 3, not positions
print(data5)
print(data6)

--> Output:

               a          b          c          d
one    32.739293  74.631681  57.738041  64.283459
two    49.329576  96.607287  37.576970  21.803517
three  62.766459  49.264659  71.193031  22.111200
           a          b          c          d
1   3.236386  61.162491  18.101219  38.525494
2  46.595874  13.619774  15.503499   0.832061
3  52.592679  18.123406  54.248833  59.938835

2.2 df.iloc[] - select rows by integer position

Works like list indexing: positions follow the row order of the DataFrame, from 0 to length-1 along the axis.

2.2.1 First, create the DataFrame

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)

--> Output:

               a          b          c          d
one    64.196153   3.181391  71.407232  66.672682
two    46.100913  51.140302  92.888548  12.207747
three  55.724660  28.906997  21.150581   6.250792
four   80.663114  36.770303  88.255988  21.949060

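2.2.2 Single-position indexing. Unlike loc[], iloc[] takes integer positions rather than labels, and a position beyond the number of rows, such as .iloc[4] below, raises an IndexError

print(df.iloc[0])     # first row
print(df.iloc[-1])    # last row; negative positions count from the end
# print(df.iloc[4])   # IndexError: single positional indexer is out-of-bounds

--> Output: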

a    64.196153
b     3.181391
c    71.407232
d    66.672682
Name: one, dtype: float64
a    80.663114
b    36.770303
c    88.255988
d    21.949060
Name: four, dtype: float64
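
The out-of-range case can be demonstrated without crashing the script by catching the exception. This is a sketch under assumptions: the frame `demo` and its values are made up, and the slice behaviour shown follows the usual Python list convention.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['one', 'two', 'three', 'four'],
                    columns=['a', 'b', 'c', 'd'])

# A single out-of-range position raises IndexError.
try:
    demo.iloc[4]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

# A slice that runs past the end is simply truncated, as with a Python list.
tail = demo.iloc[2:10]

print(out_of_bounds)
print(list(tail.index))
```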

2.2.3 Multi-position indexing; the positions may be given in any order

print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])

--> Output:

               a          b          c          d
one    64.196153   3.181391  71.407232  66.672682
three  55.724660  28.906997  21.150581   6.250792
               a          b          c          d
four   80.663114  36.770303  88.255988  21.949060
three  55.724660  28.906997  21.150581   6.250792
two    46.100913  51.140302  92.888548  12.207747

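2.2.4 Slice indexing; the end position is excluded (note the difference from loc[] above)

print(df.iloc[1:3])    # positions 1 and 2
print(df.iloc[::2])    # every second row

--> Output: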

               a          b          c          d
two    46.100913  51.140302  92.888548  12.207747
three  55.724660  28.906997  21.150581   6.250792
               a          b          c          d
one    64.196153   3.181391  71.407232  66.672682
three  55.724660  28.906997  21.150581   6.250792
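
The difference between the two slicing rules is easiest to see when both are applied to the same data. A minimal sketch follows; the frames `labelled` and `numbered` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

labelled = pd.DataFrame(np.arange(16).reshape(4, 4),
                        index=['one', 'two', 'three', 'four'],
                        columns=['a', 'b', 'c', 'd'])
numbered = pd.DataFrame(np.arange(16).reshape(4, 4),
                        columns=['a', 'b', 'c', 'd'])

# Label slice: both 'two' and 'three' are returned.
by_label = labelled.loc['two':'three']
# Position slice: positions 1 and 2 are returned, position 3 is not.
by_position = labelled.iloc[1:3]

# With the default integer index the same-looking slice differs in length.
loc_rows = len(numbered.loc[1:3])    # labels 1, 2, 3
iloc_rows = len(numbered.iloc[1:3])  # positions 1, 2

print(by_label.equals(by_position))
print(loc_rows, iloc_rows)
```

With a default integer index, `1:3` means three rows under loc[] and two rows under iloc[], which is an easy source of off-by-one mistakes.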

3. Boolean indexing

3.1 Prepare the data

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)

--> Output:

               a          b          c          d
one    38.986549  81.009721  57.779180   6.768009
two    61.818468  24.443819  72.064397  87.910932
three  66.612955  48.643065  36.655897  37.299216
four    3.155591  25.298921   1.175081  49.936492

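3.2 Indexing over the whole DataFrame

b1 = df < 20          # element-wise comparison gives a boolean DataFrame
print(b1,type(b1))
print(df[b1])         # values where b1 is False become NaN
# equivalently: df[df < 20]

--> Output: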

           a      b      c      d
one    False  False  False   True
two    False  False  False  False
three  False  False  False  False
four    True  False   True  False <class 'pandas.core.frame.DataFrame'>

              a   b         c         d
one         NaN NaN       NaN  6.768009
two         NaN NaN       NaN       NaN
three       NaN NaN       NaN       NaN
four   3.155591 NaN  1.175081       NaN
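
Indexing with a boolean DataFrame keeps the shape and blanks out the non-matching cells, which is the same result `DataFrame.where` gives. The sketch below checks this on a small frame; `values` and its numbers are assumptions for illustration.

```python
import pandas as pd

values = pd.DataFrame([[5.0, 30.0], [40.0, 10.0]],
                      index=['one', 'two'], columns=['a', 'b'])

mask = values < 20                 # boolean DataFrame, same shape as values
masked = values[mask]              # cells where mask is False become NaN
same_as_where = masked.equals(values.where(mask))

# count() ignores NaN, so this is the number of cells that passed the test.
kept = int(masked.count().sum())

print(masked)
print(same_as_where, kept)
```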

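3.3 Indexing by a condition on a single column (selects rows)

b2 = df['a'] > 50     # boolean Series, one value per row
print(b2,type(b2))
print(df[b2])         # keeps the whole rows where b2 is True
# equivalently: df[df['a'] > 50]

--> Output: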

one      False
two       True
three     True
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>

               a          b          c          d
two    61.818468  24.443819  72.064397  87.910932
three  66.612955  48.643065  36.655897  37.299216
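
A boolean Series selects whole rows, so several column conditions can be combined into one row filter. This goes slightly beyond the example above and is only a sketch; the frame `scores` and the thresholds are assumptions. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

scores = pd.DataFrame({'a': [38.0, 61.0, 66.0, 3.0],
                       'b': [81.0, 24.0, 48.0, 25.0]},
                      index=['one', 'two', 'three', 'four'])

high_a = scores['a'] > 50          # True for 'two' and 'three'
low_b = scores['b'] < 30           # True for 'two' and 'four'

both = scores[high_a & low_b]      # rows meeting both conditions
either = scores[high_a | low_b]    # rows meeting at least one condition

# For a boolean Series, df[mask] and df.loc[mask] select the same rows.
same_rows = scores[high_a].equals(scores.loc[high_a])

print(list(both.index))
print(list(either.index))
print(same_rows)
```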

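3.4 Indexing by a condition on multiple columns

b3 = df[['a','b']] > 50    # the mask covers columns a and b only
print(b3,type(b3))
print(df[b3])              # columns missing from the mask (c, d) become NaN
# equivalently: df[df[['a','b']] > 50]

--> Output: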

           a      b
one    False   True
two     True  False
three   True  False
four   False  False <class 'pandas.core.frame.DataFrame'>

               a          b   c   d
one          NaN  81.009721 NaN NaN
two    61.818468        NaN NaN NaN
three  66.612955        NaN NaN NaN
four         NaN        NaN NaN NaN

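3.5 Indexing by a condition on multiple rows

b4 = df.loc[['one','three']] < 50    # the mask covers rows one and three only
print(b4,type(b4))
print(df[b4])                        # rows missing from the mask become NaN
# equivalently: df[df.loc[['one','three']] < 50]

--> Output: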

           a      b      c     d
one     True  False  False  True
three  False   True   True  True <class 'pandas.core.frame.DataFrame'>

               a          b          c          d
one    38.986549        NaN        NaN   6.768009
two          NaN        NaN        NaN        NaN
three        NaN  48.643065  36.655897  37.299216
four         NaN        NaN        NaN        NaN
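
When the boolean mask covers only some of the rows, pandas aligns it with the full DataFrame and treats every row absent from the mask as False. The sketch below makes that visible with rounded stand-ins for the values above; `frame` and `partial_mask` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame([[39.0, 81.0, 58.0, 7.0],
                      [62.0, 24.0, 72.0, 88.0],
                      [67.0, 49.0, 37.0, 37.0],
                      [3.0, 25.0, 1.0, 50.0]],
                     index=['one', 'two', 'three', 'four'],
                     columns=['a', 'b', 'c', 'd'])

# The mask is computed for two rows only.
partial_mask = frame.loc[['one', 'three']] < 50
result = frame[partial_mask]

# Row 'four' holds values below 50, yet it is all NaN in the result
# because it never appeared in the mask.
four_blank = bool(result.loc['four'].isna().all())
kept = int(result.count().sum())

print(result)
print(four_blank, kept)
```

If rows outside the mask should be tested as well, the condition has to be computed on the full DataFrame instead.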

