[Python Data Analysis (8)] The pandas DataFrame Data Structure: Selecting Rows and Columns, Indexing (Slicing) - CFANZ编程社区

Note:
A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that all share the same row index.
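
The "dict of Series" view can be made concrete with a small sketch. The names and values below (`s_a`, `s_b`, `frame`) are made up for illustration; the only point is that each column is a Series and all columns share one row index.

```python
import pandas as pd

# Hypothetical data: two Series that share the same row index
idx = ['one', 'two', 'three']
s_a = pd.Series([1.0, 2.0, 3.0], index=idx)
s_b = pd.Series([10.0, 20.0, 30.0], index=idx)

# Building a DataFrame from a dict of Series: the keys become column labels
frame = pd.DataFrame({'a': s_a, 'b': s_b})

# Each column comes back as a Series carrying the shared row index
col = frame['a']
print(type(col).__name__)             # Series
print(col.index.equals(frame.index))  # True
print(list(frame.columns))            # ['a', 'b']
```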

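1. Selecting columns

1.1 `df[]` is normally used to select columns. It can also select rows when given a slice, but column selection is the default.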

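import numpy as np
import pandas as pd

# 3x4 DataFrame of random values in [0, 100) with explicit row and column labels
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)

data1 = df['a']         # a single column label returns a Series
data2 = df[['b','c']]   # a list of column labels returns a DataFrame
print(data1)
print(data2)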

–> Output:

               a          b          c          d
one    58.508966  95.955052  21.001119  11.598748
two    39.940444   4.822591  63.117561  24.915640
three  10.141366  42.279737  81.585248  99.513415

one      58.508966
two      39.940444
three    10.141366
Name: a, dtype: float64

               b          c
one    95.955052  21.001119
two     4.822591  63.117561
three  42.279737  81.585248
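
The return type depends on whether a single label or a list of labels is passed, which is easy to miss when the list has only one element. The sketch below uses a hypothetical `df_demo` with fixed values (instead of random ones) so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=['one', 'two', 'three'],
                       columns=['a', 'b', 'c', 'd'])

as_series = df_demo['a']     # single label -> Series
as_frame = df_demo[['a']]    # list with one label -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```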

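1.2 `df[]` can also select rows when given a slice. This is possible but uncommon; dedicated row-selection methods are covered below.

1.3 `df[]` cannot select a single row by its index label: here `df['one']` raises a `KeyError`, because a bare label is looked up among the column names.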

data3 = df[:1]
print(data3)
print(type(data3))

–> Output:

             a          b          c          d
one  58.508966  95.955052  21.001119  11.598748

<class 'pandas.core.frame.DataFrame'>
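
A short sketch of how `df[]` treats a bare label versus a slice. `df_demo` is a hypothetical frame with fixed values; the behavior shown is what current pandas versions do, to the best of my knowledge.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=['one', 'two', 'three'],
                       columns=['a', 'b', 'c', 'd'])

# A bare label is looked up among the columns, so a row label fails
try:
    df_demo['one']
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True

# A slice, by position or by label, is applied to the rows instead
first_row = df_demo[:1]
by_label = df_demo['one':'two']   # label slices include the end label
print(list(first_row.index))  # ['one']
print(list(by_label.index))   # ['one', 'two']
```

Even for a single row, the slice form returns a DataFrame rather than a Series.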

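2. Selecting rows

2.1 `df.loc[]` - select rows by index label

`df.loc[label]` selects rows by index label. It works both with an explicitly specified index and with the default integer index.

2.1.1 First, create the DataFrames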

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df2)

–> Output:

               a          b          c          d
one    32.739293  74.631681  57.738041  64.283459
two    49.329576  96.607287  37.576970  21.803517
three  62.766459  49.264659  71.193031  22.111200
four   48.914713  84.778627  49.706254   7.874963
           a          b          c          d
0  79.514782  45.871142  57.086445  11.709671
1   3.236386  61.162491  18.101219  38.525494
2  46.595874  13.619774  15.503499   0.832061
3  52.592679  18.123406  54.248833  59.938835

2.1.2 Single-label indexing (by the named label, or by the integer label when the index is the default one); returns a Series

data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)

–> Output: (the `name` of the returned Series is the index label of the selected row)

a    32.739293
b    74.631681
c    57.738041
d    64.283459
Name: one, dtype: float64
a     3.236386
b    61.162491
c    18.101219
d    38.525494
Name: 1, dtype: float64

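2.1.3 Multi-label indexing; rows are returned in the order the labels are listed. In older pandas versions a label that does not exist produced a row of NaN; since pandas 1.0, `loc` raises a `KeyError` if any listed label is missing, and `reindex` gives the NaN-filled result instead.

# On pandas >= 1.0, df1.loc[['two','three','five']] raises KeyError because 'five' is missing;
# reindex reproduces the older NaN-filling behavior shown in the output below.
data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)

–> Output: (mind the pandas version, as noted above)

               a          b          c          d
two    49.329576  96.607287  37.576970  21.803517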
three  62.766459  49.264659  71.193031  22.111200
five         NaN        NaN        NaN        NaN
           a          b          c          d
3  52.592679  18.123406  54.248833  59.938835
2  46.595874  13.619774  15.503499   0.832061
1   3.236386  61.162491  18.101219  38.525494
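
The version difference around missing labels can be sketched as follows. This assumes pandas 1.0 or later, where a missing label in a list passed to `loc` raises `KeyError`; `df_demo` and `wanted` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4),
                       index=['one', 'two', 'three', 'four'],
                       columns=['a', 'b', 'c', 'd'])

wanted = ['two', 'three', 'five']   # 'five' is not in the index

# Assumption: pandas >= 1.0, where a missing label in a list raises KeyError
try:
    df_demo.loc[wanted]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex returns one row per requested label, NaN-filled when the label is missing
filled = df_demo.reindex(wanted)

# Alternatively, keep only the labels that actually exist
present = [label for label in wanted if label in df_demo.index]
existing = df_demo.loc[present]

print(loc_raised)
print(filled.loc['five'].isna().all())
print(list(existing.index))
```

Which of the two alternatives fits depends on whether the missing rows should appear as NaN placeholders or be dropped.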

2.1.4 Slice indexing; the end label is included

data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print(data5)
print(data6)

–> Output:

               a          b          c          d
one    32.739293  74.631681  57.738041  64.283459
two    49.329576  96.607287  37.576970  21.803517
three  62.766459  49.264659  71.193031  22.111200
           a          b          c          d
1   3.236386  61.162491  18.101219  38.525494
2  46.595874  13.619774  15.503499   0.832061
3  52.592679  18.123406  54.248833  59.938835

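2.2 `df.iloc[]` - select rows by integer position

This works like list indexing: positions follow the DataFrame's row order and run from 0 to length-1 along the axis.

2.2.1 First, create the DataFrame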

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)

–> Output:

               a          b          c          d
one    64.196153   3.181391  71.407232  66.672682
two    46.100913  51.140302  92.888548  12.207747
three  55.724660  28.906997  21.150581   6.250792
four   80.663114  36.770303  88.255988  21.949060

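2.2.2 Single-position indexing. Unlike `loc[]`, which looks up labels, `iloc[]` takes integer positions, and a position beyond the number of rows raises an `IndexError`, such as `.iloc[4]` below.

print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])  # would raise IndexError: there are only 4 rows (positions 0 to 3)

–> Output: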

a    64.196153
b     3.181391
c    71.407232
d    66.672682
Name: one, dtype: float64
a    80.663114
b    36.770303
c    88.255988
d    21.949060
Name: four, dtype: float64
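
The out-of-bounds behavior can be checked directly. In the sketch below (hypothetical `df_demo` with fixed values), a single out-of-range position raises `IndexError`, whereas an out-of-range slice is simply truncated, as with Python lists.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(16).reshape(4, 4),
                       index=['one', 'two', 'three', 'four'],
                       columns=['a', 'b', 'c', 'd'])

# A single position past the last row raises IndexError
try:
    df_demo.iloc[4]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
print(out_of_bounds)  # True

# A slice that runs past the end is truncated instead of failing
tail = df_demo.iloc[2:10]
print(list(tail.index))  # ['three', 'four']

# Negative positions count from the end
print(df_demo.iloc[-1].name)  # four
```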

2.2.3 Multi-position indexing with a list of integers; rows are returned in the order listed

print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])

–> Output:

               a          b          c          d
one    64.196153   3.181391  71.407232  66.672682
three  55.724660  28.906997  21.150581   6.250792
               a          b          c          d
four   80.663114  36.770303  88.255988  21.949060
three  55.724660  28.906997  21.150581   6.250792
two    46.100913  51.140302  92.888548  12.207747

2.2.4 Slice indexing; the end position is excluded (note the difference from `loc[]` above)

print(df.iloc[1:3])
print(df.iloc[::2])

–> Output:

               a          b          c          d
two    46.100913  51.140302  92.888548  12.207747
three  55.724660  28.906997  21.150581   6.250792
               a          b          c          d
one    64.196153   3.181391  71.407232  66.672682
three  55.724660  28.906997  21.150581   6.250792
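
The difference between the two slicing rules is easiest to see side by side. This sketch uses hypothetical frames `df_demo` (string labels) and `df_int` (default integer index) with fixed values.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(16).reshape(4, 4),
                       index=['one', 'two', 'three', 'four'],
                       columns=['a', 'b', 'c', 'd'])

by_label = df_demo.loc['two':'three']  # end label included
by_pos = df_demo.iloc[1:3]             # end position excluded: positions 1 and 2
print(by_label.equals(by_pos))         # True

# With the default integer index the same numbers give different results
df_int = df_demo.reset_index(drop=True)
print(list(df_int.loc[1:3].index))   # [1, 2, 3]  (labels 1 through 3)
print(list(df_int.iloc[1:3].index))  # [1, 2]     (positions 1 and 2)
```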

3 Boolean indexing

3.1 Prepare the data

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)

–> Output:

               a          b          c          d
one    38.986549  81.009721  57.779180   6.768009
two    61.818468  24.443819  72.064397  87.910932
three  66.612955  48.643065  36.655897  37.299216
four    3.155591  25.298921   1.175081  49.936492

3.2 Indexing with a condition on the whole DataFrame: values that fail the condition become NaN

b1 = df < 20
print(b1,type(b1))
print(df[b1])
# Equivalent to df[df < 20]

–> Output:

           a      b      c      d
one    False  False  False   True
two    False  False  False  False
three  False  False  False  False
four    True  False   True  False <class 'pandas.core.frame.DataFrame'>

              a   b         c         d
one         NaN NaN       NaN  6.768009
two         NaN NaN       NaN       NaN
three       NaN NaN       NaN       NaN
four   3.155591 NaN  1.175081       NaN
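
Indexing with a boolean DataFrame keeps the original shape and masks the failing cells, which matches what `DataFrame.where` does. A small sketch with a hypothetical two-row `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [5.0, 30.0], 'b': [25.0, 10.0]},
                       index=['one', 'two'])

mask = df_demo < 20          # boolean DataFrame with the same shape as df_demo
masked = df_demo[mask]       # cells where the mask is False become NaN

print(masked.shape == df_demo.shape)       # True
print(masked.equals(df_demo.where(mask)))  # True
print(int(masked.isna().sum().sum()))      # 2
```

No rows are dropped here; chaining something like `dropna` afterwards is one way to remove the masked cells, depending on what is needed.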

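3.3 Indexing with a condition on a single column (row): the boolean Series keeps the rows where it is True and drops the rest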

b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2])
# Equivalent to df[df['a'] > 50]

–> Output:

one      False
two       True
three     True
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>

               a          b          c          d
two    61.818468  24.443819  72.064397  87.910932
three  66.612955  48.643065  36.655897  37.299216
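
Row filters of this kind are often combined. As a small extension that goes beyond the original example, the sketch below joins two column conditions with `&` and `|`; the parentheses are required because these operators bind more tightly than comparisons. `df_demo` and its rounded values are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [38.0, 61.0, 66.0, 3.0],
                        'b': [81.0, 24.0, 48.0, 25.0]},
                       index=['one', 'two', 'three', 'four'])

# Rows where both conditions hold
both = df_demo[(df_demo['a'] > 50) & (df_demo['b'] < 30)]
# Rows where at least one condition holds
either = df_demo[(df_demo['a'] > 50) | (df_demo['b'] > 50)]

print(list(both.index))    # ['two']
print(list(either.index))  # ['one', 'two', 'three']
```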

3.4 Indexing with a condition on multiple columns: columns not covered by the condition come back entirely NaN

b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])
# Equivalent to df[df[['a','b']] > 50]

–> Output:

           a      b
one    False   True
two     True  False
three   True  False
four   False  False <class 'pandas.core.frame.DataFrame'>

               a          b   c   d
one          NaN  81.009721 NaN NaN
two    61.818468        NaN NaN NaN
three  66.612955        NaN NaN NaN
four         NaN        NaN NaN NaN

3.5 Indexing with a condition on multiple rows: rows not covered by the condition come back entirely NaN

b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])
# Equivalent to df[df.loc[['one','three']] < 50]

–> Output:

           a      b      c     d
one     True  False  False  True
three  False   True   True  True <class 'pandas.core.frame.DataFrame'>

               a          b          c          d
one    38.986549        NaN        NaN   6.768009
two          NaN        NaN        NaN        NaN
three        NaN  48.643065  36.655897  37.299216
four         NaN        NaN        NaN        NaN
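
When the boolean mask covers only some of the rows, pandas aligns it with the full frame and treats the uncovered rows as False. A minimal sketch of that alignment, using a hypothetical three-row `df_demo` with rounded values:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [38.0, 61.0, 66.0],
                        'b': [81.0, 24.0, 48.0]},
                       index=['one', 'two', 'three'])

# The mask only has rows 'one' and 'three'
mask = df_demo.loc[['one', 'three']] < 50
result = df_demo[mask]

print(list(result.index))                # ['one', 'two', 'three']
print(result.loc['two'].isna().all())    # True: 'two' is not in the mask
print(result.loc['one', 'a'])            # 38.0
print(pd.isna(result.loc['one', 'b']))   # True: 81.0 fails the condition
```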