[Python Data Analysis (6)] The pandas Series data structure: indexing (slicing), viewing, reindexing (alignment), and adding, deleting, and modifying data

1. Series indexing

1.1 Positional indexing, similar to a sequence

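1) Positions start at 0

2) The result is a numpy.float64

3) float() converts it to a Python float

4) numpy.float64 and Python float are distinct types; both hold a 64-bit double, so the value is unchanged by the conversion, although the objects may differ in memory footprint

5) Similar to a sequence, but not identical: with the default integer index, s[-1] is looked up as the label -1 and raises a KeyError, so negative indexing such as the s[-1] below does not work; use s.iloc[-1] to count from the end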

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5))
print(s)
print('————————————')
print(s[0],type(s[0]),s[0].dtype)
print(float(s[0]),type(float(s[0])))
#print(s[-1])  # raises KeyError: -1 is treated as a label, not a position; use s.iloc[-1]

–> Output:

0    0.342137
1    0.478456
2    0.500634
3    0.015527
4    0.475779
dtype: float64
————————————
0.3421371114012325 <class 'numpy.float64'> float64
0.3421371114012325 <class 'float'>
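
As a minimal sketch of point 5, the snippet below contrasts label lookup with positional lookup on a default integer index. The values are made up for illustration, and .iloc is used here as the explicit positional accessor, which the original example does not use.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])  # default integer index 0, 1, 2

last = s.iloc[-1]   # positional access: negative positions count from the end

try:
    s[-1]           # label lookup: there is no label -1 in the index
    raised = False
except KeyError:
    raised = True

print(last, raised)
print(isinstance(last, float))  # numpy.float64 is a subclass of Python float
```

Using .iloc for positions and .loc for labels keeps the intent explicit, whatever the index type.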

1.2 Label indexing

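1) The syntax matches positional indexing: put the index label inside []. The labels here are strings, so they must be quoted

2) To select the values of several labels, use [[]] (that is, a list inside [])

3) Selecting several labels returns a new Series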

s = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(s)
print(s['a'],type(s['a']),s['a'].dtype)
print('————————————')
sci = s[['a','b','e']]
print(sci,type(sci))

–> Output:

a    0.036333
b    0.882987
c    0.384999
d    0.730299
e    0.006695
dtype: float64
0.03633292153461065 <class 'numpy.float64'> float64
————————————
a    0.036333
b    0.882987
e    0.006695
dtype: float64 <class 'pandas.core.series.Series'>
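
The same selections can also be written with the explicit label accessor .loc. This is a sketch with fixed values so the result is reproducible; .loc does not appear in the original example.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5, 4.5, 5.5], index=['a', 'b', 'c', 'd', 'e'])

one = s.loc['a']                   # single label -> scalar value
several = s.loc[['a', 'b', 'e']]   # list of labels -> new Series

print(one)
print(several)
print(type(several))
```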

1.3 Slice indexing

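1) Slicing by index label includes the end label; slicing by position excludes the end position

2) Positional slices are written exactly as for a list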

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(s1[1:4],s1[4])
print(s2['a':'c'],s2['c'])
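print(s2.iloc[0:3],s2.iloc[3])  # explicit positional access; s2[3] on a string-labelled Series is deprecated in pandas 2.x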

–> Output:

1    0.122908
2    0.755420
3    0.457033
dtype: float64 0.11856745158761783
a    0.822246
b    0.978402
c    0.099879
dtype: float64 0.09987899648438314
a    0.822246
b    0.978402
c    0.099879
dtype: float64 0.05134005655420537

print(s2[:-1])
print(s2[::2])

–> Output:

a    0.822246
b    0.978402
c    0.099879
d    0.051340
dtype: float64
a    0.822246
c    0.099879
e    0.743825
dtype: float64
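
To make the end-inclusive versus end-exclusive rule concrete, here is a small sketch with fixed values. It uses .loc and .iloc, which are not in the original code, to spell out which kind of slice is meant.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['a':'c']     # label slice: the end label 'c' is included
by_position = s.iloc[0:2]     # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```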

1.4 Boolean indexing

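1) Applying a comparison to a Series returns a new Series of boolean values

2) .isnull() / .notnull() test for missing values (None is Python's null object and NaN is the floating-point "not a number" marker; both are treated as missing)

3) Write the selection as s[condition], where the condition is either a comparison expression or a boolean Series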

s = pd.Series(np.random.rand(3)*100)
s[4] = None  # add a missing value
print(s)

–> Output:

0     76.386
1    42.4521
2    50.1977
4       None
dtype: object

bs1 = s > 50
bs2 = s.isnull()
bs3 = s.notnull()
print(bs1, type(bs1), bs1.dtype)
print(bs2, type(bs2), bs2.dtype)
print(bs3, type(bs3), bs3.dtype)

–> Output:

0     True
1    False
2     True
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
0    False
1    False
2    False
4     True
dtype: bool <class 'pandas.core.series.Series'> bool
0     True
1     True
2     True
4    False
dtype: bool <class 'pandas.core.series.Series'> bool

print(s[s > 50])
print(s[bs3])

–> Output:

0     76.386
2    50.1977
dtype: object
0     76.386
1    42.4521
2    50.1977
dtype: object
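
Boolean masks can also be combined. The sketch below uses fixed values, with np.nan standing in for the missing entry (an assumption; the original assigns None), and joins two conditions with &.

```python
import numpy as np
import pandas as pd

s = pd.Series([76.4, 42.5, 50.2, np.nan])

# Each condition must be parenthesised when combined with & or |.
# NaN > 50 already evaluates to False; notnull() only makes the intent explicit.
mask = (s > 50) & s.notnull()

selected = s[mask]        # values above 50
missing = s[s.isnull()]   # only the missing entries

print(selected)
print(missing)
```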

2. The pandas Series data structure: basic techniques

2.1 Viewing data

1) .head() shows the first rows

2) .tail() shows the last rows

3) Both show 5 rows by default; pass a number in the parentheses to change that

s = pd.Series(np.random.rand(50))
print(s.head(7))
print('————————————')
print(s.tail())

–> Output:

0    0.534197
1    0.753950
2    0.580653
3    0.160018
4    0.541621
5    0.497129
6    0.005186
dtype: float64
————————
45    0.407972
46    0.093476
47    0.531512
48    0.279823
49    0.863481
dtype: float64
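
For a reproducible check of the defaults, this sketch replaces the random values with the integers 0 to 9, which is an assumption made purely for illustration.

```python
import pandas as pd

s = pd.Series(range(10))

first_three = s.head(3)   # first 3 rows
last_two = s.tail(2)      # last 2 rows
default_head = s.head()   # no argument: 5 rows

print(list(first_three))
print(list(last_two))
print(len(default_head))
```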

2.2 Reindexing with reindex

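1) .reindex() returns a new Series arranged to match the given index; a label that does not exist in the current index gets a missing value (NaN)
2) .reindex() also takes a list of labels
3) The fill_value parameter sets the value used in place of missing values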

s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)
print('————————————')
s1 = s.reindex(['c','b','a','d'])
print(s1)
print('————————————')
s2 = s.reindex(['c','b','a','d'], fill_value = 0)
print(s2)

–> Output:

a    0.901837
b    0.006975
c    0.770128
dtype: float64
————————
c    0.770128
b    0.006975
a    0.901837
d         NaN
dtype: float64
————————
c    0.770128
b    0.006975
a    0.901837
d    0.000000
dtype: float64
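
The following sketch repeats the reindex calls with fixed values so the outcome can be checked, and confirms that the original Series is left untouched.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

r1 = s.reindex(['c', 'b', 'a', 'd'])                 # 'd' is new -> NaN
r2 = s.reindex(['c', 'b', 'a', 'd'], fill_value=0)   # 'd' is new -> 0

print(r1)
print(r2)
print(list(s.index))   # the original index is unchanged
```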

2.3 Series alignment

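1) The main difference between a Series and an ndarray is that operations on Series align automatically by label

2) The order of the index does not affect the calculation; values are matched by label

3) Any calculation involving a missing value yields a missing value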

s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
print(s1)
print('————————————')
print(s2)
print('————————————')
print(s1+s2)

–> Output:

Jack     0.978701
Marry    0.060546
Tom      0.046649
dtype: float64
————————————
Wang     0.728876
Jack     0.010574
Marry    0.626720
dtype: float64
————————————
Jack     0.989275
Marry    0.687265
Tom           NaN
Wang          NaN
dtype: float64
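
If the NaN for unmatched labels is unwanted, one option is Series.add with fill_value, which treats a label missing from one side as the given value. This is a sketch with fixed values and goes slightly beyond the original example, which only uses +.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['Jack', 'Marry', 'Tom'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['Wang', 'Jack', 'Marry'])

plain = s1 + s2                     # unmatched labels become NaN
filled = s1.add(s2, fill_value=0)   # a label missing on one side counts as 0

print(plain)
print(filled)
```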

2.4 Deleting data from a Series

1) .drop() deletes rows by label and returns a copy

2) To change the original data, add the inplace = True argument inside the parentheses

s = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s)
print('————————————')
s1 = s.drop('n')
print(s1)

–> Output: (only one row deleted)

n    0.447854
g    0.987785
j    0.859756
u    0.579510
r    0.323817
dtype: float64
————————————
g    0.987785
j    0.859756
u    0.579510
r    0.323817
dtype: float64

s2 = s.drop(['g','j'])
print(s2)

–> Output: (several rows deleted)

n    0.012650
u    0.909471
r    0.883521
dtype: float64

print(s)
print('————————————')
s.drop(['g','j'],inplace =True)
print(s)

–> Output: (s was unchanged by all the earlier drop calls; with the inplace argument the original data changes)

n    0.836100
g    0.064396
j    0.164702
u    0.021515
r    0.700410
dtype: float64
————————————
n    0.836100
u    0.021515
r    0.700410
dtype: float64
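
A compact, reproducible version of the drop behaviour is sketched below with fixed values. The errors='ignore' argument is an addition not used in the original; it skips labels that are absent instead of raising a KeyError.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('ngjur'))

dropped = s.drop(['g', 'j'])                    # returns a copy without 'g' and 'j'
ignored = s.drop(['g', 'x'], errors='ignore')   # 'x' does not exist and is skipped

print(list(dropped.index))
print(list(ignored.index))
print(len(s))   # the original still has all 5 rows
```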

2.5 Adding data to a Series

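1) Assign directly through an index label: assigning to a label that does not exist yet adds a new entry, while assigning to an existing label (as s2['a'] does below) overwrites its value

2) Append a whole Series with the .append() method; Series.append was removed in pandas 2.0, where pd.concat([s1, s2]) is the replacement

3) Either way the result is a new Series; the original Series is not changed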

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = list('abcde'))
print(s1)
print('————————————')
print(s2)

–> Output:

0    0.896048
1    0.200918
2    0.524321
3    0.967659
4    0.602467
dtype: float64
————————————
a    0.467757
b    0.977001
c    0.819440
d    0.905913
e    0.506907
dtype: float64

s1[5] = 100
s2['a'] = 100
print(s1)
print('————————————')
print(s2)

–> Output:

0      0.896048
1      0.200918
2      0.524321
3      0.967659
4      0.602467
5    100.000000
dtype: float64
————————————
a    100.000000
b      0.977001
c      0.819440
d      0.905913
e      0.506907
dtype: float64

s3 = pd.concat([s1, s2])  # written as s1.append(s2) on pandas < 2.0; Series.append was removed in 2.0
print(s3)
print('————————————')
print(s1)

–> Output:

0      0.896048
1      0.200918
2      0.524321
3      0.967659
4      0.602467
5    100.000000
a    100.000000
b      0.977001
c      0.819440
d      0.905913
e      0.506907
dtype: float64
————————————
0      0.896048
1      0.200918
2      0.524321
3      0.967659
4      0.602467
5    100.000000
dtype: float64
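
On current pandas versions the combining step can be sketched with pd.concat. The values are fixed for reproducibility, and ignore_index=True is an optional extra not in the original that renumbers the result from 0.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=[0, 1])
s2 = pd.Series([3.0, 4.0], index=['a', 'b'])

combined = pd.concat([s1, s2])                        # keeps the labels of both inputs
renumbered = pd.concat([s1, s2], ignore_index=True)   # fresh index 0..3

print(combined)
print(renumbered)
print(len(s1))   # the inputs are not modified
```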

2.6 Modifying data in a Series

Modify values directly through the index, much as with a sequence

s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)
print('————————————')
s['a'] = 100
s[['b','c']] = 200
print(s)

–> Output:

a    0.529583
b    0.367619
c    0.226178
dtype: float64
————————————
a    100.0
b    200.0
c    200.0
dtype: float64
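
Assignment also works through .loc and through a boolean mask. The sketch below uses fixed values; the mask-based assignment is an extension of the original example, combining it with the boolean indexing from section 1.4.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

s.loc['a'] = 100           # one label
s.loc[['b', 'c']] = 200    # several labels receive the same value
s[s > 150] = 0             # every value matching the condition is overwritten

print(s)
```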