The Pandas Library

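Data analysis

1. Getting Started with Pandas

Pandas provides high-performance, easy-to-use data types and analysis tools.

1.1 Introduction to Pandas

Installation

pip install -i https://mirrors.aliyun.com/pypi/simple/ pandas

Usage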

import pandas as pd
a = pd.Series(range(3))
a
# Out[]:
0      0
1      1
2      2

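Understanding the Pandas library

Pandas is an extension library built on NumPy and is commonly used together with NumPy and Matplotlib.
It has two data types: Series (one-dimensional) and DataFrame (two-dimensional and above).
It provides a range of operations on these types: basic operations, arithmetic operations, feature (statistical) operations, and association (correlation) operations.

NumPy	Pandas
Base data type `ndarray`	Extended data types `Series` and `DataFrame`
Focuses on the structural representation of data (the dimensions between data items)	Focuses on how the data is used in applications
Dimension: the relationship among data items	The relationship between data and its index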
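
To make the contrast in the table concrete, here is a minimal sketch. The variable names (`arr`, `ser`) and the labels 'x', 'y', 'z' are illustrative assumptions, not part of the original notes. It shows that an ndarray is addressed by position only, while a Series carries the same values together with index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: the same three values, with and without labels
arr = np.array([10, 20, 30])                  # ndarray: addressed by position
ser = pd.Series(arr, index=['x', 'y', 'z'])   # Series: values plus index labels

second_by_position = arr[1]    # positional access on the ndarray
second_by_label = ser['y']     # label access on the Series
as_array = ser.to_numpy()      # drop the labels and get a plain ndarray back
```

Both lookups return the same element; the difference is that the Series lookup goes through the index rather than the position.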

1.2 `Series`

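A Series consists of a set of data values and the index associated with them.

1.2.1 Creating a Series

From a Python list

If an index is given, it must have the same number of elements as the list.

import pandas as pd
# Construct a Series object
a = pd.Series([3, 4])
a
# Out[]: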
0    3
1    4
dtype: int64

b = pd.Series([3, 4], index = ['a', 'b'])
b
# Out[]:
a    3
b    4
dtype: int64

From a scalar

The index determines the size of the Series.

pd.Series(25, index = [0, 1, 2])
# Out[]:
0    25
1    25
2    25
dtype: int64

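From a dict

The keys become the index and the values become the data. Passing index selects entries from the dict; labels that are not in the dict yield NaN.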

pd.Series({'a' : 2, 'b' : 3})
# Out[]:
a    2
b    3
dtype: int64

# Reorder the index
# This can be seen as using index to select entries from the dict
pd.Series({'a' : 2, 'b' : 3}, index= ['b', 'c', 'a'])
# Out[]:
b    3.0
c    NaN
a    2.0
dtype: float64

From an ndarray

import numpy as np
pd.Series(np.arange(3))
# Out[]:
0    0
1    1
2    2
dtype: int32    # platform-dependent; many systems show int64 here

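1.2.2 Basic Operations on a Series

A Series has two parts: index and values.

Operating on a Series is similar to operating on an ndarray:

Indexing works the same way, using [].
NumPy operations and functions can be applied to a Series.
A list of custom index labels can be used to select several elements.
Slicing also works through the automatic (positional) index.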

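a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
# Get the index
a.index # Index(['a', 'b', 'c', 'd'], dtype='object')
# Get the values
a.values # array([2, 5, 7, 4], dtype=int64)

a['a'] # 2
a.iloc[0] # 2 (a single index returns just the value; use .iloc for positions, since a[0] on a labeled Series is deprecated in recent pandas)

a[['a', 'b', 2]] # raises KeyError: labels and positions cannot be mixed in one list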
a[['a', 'b', 'c']]
# a    2
# b    5
# c    7
# dtype: int64

a[:3]
# a    2
# b    5
# c    7
# dtype: int64

# Boolean indexing with a comparison
a[a > a.median()]
# b    5
# c    7
# dtype: int64

a**2
# a     4
# b    25
# c    49
# d    16
# dtype: int64

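Operating on a Series is also similar to operating on a Python dict:

Access values by custom index label.
Use the keyword in.
Use the .get() method.

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])

'c' in a # True
0 in a # False (in tests the index labels, not positions or values)

a.get('f', 100) # 100
a.get('a', 100) # 2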

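Series alignment

In arithmetic, Series are automatically aligned on their index labels; a label present in only one operand yields NaN.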

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
b = pd.Series([6, 8, 9], index=['c', 'a', 'f'])
a + b
# a    10.0
# b     NaN
# c    13.0
# d     NaN
# f     NaN
# dtype: float64
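
If NaN for unmatched labels is not what you want, one option is the method form with `fill_value`. The sketch below reuses the two Series from the example above; `total` and `plain` are illustrative names, and treating a missing label as 0 is an assumption chosen for the example.

```python
import pandas as pd

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
b = pd.Series([6, 8, 9], index=['c', 'a', 'f'])

# Operator form: labels found in only one operand become NaN
plain = a + b

# Method form: a label missing from one side is treated as 0 before adding
total = a.add(b, fill_value=0)
```

Here `total` holds 10, 5, 13, 4 and 9 for the labels a, b, c, d and f, while `plain` has NaN at b, d and f.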

1.2.3 The name Attribute of a Series

Both a Series object and its index can have a name, stored in .name.

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
a.name = 'Series object'
a.index.name = 'index label'

a
# index label
# a    2
# b    5
# c    7
# d    4
# Name: Series object, dtype: int64

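1.2.4 Modifying a Series

A Series can be modified at any time, and the change takes effect immediately.

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
a[['a', 'b']] = [20, 22] # assign several labels at once by passing a list of labels
a['a'] # 20
a['b'] # 22

a[['a', 'b']] = 20 # a scalar is broadcast to every selected label
a['a'] # 20
a['b'] # 20

The key to understanding Series is that it is a one-dimensional labeled array. Its basic operations resemble those of an ndarray and a dict, and values are aligned by index.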

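1.3 DataFrame

A DataFrame consists of a set of columns that share the same index.

It is a tabular data type; the columns may have the same or different data types.
It has a row index (index) and a column index (columns).
It can represent two-dimensional data and can also represent multi-dimensional data.

1.3.1 Creating a DataFrame

From a two-dimensional ndarray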

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(10).reshape(2,5))
a
#   0   1   2   3   4
# 0 0   1   2   3   4
# 1 5   6   7   8   9

From a dict of one-dimensional ndarrays, lists, dicts, tuples, or Series

azd = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
       'two': pd.Series([6, 7, 8, 9], index=['a', 'b', 'c', 'd'])}
a = pd.DataFrame(azd)
a
#   one two
# a 1.0 6
# b 2.0 7
# c 3.0 8
# d NaN 9

# Select rows and columns while constructing; a column that does not exist is filled with NaN
pd.DataFrame(a, index=['b', 'c', 'd'], columns=['two', 'three'])
#   two three
# b 7   NaN
# c 8   NaN
# d 9   NaN

alb = {'one': [1, 2, 3, 4], 'two': [6, 7, 8, 9]}
a = pd.DataFrame(alb, index=['a', 'b', 'c', 'd'])
a
#   one two
# a 1   6
# b 2   7
# c 3   8
# d 4   9

From Series
From another DataFrame
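
The notes list these last two creation routes without an example. The following is a minimal sketch of one way to do each; the names `s`, `t`, `df_from_series`, `df_two_cols` and `df_copy` are illustrative assumptions, and other approaches exist.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='one')
t = pd.Series([6, 7, 8], index=['a', 'b', 'c'], name='two')

# From a single Series: the Series name becomes the column label
df_from_series = s.to_frame()

# From several Series: place them side by side as columns, aligned on the index
df_two_cols = pd.concat([s, t], axis=1)

# From another DataFrame: copy=True gives an independent copy of the data
df_copy = pd.DataFrame(df_two_cols, copy=True)
df_copy.loc['a', 'one'] = 100   # does not affect df_two_cols
```

Asking for `copy=True` makes the independence of the new DataFrame explicit, regardless of the default copy behavior of the installed pandas version.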

1.3.2 Basic Operations on a DataFrame

a.index # Index(['a', 'b', 'c', 'd'], dtype='object')
a.columns # Index(['one', 'two'], dtype='object')

# Get a column directly by its column label
a['two']
# a    6
# b    7
# c    8
# d    9
# Name: two, dtype: int64

# Get a row
a.loc['a']
# one    1
# two    6
# Name: a, dtype: int64

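# Get a single value
a['one']['a'] # 1

# Modify in place (use a single .loc call; chained assignment such as a['one']['a'] = 2 is unreliable)
a.loc['a', 'one'] = 2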
a
#   one two
# a 2   6
# b 2   7
# c 3   8
# d 4   9

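The key to understanding DataFrame is that it is a two-dimensional labeled array. Its basic operations resemble those of Series and rely on the row and column indexes.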
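
As a complement to the bracket-based access above, here is a short sketch of the explicit indexers `.loc`, `.iloc` and `.at`. The frame mirrors the earlier `one`/`two` example; the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [6, 7, 8, 9]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['a', 'one']    # row label, then column label
by_position = df.iloc[0, 1]      # row position, then column position
scalar = df.at['d', 'two']       # scalar access by label

df.loc['a', 'one'] = 2           # assignment in a single indexing step
rows_bc = df.loc['b':'c']        # label slices include both endpoints
```

Note that, unlike positional slices, a label slice such as `'b':'c'` includes the end label.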

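1.4 Operating on the Data Types

1.4.1 Adding or Reordering Data: .reindex()

# Column names: 城市 = city, 环比 = month-over-month, 地基 (likely a typo for 定基) = fixed-base, 同比 = year-over-year
import pandas as pd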
azd = {'城市' : ['北京', '上海', '广州', '长沙'],
    '环比': [1, 2, 3, 4],
    '地基': [5, 6, 7, 8],
    '同比': [9, 10, 11, 12]}
a = pd.DataFrame(azd, index = ['c1', 'c2', 'c3', 'c4'])
a
#     城市    环比  地基  同比
# c1    北京  1   5   9
# c2    上海  2   6   10
# c3    广州  3   7   11
# c4    长沙  4   8   12

# Reorder the rows
a.reindex(index=['c3', 'c2', 'c1', 'c4'])
#      城市   环比  地基  同比
# c3    广州  3   7   11
# c2    上海  2   6   10
# c1    北京  1   5   9
# c4    长沙  4   8   12

# Reorder the columns
a.reindex(columns=['城市', '地基', '同比','环比'])
#      城市   地基  同比  环比
# c1    北京  5   9   1
# c2    上海  6   10  2
# c3    广州  7   11  3
# c4    长沙  8   12  4

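`.reindex()` parameter	Description
`index` `columns`	New custom labels for the rows and columns
`fill_value`	Value used to fill positions that are missing after reindexing
`method`	Fill method: `ffill` propagates the previous value forward, `bfill` fills backward from the next value
`limit`	Maximum number of consecutive positions to fill
`copy`	Default `True`: return a new object; with `False`, no copy is made when the new and old indexes are equal

# a.columns is an Index object; it supports list-like methods such as insert(), which return a new Index (see the table of Index methods below)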
columns_new = a.columns.insert(4, '新增')
columns_new # Index(['城市', '环比', '地基', '同比', '新增'], dtype='object')

b = a.reindex(columns=columns_new, fill_value=10)
b # a itself is unchanged
#     城市    环比  地基  同比  新增
# c1    北京  1   5   9   10
# c2    上海  2   6   10  10
# c3    广州  3   7   11  10
# c4    长沙  4   8   12  10

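The indexes of both Series and DataFrame are of type Index.

Common Index methods	Description
`.append(idx)`	Concatenate another Index object, producing a new Index object
`.difference(idx)`	Compute the set difference with another Index object
`.intersection(idx)`	Compute the intersection
`.union(idx)`	Compute the union
`.delete(loc)`	Delete the element at position loc, producing a new Index object (list-like operation)
`.insert(loc, 'e')`	Insert the element e at position loc (list-like operation)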
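
The methods in the table can be tried on two small Index objects. This is a sketch with made-up labels (`idx1`, `idx2` and the label 'cx' are assumptions for illustration); every call returns a new Index because Index objects are immutable.

```python
import pandas as pd

idx1 = pd.Index(['c1', 'c2', 'c3'])
idx2 = pd.Index(['c3', 'c4'])

appended = idx1.append(idx2)        # concatenation, duplicates are kept
common = idx1.intersection(idx2)    # labels present in both
merged = idx1.union(idx2)           # labels present in either
only_in_1 = idx1.difference(idx2)   # labels in idx1 but not in idx2
without_first = idx1.delete(0)      # remove the element at position 0
with_new = idx1.insert(1, 'cx')     # insert 'cx' at position 1
```

After all of these calls `idx1` still holds its original three labels.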

ind_new = a.index.insert(4, 'c5')
ind_new # Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')

b = a.reindex(index=ind_new, method='ffill')
b
#       城市  环比  地基  同比
# c1    北京  1   5   9
# c2    上海  2   6   10
# c3    广州  3   7   11
# c4    长沙  4   8   12
# c5    长沙  4   8   12

1.4.2 Deleting: .drop()

# Drop rows (returns a new object; a is unchanged)
a.drop(['c1', 'c2'])
#     城市    环比  地基  同比
# c3    广州  3   7   11
# c4    长沙  4   8   12

# Drop columns: pass axis=1 to operate along the second dimension
a.drop(['环比', '同比'], axis = 1)
#     城市    地基
# c1    北京  5
# c2    上海  6
# c3    广州  7
# c4    长沙  8
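
A small sketch of the keyword forms of `.drop()`, using a made-up frame (`df`, with columns 'x' and 'y', is an assumption for illustration). It also shows that the original object is left untouched.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['r1', 'r2', 'r3'])

no_r1 = df.drop(index='r1')      # equivalent to df.drop('r1')
no_y = df.drop(columns='y')      # equivalent to df.drop('y', axis=1)

# errors='ignore' skips labels that do not exist instead of raising KeyError
safe = df.drop(columns=['z'], errors='ignore')
```

The `index=` and `columns=` keywords can be easier to read than `axis=0` and `axis=1`, since they name the dimension directly.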

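1.5 Data Operations

Operands of the same dimension are aligned and padded; operands of different dimensions are broadcast.

1.5.1 Arithmetic on Pandas Data Types

Operands are aligned on the row and column indexes, padded, and then combined; the result is floating-point whenever padding introduces NaN.
Padded positions are filled with NaN by default.
Operations between two-dimensional and one-dimensional objects, or between one-dimensional objects and scalars, are broadcast.

Same-dimension operations: pad, then compute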

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
a
#   0   1   2   3
# 0 0   1   2   3
# 1 4   5   6   7
# 2 8   9   10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

a + b # aligned on labels, padded with NaN, then added
#   0   1   2   3   4
# 0 0.0 2.0 4.0 6.0 NaN
# 1 9.0 11.0    13.0    15.0    NaN
# 2 18.0    20.0    22.0    24.0    NaN
# 3 NaN NaN NaN NaN NaN

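Method form	Description
`.add(d, **kwargs)`	Addition between types, with optional parameters
`.sub(d, **kwargs)`	Subtraction between types, with optional parameters
`.mul(d, **kwargs)`	Multiplication between types, with optional parameters
`.div(d, **kwargs)`	Division between types, with optional parameters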

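a.add(b, fill_value=100)
#   0   1   2   3   4
# 0 0.0 2.0 4.0 6.0 104.0
# 1 9.0 11.0    13.0    15.0    109.0
# 2 18.0    20.0    22.0    24.0    114.0
# 3 115.0   116.0   117.0   118.0   119.0

Broadcasting: for operands of different dimensions, the lower-dimensional operand is applied to every element of the higher-dimensional one.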

c = pd.Series(np.arange(3))
c
# 0    0
# 1    1
# 2    2
# dtype: int32

c - 10
# 0   -10
# 1    -9
# 2    -8
# dtype: int32

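Two-dimensional minus one-dimensional: by default (axis=1) c is aligned with the column labels and subtracted from every row, and unmatched positions are filled with NaN. Passing axis=0 aligns c with the row labels instead.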

a
#   0   1   2   3
# 0 0   1   2   3
# 1 4   5   6   7
# 2 8   9   10  11

a - c
#   0   1   2   3
# 0 0.0 0.0 0.0 NaN
# 1 4.0 4.0 4.0 NaN
# 2 8.0 8.0 8.0 NaN

c - a
#   0   1   2   3
# 0 0.0 0.0 0.0 NaN
# 1 -4.0    -4.0    -4.0    NaN
# 2 -8.0    -8.0    -8.0    NaN

a.sub(c, axis = 0)
#   0   1   2   3
# 0 0   1   2   3
# 1 3   4   5   6
# 2 6   7   8   9
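
A common use of the two broadcast directions is centering data. The sketch below is an illustration under the assumption that we want to subtract column means and row means from the same 3x4 frame; `col_centered` and `row_centered` are hypothetical names.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))

# a.mean() is a Series labeled by column, so the default axis=1 alignment
# subtracts each column's mean from that column
col_centered = a - a.mean()

# a.mean(axis=1) is labeled by row, so axis=0 is needed to align on rows
row_centered = a.sub(a.mean(axis=1), axis=0)
```

In both cases the Series labels match the axis they are aligned with, so no NaN padding appears.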

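1.5.2 Comparison Operations on Pandas Data Types

Only elements with identical index labels can be compared; no padding is performed.
Operations between two-dimensional and one-dimensional objects, or between one-dimensional objects and scalars, are broadcast.
Binary operations with > < >= <= == != produce Boolean objects.

Same-dimension comparison: the shapes must match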

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
b = pd.DataFrame(np.arange(12, 0, -1).reshape(3,4))
b
#   0   1   2   3
# 0 12  11  10  9
# 1 8   7   6   5
# 2 4   3   2   1

a > b
#   0   1   2   3
# 0 False   False   False   False
# 1 False   False   False   True
# 2 True    True    True    True

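Different dimensions: broadcast comparison, default axis=1

c = pd.Series(np.arange(3))

# pandas 2.0+ raises ValueError for a > c because the labels differ; the method form aligns them first
a.gt(c)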
#   0   1   2   3
# 0 False   False   False   False
# 1 True    True    True    False
# 2 True    True    True    False

c > 0
# 0    False
# 1     True
# 2     True
# dtype: bool
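
Boolean results like these are mostly used as masks. The following sketch shows two typical follow-ups on the same 3x4 frame; the thresholds 5 and 3 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))

mask = a > 5                      # element-wise comparison with a scalar
count_per_column = mask.sum()     # True counts as 1, False as 0

# Keep only the rows in which every value is greater than 3
rows_all_gt_3 = a[(a > 3).all(axis=1)]
```

Here `count_per_column` is 1, 1, 2, 2 for the four columns, and `rows_all_gt_3` keeps the rows labeled 1 and 2.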

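2. Analyzing Data Features

2.1 Sorting Data

The .sort_index() method sorts along the given axis by index label, ascending by default (the values move together with their labels).

.sort_index(axis=0, ascending=True)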

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['b', 'c', 'a'], columns = ['b', 'd', 'a', 'c'])

a.sort_index()
#   b   d   a   c
# a 8   9   10  11
# b 0   1   2   3
# c 4   5   6   7

a.sort_index(axis=1)
#   a   b   c   d
# b 2   0   3   1
# c 6   4   7   5
# a 10  8   11  9

The .sort_values() method sorts along the given axis by value, ascending by default (the index labels move together with their values).
Series.sort_values(axis=0, ascending=True)
DataFrame.sort_values(by, axis=0, ascending=True)

a.sort_values('a', axis = 1, ascending=False)
#   c   a   d   b
# b 3   2   1   0
# c 7   6   5   4
# a 11  10  9   8

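By default, NaN values are placed at the end of the sorted result, whether the order is ascending or descending.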
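
The NaN placement can be checked with a small Series. This is a sketch with made-up values; `na_position` is the pandas parameter that controls where NaN goes.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=['a', 'b', 'c', 'd'])

asc = s.sort_values()                           # NaN last
desc = s.sort_values(ascending=False)           # NaN still last
nan_first = s.sort_values(na_position='first')  # NaN moved to the front
```

The label order is c, d, a, b for `asc`, a, d, c, b for `desc`, and b, c, d, a for `nan_first`.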

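2.2 Statistical Analysis

Statistical method	Description
`.sum()`	Sum of the data, along axis=0 by default (likewise below)
`.max()` `.min()`	Maximum and minimum of the data
`.mean()` `.median()`	Arithmetic mean and median of the data
`.var()` `.std()`	Variance and standard deviation of the data
`.count()`	Number of non-NaN values

Statistical method	Description
`.argmin()` `.argmax()`	Integer position (automatic index) of the minimum and maximum
`.idxmin()` `.idxmax()`	Index label (custom index) of the minimum and maximum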
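
A quick sketch of these methods on a small Series (the values are the same ones used in the .describe() example; the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([3, 5, 6, 6], index=['a', 'b', 'c', 'd'])

total = s.sum()          # sum of all values
mean = s.mean()          # arithmetic mean
median = s.median()      # middle value (average of 5 and 6 here)
variance = s.var()       # sample variance (divides by n - 1)
pos_min = s.argmin()     # integer position of the minimum
label_min = s.idxmin()   # index label of the minimum
label_max = s.idxmax()   # index label of the first maximum
```

When the maximum occurs more than once, as with the two 6s here, `idxmax()` returns the label of the first occurrence.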

.describe(): summary statistics along axis=0 (per column)

import pandas as pd
a = pd.Series([3, 5, 6, 6], index = ['a', 'b', 'c', 'd'])

a.describe()
# count    4.000000
# mean     5.000000
# std      1.414214
# min      3.000000
# 25%      4.500000
# 50%      5.500000
# 75%      6.000000
# max      6.000000
# dtype: float64

type(a.describe()) # pandas.core.series.Series
a.describe()['max'] # 6.0

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['b', 'c', 'a'], columns = ['b', 'd', 'a', 'c'])
a.describe()
#       b   d   a   c
# count 3.0 3.0 3.0 3.0
# mean  4.0 5.0 6.0 7.0
# std   4.0 4.0 4.0 4.0
# min   0.0 1.0 2.0 3.0
# 25%   2.0 3.0 4.0 5.0
# 50%   4.0 5.0 6.0 7.0
# 75%   6.0 7.0 8.0 9.0
# max   8.0 9.0 10.0    11.0

type(a.describe()) # pandas.core.frame.DataFrame
# Get column b
a.describe()['b']
# Get the 'count' row
a.describe().loc['count']

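2.3 Cumulative Statistics

Cumulative statistics apply a running computation over the first n values.

Cumulative function	Description
`.cumsum()`	Successively gives the sum of the first 1, 2, ..., n values (n inclusive, likewise below)
`.cumprod()`	Successively gives the product of the first 1, 2, ..., n values
`.cummax()`	Successively gives the maximum of the first 1, 2, ..., n values
`.cummin()`	Successively gives the minimum of the first 1, 2, ..., n values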

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['b', 'c', 'a'], columns = ['b', 'd', 'a', 'c'])
a.cumsum()
#   b   d   a   c
# b 0   1   2   3
# c 4   6   8   10
# a 12  15  18  21

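Rolling (window) computation	Description
`.rolling(w).sum()`	Sum of each run of w adjacent elements
`.rolling(w).mean()`	Arithmetic mean of each run of w adjacent elements
`.rolling(w).var()`	Variance of each run of w adjacent elements
`.rolling(w).std()`	Standard deviation of each run of w adjacent elements
`.rolling(w).max()` `.rolling(w).min()`	Maximum and minimum of each run of w adjacent elements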

a.rolling(2).sum()
#   b   d   a   c
# b NaN NaN NaN NaN
# c 4.0 6.0 8.0 10.0
# a 12.0    14.0    16.0    18.0

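Both rolling and cumulative computations always take the original values as input, never the values produced by earlier steps of the computation.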
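
The difference between the two families is easy to see on a short Series. This is a sketch with made-up values; `min_periods` is the pandas parameter that controls how many observations a window needs before it produces a result.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

cum = s.cumsum()             # running total over everything seen so far
roll = s.rolling(2).sum()    # total over the current and previous element only

# min_periods=1 lets the first, incomplete window produce a value too
roll_min1 = s.rolling(2, min_periods=1).sum()
```

`cum` is 1, 3, 6, 10, while `roll` is NaN, 3, 5, 7: each window sums two original values, not earlier window results. With `min_periods=1` the leading NaN becomes 1.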

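2.4 Correlation Analysis

When X increases and Y increases, the two variables are positively correlated.
When X increases and Y decreases, the two variables are negatively correlated.
When X increases and Y shows no consistent change, the two variables are uncorrelated.

Covariance: for each pair of observations, multiply the deviation of X from its mean by the deviation of Y from its mean, then sum the products and divide by n - 1 (sample covariance).

Pearson correlation coefficient: the covariance divided by the product of the two standard deviations. It measures how closely two data sets fall on a straight line, that is, the strength of the linear relationship between interval-scale variables.

Analysis function	Description
`.cov()`	Compute the covariance (matrix)
`.corr()`	Compute the correlation coefficient (matrix): Pearson, Spearman, Kendall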
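
To connect the definitions with the library calls, the sketch below computes the sample covariance and the Pearson coefficient by hand and compares them with `.cov()` and `.corr()`. The data in `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

dx = x - x.mean()    # deviations of x from its mean
dy = y - y.mean()    # deviations of y from its mean

# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual = (dx * dy).sum() / (len(x) - 1)

# Pearson r: the same sum, normalized by the deviation magnitudes
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

cov_pandas = x.cov(y)     # should agree with cov_manual
r_pandas = x.corr(y)      # Pearson by default; should agree with r_manual
```

The manual and library results should match up to floating-point rounding, which is a useful sanity check on what the two methods compute.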

import pandas as pd
hprice = pd.Series([3.04, 22.92, 12.75, 22.6, 22.33],
                   index = ['2008', '2002', '2010', '2011', '2012'])
m2 = pd.Series([8.18, 18.19, 9.13, 7.87, 6.69],
               index = ['2008', '2002', '2010', '2011', '2012'])
hprice.corr(m2) # 0.29435037215132426 (weak correlation)

import matplotlib.pyplot as plt
plt.plot(hprice.index, hprice, m2.index, m2)
plt.show()