
I. Creating Objects

1. The Series Object

A Series is a one-dimensional array of labeled data.

(1) Creating from a list

pd.Series(data, index=index, dtype=dtype)
data: the data; can be a list, dict, or numpy array
index: the index labels (optional)
dtype: the data type (optional)

① Example

import pandas as pd
# index omitted; defaults to an integer sequence (0, 1, 2, ...)
data = pd.Series([2,4,3,6])
print(data)
'''
0    2
1    4
2    3
3    6
dtype: int64
'''

② Adding an index

import pandas as pd

data = pd.Series([2,4,3,6],index=["a", "b", "c", "d"])
print(data)
'''
a    2
b    4
c    3
d    6
dtype: int64
'''

③ Specifying a dtype

import pandas as pd

data = pd.Series([2,4,3,6],index=["a", "b", "c", "d"],dtype=float)
print(data)
'''
a    2.0
b    4.0
c    3.0
d    6.0
dtype: float64
'''
print(data["c"])    # 3.0

④ The dtype can be coerced

import pandas as pd

data = pd.Series([2,4,"3",6],index=["a", "b", "c", "d"],dtype=float)
print(data)
'''
a    2.0
b    4.0
c    3.0
d    6.0
dtype: float64
'''
print(data["c"])    # 3.0

(2) Creating from a one-dimensional numpy array

import pandas as pd
import numpy as np

x = np.arange(5)
data = pd.Series(x)
print(data)
'''
0    0
1    1
2    2
3    3
4    4
dtype: int32
'''

(3) Creating from a dict

By default the dict keys become the index and the dict values become the data.

import pandas as pd

dic = {"x":1,"y":10}
data = pd.Series(dic)
print(data)
'''
x     1
y    10
dtype: int64
'''

When creating from a dict, if an index is given, pandas looks up each index label among the dict keys; labels not found get NaN.

import pandas as pd

dic = {"x":1,"y":10}
data = pd.Series(dic, index=["x","z"])
print(data)
'''
x    1.0
z    NaN
dtype: float64
'''

(4) When data is a scalar

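When data is a scalar, it is broadcast to every label of the given index. A minimal sketch that produces the output below:

import pandas as pd

data = pd.Series(5, index=["x", "z"])   # the scalar 5 is repeated for each index label
print(data)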
'''
x    5
z    5
dtype: int64
'''

2. The DataFrame Object

A DataFrame is a two-dimensional array of labeled data.
pd.DataFrame(data, index=index, columns=columns)
data: the data; can be a list, dict, or numpy array
index: the row index (optional)
columns: the column labels (optional)

(1) Creating from a Series object

import pandas as pd

dic = {"beijing":110, "shanghai":370}
popu = pd.Series(dic)
dpopu = pd.DataFrame(popu)
print(dpopu)
'''
            0
beijing   110
shanghai  370
'''
import pandas as pd

dic = {"beijing":110, "shanghai":370}
popu = pd.Series(dic)
dpopu = pd.DataFrame(popu,columns=["icode"])
print(dpopu)
'''
          icode
beijing     110
shanghai    370
'''

(2) Creating from a dict of Series objects

import pandas as pd

dic1 = {"beijing":2300, "shanghai":2100}
dic2 = {"beijing":110, "shanghai":370}
popu = pd.Series(dic1)
icode = pd.Series(dic2)
data = pd.DataFrame({"population":popu, "icode":icode, "country":"China"})
# the single value "China" is shorter than the column, so it is broadcast to every row
print(data)
'''
          population  icode country
beijing         2300    110   China
shanghai        2100    370   China
'''

(3) Creating from a list of dicts

The list positions become the index and the dict keys become the columns.

import pandas as pd

data = [{"a":i, "b":2*i} for i in range(3)]
print(pd.DataFrame(data))
'''
   a  b
0  0  0
1  1  2
2  2  4
'''

Keys missing from a dict get the default value NaN.

import pandas as pd

data = [{"a":1,"b":2},{"b":3,"c":10}]
print(pd.DataFrame(data))
'''
     a  b     c
0  1.0  2   NaN
1  NaN  3  10.0
'''

(4) Creating from a two-dimensional numpy array

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randint(10, size=(3,2)), \
                    columns=["foo","bar"],index=["a","b","c"])
print(data)
'''
   foo  bar
a    3    2
b    1    8
c    3    5
'''

II. DataFrame Properties

1. Attributes

(1) values returns the data as a numpy array

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randint(10, size=(3,2)), \
                    columns=["foo","bar"],index=["a","b","c"])

# print(data)
print(data.values)
'''
[[0 5]
 [7 0]
 [4 8]]
'''

(2) index returns the row index

print(data.index)
'''
Index(['a', 'b', 'c'], dtype='object')
'''

(3) columns returns the column index

print(data.columns)
'''
Index(['foo', 'bar'], dtype='object')
'''

(4) shape returns the shape

print(data.shape)   # (3, 2)

(5) size returns the number of elements

print(data.size)   # 6

(6) dtypes returns the data type of each column

print(data.dtypes)
'''
foo    int32
bar    int32
dtype: object
'''

2. Indexing

(1) Selecting columns

Dictionary style:

import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(6).reshape(3,2), \
                    columns=["foo","bar"],index=["a","b","c"])

print(data)
'''
   foo  bar
a    0    1
b    2    3
c    4    5
'''
print(data["foo"])
'''
a    0
b    2
c    4
Name: foo, dtype: int32
'''
print(data[["foo", "bar"]])
'''
   foo  bar
a    0    1
b    2    3
c    4    5
'''

Attribute style:

print(data.bar)
'''
a    1
b    3
c    5
Name: bar, dtype: int32
'''

(2) Selecting rows

Label-based indexing with loc:

print(data.loc["b"])
'''
foo    2
bar    3
Name: b, dtype: int32
'''

Position-based indexing with iloc:

print(data.iloc[1])
'''
foo    2
bar    3
Name: b, dtype: int32
'''
print(data.iloc[[0,2]])
'''
   foo  bar
a    0    1
c    4    5
'''

(3) Selecting a scalar

print(data.loc["b","bar"])  # 3
print(data.iloc[0,1])       # 1
print(data.values[0][1])    # 1

(4) Indexing the Series returned by column selection

print(type(data.foo))   # <class 'pandas.core.series.Series'>

print(data.foo["c"]) # 4

3. Slicing

import pandas as pd
import numpy as np

datas = pd.date_range(start="2019-01-01", periods=6)
print(datas)
'''
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
'''
df = pd.DataFrame(np.random.randn(6,4), index=datas,columns=["A","B","C","D"])
print(df)
'''
                   A         B         C         D
2019-01-01 -0.593472 -0.526596 -0.663579 -0.475506
2019-01-02  0.029637 -1.542327  1.446231 -0.219709
2019-01-03  0.312669 -0.540142  0.106548 -0.569854
2019-01-04 -0.031100  1.409991 -0.625770  1.349713
2019-01-05 -0.752705 -0.302528  0.043599  0.592143
2019-01-06  0.956202 -0.393068  0.466223 -1.890532
'''

(1) Row slices

print(df["2019-01-01":"2019-01-03"])
print(df.loc["2019-01-01":"2019-01-03"])
print(df.iloc[0:3])
'''
                   A         B         C         D
2019-01-01 -0.563258 -0.981668 -0.038098  0.313748
2019-01-02  1.453888 -1.075848  1.452511 -0.562839
2019-01-03  0.797852  0.774357  1.796320  1.337514
'''

(2) Column slices

print(df.loc[:, "A":"C"])
print(df.iloc[:,0:3])
'''
                   A         B         C
2019-01-01  0.121463 -2.668285  0.175662
2019-01-02 -0.042151  1.250018  0.964810
2019-01-03  0.641962  0.892863 -0.091651
2019-01-04 -0.381722  0.014011 -0.962964
2019-01-05  1.158018 -0.030124  0.599618
2019-01-06  0.569749 -0.435110 -0.319675
'''

(3) A variety of selections

Slicing rows and columns at the same time:

print(df.loc["2019-01-02":"2019-01-04", "B":"C"])
print(df.iloc[1:4,1:3])
'''
                   B         C
2019-01-02  1.885370  0.439749
2019-01-03 -1.054281  0.271491
2019-01-04 -0.781519 -0.872194
'''

Row slice combined with scattered column selection:

print(df.loc["2019-01-04":"2019-01-06", ["A","C"]])
print(df.iloc[3:, [0,2]])
'''
                   A         C
2019-01-04  0.057934  0.415995
2019-01-05  0.656228  0.836275
2019-01-06 -0.956402  0.720133
'''

Scattered row selection combined with a column slice:

print(df.loc[["2019-01-04", "2019-01-06"], "C":"D"])
print(df.iloc[[3,5],2:4])
'''
                   C         D
2019-01-04 -0.796464 -1.371296
2019-01-06  2.131938 -1.106263
'''

Scattered selection on both rows and columns:

print(df.loc[["2019-01-04","2019-01-06"], ["B", "D"]])
print(df.iloc[[3,5],[1,3]])
'''
                   B         D
2019-01-04 -0.320283  1.346262
2019-01-06 -0.216891 -0.844410
'''

4. Boolean Indexing

print(df[df>0])
'''
                   A         B         C         D
2019-01-01  1.170066       NaN       NaN       NaN
2019-01-02  0.786002  2.158762       NaN       NaN
2019-01-03       NaN       NaN  0.322335  0.602991
2019-01-04       NaN       NaN       NaN       NaN
2019-01-05  0.416069       NaN  0.838723  0.687255
2019-01-06  0.277207  0.086217       NaN       NaN
'''
print(df.A>0)   # check whether each element of column A is greater than 0
'''
2019-01-01     True
2019-01-02     True
2019-01-03    False
2019-01-04    False
2019-01-05    False
2019-01-06     True
'''
print(df[df.A>0])
'''
                   A         B         C         D
2019-01-01  0.590420 -1.282202  0.318478  0.415096
2019-01-02  2.072327 -0.121314  1.713179  1.663085
2019-01-06  0.106245 -0.522096  0.417755 -0.524761
'''

The isin() method:

df2 = df.copy()
df2["E"] = ["one","one","two","three","four","three"]
print(df2)
'''
                   A         B         C         D      E
2019-01-01 -0.432689  1.960850  0.079677  0.609651    one
2019-01-02  0.026600  0.081690  0.555260 -0.193917    one
2019-01-03  1.346473 -0.249037 -0.398267  1.376942    two
2019-01-04  1.631712 -1.757012 -0.386546 -0.215699  three
2019-01-05  0.802655 -0.033013  0.771480 -1.589764   four
2019-01-06  0.615043 -0.240700  0.678544 -0.838852  three
'''
ind = df2["E"].isin(["two","four"])
print(ind)
'''
2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
'''
print(df2[ind])
'''
                   A         B         C         D     E
2019-01-03  0.704706  0.123659  1.147022  0.104124   two
2019-01-05  0.065825  0.207168  1.425794 -0.267355  four
'''

5. Assignment

Adding a new column to a DataFrame:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range("20190101", periods=6))
print(s1)
'''
2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64
'''

df["E"] = s1
print(df)
'''
                   A         B         C         D  E
2019-01-01 -2.192860  1.744378 -0.671842  0.704741  1
2019-01-02  0.125302 -1.141235  1.145471  1.860608  2
2019-01-03  1.462714 -0.632829 -0.046127  0.379126  3
2019-01-04  1.745818 -0.688786  0.574567 -0.900502  4
2019-01-05  0.680510 -0.194625 -1.047654  1.482277  5
2019-01-06  1.627649 -0.205627 -1.003146  0.453174  6
'''

Modifying values:

df.loc["2019-01-01", "A"] = 0
df.iloc[0,1] = 0
df["D"] = np.array([5]*len(df)) # 可简化成df["D"] = 5   len(df)返回df的行数
print(df)
'''
                   A         B         C  D
2019-01-01  0.000000  0.000000  1.095675  5
2019-01-02 -2.028600  2.048896 -1.527212  5
2019-01-03  2.149004 -0.904068  0.471809  5
2019-01-04 -0.034528  2.151367 -0.219636  5
2019-01-05 -0.544008 -1.098587 -1.873869  5
2019-01-06 -1.547652 -2.084554 -0.701767  5
'''

Modifying the index and columns:

df.index = [i for i in range(len(df))]
df.columns = [i*10 for i in range(df.shape[1])]
print(df)
'''
         0         10        20        30
0 -0.942362  0.191228  0.891761 -0.520997
1 -1.330733 -0.462275 -0.711679  1.503393
2 -0.187491  1.461077  0.557227 -0.798765
3 -0.012331 -1.728701  0.018166  0.659837
4  0.518749  0.776088  2.482731 -0.020565
5  0.475219 -1.025717  1.293841  1.236391
'''

III. Numerical Operations and Statistical Analysis

1. Inspecting Data

import pandas as pd
import numpy as np

dates = pd.date_range(start="2019-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates,
                  columns=["A","B","C","D"])
print(df)
'''
                   A         B         C         D
2019-01-01 -1.061156  0.591245 -0.885117  1.123434
2019-01-02 -1.142466 -0.807766 -1.519887  0.051029
2019-01-03 -0.739533  1.907320 -1.359995  0.335202
2019-01-04 -0.290423 -1.784109 -1.033240  0.706024
2019-01-05  1.179959  0.660133  0.596361  0.384645
2019-01-06  1.093600 -0.395159 -0.799479 -0.308565
'''

(1) Viewing the first rows

print(df.head(2))   # head() shows the first 5 rows by default; here we ask for 2
'''
                   A         B         C         D
2019-01-01 -1.062086 -1.966453  0.638081  0.922812
2019-01-02  0.683613  1.363954  0.004098  1.308496
'''

(2) Viewing the last rows

print(df.tail(2))   # tail() shows the last 5 rows by default; here we ask for 2
'''
                   A         B         C         D
2019-01-05 -0.370315  0.187505 -0.272255  0.296648
2019-01-06  1.393871 -0.341858  0.361288  0.834284
'''

(3) Viewing overall information

df.iloc[0, 3] = np.nan  # set the value in row 1, column 4 (column D) to NaN
print(df)
'''
                   A         B         C         D
2019-01-01  0.529576 -0.582373  1.174552       NaN
2019-01-02  1.381525  2.005128 -0.084598 -0.680730
2019-01-03  0.634071 -0.421678 -0.695929  1.936779
2019-01-04 -0.146882  1.434341  0.553859 -0.452890
2019-01-05 -0.257330 -0.119174 -0.859402  0.163590
2019-01-06 -1.684116  0.372460  1.312178 -1.548088
'''
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       5 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes
'''

2. numpy Universal Functions Work on pandas Objects

(1) Vectorized operations

x = pd.DataFrame(np.arange(4).reshape(1,4))
print(x)
'''
   0  1  2  3
0  0  1  2  3
'''
print(x+5)
'''
   0  1  2  3
0  5  6  7  8
'''
print(np.exp(x))
'''
     0         1         2          3
0  1.0  2.718282  7.389056  20.085537
'''
x = pd.DataFrame(np.arange(4).reshape(1,4))
print(x)
'''
   0  1  2  3
0  0  1  2  3
'''
y = pd.DataFrame(np.arange(4,8).reshape(1,4))
print(y)
'''
   0  1  2  3
0  4  5  6  7
'''
print(x*y)
'''
   0  1   2   3
0  0  5  12  21
'''

(2) Matrix operations

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(5,5)))
print(x)
'''
   0  1  2  3  4
0  6  3  7  4  6
1  9  2  6  7  4
2  3  7  7  2  5
3  4  1  7  5  1
4  4  0  9  5  8
'''
print(x.dtypes)
'''
0    int32
1    int32
2    int32
3    int32
4    int32
dtype: object
'''
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(5,5)))
print(x)
'''
   0  1  2  3  4
0  6  3  7  4  6
1  9  2  6  7  4
2  3  7  7  2  5
3  4  1  7  5  1
4  4  0  9  5  8
'''
z = x.T     # transpose
print(z)
'''
   0  1  2  3  4
0  6  9  3  4  4
1  3  2  7  1  0
2  7  6  7  7  9
3  4  7  2  5  5
4  6  4  5  1  8
'''
print(x.dot(z))
'''
     0    1    2    3    4
0  146  154  126  102  155
1  154  186  117  119  157
2  126  117  136   83  125
3  102  119   83   92  112
4  155  157  125  112  186
'''
print(np.dot(x,z))
'''
[[146 154 126 102 155]
 [154 186 117 119 157]
 [126 117 136  83 125]
 [102 119  83  92 112]
 [155 157 125 112 186]]
'''

For the same operation, pure computation is generally faster in numpy: numpy is geared toward computation, while pandas is geared toward data handling.
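As a rough illustration, the same matrix product can be timed both ways; this is only a sketch (the sizes and repeat count are arbitrary, and actual timings depend on your machine):

import time

import numpy as np
import pandas as pd

a = np.random.random((500, 500))
df_a = pd.DataFrame(a)

t0 = time.perf_counter()
for _ in range(10):
    np.dot(a, a)             # pure numpy matrix product
t1 = time.perf_counter()
for _ in range(10):
    df_a.dot(df_a)           # the same product through pandas, which also aligns labels
t2 = time.perf_counter()

print("numpy :", t1 - t0)
print("pandas:", t2 - t1)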

(3) Broadcasting

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list("ABC"))
print(x)
'''
   A  B  C
0  6  3  7
1  4  6  9
2  2  6  7
'''

Broadcasting along rows:

print(x.iloc[0])
'''
A    6
B    3
C    7
Name: 0, dtype: int32
'''

print(x/x.iloc[0])
'''
          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000
'''

Broadcasting along columns:

print(x.A)
'''
0    6
1    4
2    2
Name: A, dtype: int32
'''

print(x.div(x.A,axis=0))    # divide every column by column A
'''
     A    B         C
0  1.0  0.5  1.166667
1  1.0  1.5  2.250000
2  1.0  3.0  3.500000
'''
print(x.iloc[0])
'''
A    6
B    3
C    7
Name: 0, dtype: int32
'''

print(x.div(x.iloc[0], axis=1)) # axis=1 is the default, i.e. divide each row by the given Series
'''
          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000
'''

3. Additional Features

(1) Index alignment

np.random.seed(42)
x = pd.DataFrame(np.random.randint(0,20,size=(2,2)), columns=list("AB"))
print(x)
'''
    A   B
0   6  19
1  14  10
'''
y = pd.DataFrame(np.random.randint(0,10,size=(3,3)), columns=list("ABC"))
print(y)
'''
   A  B  C
0  7  4  6
1  9  2  6
2  7  4  3
'''

pandas automatically aligns the indexes of the two objects; positions present in only one of them become np.nan.

print(x+y)
'''
      A     B   C
0  13.0  23.0 NaN
1  23.0  12.0 NaN
2   NaN   NaN NaN
'''

These missing positions can instead be filled via fill_value:

print(x.add(y, fill_value=0))
'''
      A     B    C
0  13.0  23.0  6.0
1  23.0  12.0  6.0
2   7.0   4.0  3.0
'''

(2) Statistics

Counting the distinct values in the data:

import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(42)
y = np.random.randint(3, size=10)
print(y)    # [2 0 2 2 0 0 2 1 2 2]

print(np.unique(y)) # [0 1 2]
print(Counter(y))   # Counter({2: 6, 0: 3, 1: 1})
y1 = pd.DataFrame(y,columns=["A"])
print(y1)
'''
   A
0  2
1  0
2  2
3  2
4  0
5  0
6  2
7  1
8  2
9  2
'''
print(np.unique(y1))    # [0 1 2]
print(y1["A"].value_counts())
'''
2    6
0    3
1    1
Name: A, dtype: int64
'''

Deriving a new column and sorting:

import pandas as pd
import numpy as np

population_dict = {
    "BeiJing":2154,
    "ShangHai":2424,
    "ShenZhen":1303,
    "HangZhou":981}
population = pd.Series(population_dict)
GDP_dict = {
    "BeiJing":30320,
    "ShangHai":32680,
    "ShenZhen":24222,
    "HangZhou":13468}
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population":population,"GDP":GDP})
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
print(city_info)
'''
          population    GDP    per_GDP
BeiJing         2154  30320  14.076137
ShangHai        2424  32680  13.481848
ShenZhen        1303  24222  18.589409
HangZhou         981  13468  13.728848
'''

① Ascending sort

print(city_info.sort_values(by="per_GDP"))
'''
          population    GDP    per_GDP
ShangHai        2424  32680  13.481848
HangZhou         981  13468  13.728848
BeiJing         2154  30320  14.076137
ShenZhen        1303  24222  18.589409
'''

② Descending sort

print(city_info.sort_values(by="per_GDP", ascending=False))
'''
          population    GDP    per_GDP
ShenZhen        1303  24222  18.589409
BeiJing         2154  30320  14.076137
HangZhou         981  13468  13.728848
ShangHai        2424  32680  13.481848
'''

③ Sorting by axis

data = pd.DataFrame(np.random.randint(20, size=(3,4)),
                    index=[2,1,0],columns=list("CBAD"))
print(data)
'''
    C   B   A   D
2   2   5  19  16
1  14  11   9   4
0   6  18   5  17
'''
print(data.sort_index())    # sort by row index
'''
    C   B   A   D
0   6  18   5  17
1  14  11   9   4
2   2   5  19  16
'''
print(data.sort_index(axis=1)) # sort by column index
'''
    A   B   C   D
2   3  15   1  14
1  10   7  18   6
0  15  13  11  14
'''
print(data.sort_index(axis=1, ascending=False))
'''
   D   C   B  A
2  3  10   9  6
1  5  11  15  5
0  5   7  16  2
'''

Statistical methods:

np.random.seed(10)
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),
                  columns=list("ABCD"))
print(df)
'''
          A         B         C          D
0  7.326346  4.861116 -4.181601   1.966465
1  4.485344 -0.880342  3.062046   2.434194
2  2.017166  1.301599  3.732105   6.812149
3 -1.860263  6.113096  2.914521   3.780550
4 -2.546409  2.540548  7.938148  -2.319220
5 -5.910913 -4.973489  3.064281  11.539869
'''
# count non-null values
print(df.count())
'''
A    6
B    6
C    6
D    6
'''
# sum
print(df.sum())
'''
A     3.511271
B     8.962527
C    16.529499
D    24.214008
dtype: float64
'''
print(df.sum(axis=1))
'''
0     9.972325
1     9.101242
2    13.863019
3    10.947905
4     5.613067
5     3.719748
dtype: float64
'''
# maximum and minimum
print(df.min()) # column-wise
'''
A   -5.910913
B   -4.973489
C   -4.181601
D   -2.319220
dtype: float64
'''
print(df.max(axis=1))   # row-wise
'''
0     7.326346
1     4.485344
2     6.812149
3     6.113096
4     7.938148
5    11.539869
dtype: float64
'''
print(df.idxmax())  # index label where each column's maximum occurs
'''
A    0
B    3
C    4
D    5
dtype: int64
'''
# mean
print(df.mean())
'''
A    0.585212
B    1.493755
C    2.754917
D    4.035668
dtype: float64
'''
# variance
print(df.var())
'''
A    24.138289
B    16.254343
C    15.230314
D    22.263578
dtype: float64
'''
# standard deviation
print(df.std())
'''
A    4.913073
B    4.031668
C    3.902604
D    4.718430
dtype: float64
'''
# median
print(df.median())
'''
A    0.078452
B    1.921073
C    3.063163
D    3.107372
dtype: float64
'''
# mode
data = pd.DataFrame(np.random.randint(5,size=(10,2)),
                    columns=list("AB"))
print(data)
'''
   A  B
0  2  0
1  3  4
2  2  0
3  1  2
4  0  0
5  3  1
6  3  4
7  1  4
8  2  0
9  0  4
'''
print(data.mode())
'''
   A  B
0  2  0
1  3  4
'''
print(df.quantile(0.75))    # 75th percentile
'''
A    3.868299
B    4.280974
C    3.565149
D    6.054250
Name: 0.75, dtype: float64
'''
print(df.describe())
'''
              A         B         C          D
count  6.000000  6.000000  6.000000   6.000000
mean   0.585212  1.493755  2.754917   4.035668
std    4.913073  4.031668  3.902604   4.718430
min   -5.910913 -4.973489 -4.181601  -2.319220
25%   -2.374872 -0.334857  2.951402   2.083397
50%    0.078452  1.921073  3.063163   3.107372
75%    3.868299  4.280974  3.565149   6.054250
max    7.326346  6.113096  7.938148  11.539869
'''
data2 = pd.DataFrame([
    ["a","a","c","d"],
    ["c","a","c","d"],
    ["a","a","d","c"]],
columns=list("ABCD"))
print(data2)
'''
   A  B  C  D
0  a  a  c  d
1  c  a  c  d
2  a  a  d  c
'''
print(data2.describe())
'''
        A  B  C  D
count   3  3  3  3
unique  2  1  2  2
top     a  a  c  d
freq    2  3  2  2
'''
'''
count  is the number of values in each column,
unique is the number of distinct values in each column,
top    is the most frequent value in each column,
freq   is how many times that most frequent value appears.
'''
# correlation coefficients
print(df.corr())
'''
          A         B         C         D
A  1.000000  0.409966 -0.655007 -0.383420
B  0.409966  1.000000 -0.255655 -0.631457
C -0.655007 -0.255655  1.000000 -0.152966
D -0.383420 -0.631457 -0.152966  1.000000
'''
print(df.corrwith(df["A"]))
'''
A    1.000000
B    0.409966
C   -0.655007
D   -0.383420
dtype: float64
'''

Custom output:
apply(method) applies the given method; by default it operates on each column.

np.random.seed(10)
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),
                  columns=list("ABCD"))
print(df)
'''
          A         B         C          D
0  7.326346  4.861116 -4.181601   1.966465
1  4.485344 -0.880342  3.062046   2.434194
2  2.017166  1.301599  3.732105   6.812149
3 -1.860263  6.113096  2.914521   3.780550
4 -2.546409  2.540548  7.938148  -2.319220
5 -5.910913 -4.973489  3.064281  11.539869
'''
print(df.apply(np.cumsum))  # cumulative sum down each column
'''
           A          B          C          D
0   7.326346   4.861116  -4.181601   1.966465
1  11.811690   3.980774  -1.119555   4.400659
2  13.828856   5.282373   2.612550  11.212808
3  11.968593  11.395469   5.527070  14.993359
4   9.422184  13.936017  13.465218  12.674139
5   3.511271   8.962527  16.529499  24.214008
'''
print(df.apply(np.cumsum, axis=1))  # cumulative sum across each row
'''
          A          B         C          D
0  7.326346  12.187462  8.005861   9.972325
1  4.485344   3.605002  6.667048   9.101242
2  2.017166   3.318765  7.050870  13.863019
3 -1.860263   4.252834  7.167354  10.947905
4 -2.546409  -0.005861  7.932287   5.613067
5 -5.910913 -10.884402 -7.820122   3.719748
'''
print(df.apply(sum))
'''
A     3.511271
B     8.962527
C    16.529499
D    24.214008
dtype: float64
'''
print(df.apply(lambda x: x.max()-x.min()))
'''
A    13.237259
B    11.086585
C    12.119749
D    13.859089
dtype: float64
'''
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(),
                      x.idxmin(), x.std()],
                     index=["Count", "mean", "max", "idxmin", "std"])
print(df.apply(my_describe))
'''
               A         B         C          D
Count   6.000000  6.000000  6.000000   6.000000
mean    0.585212  1.493755  2.754917   4.035668
max     7.326346  6.113096  7.938148  11.539869
idxmin  5.000000  5.000000  0.000000   4.000000
std     4.913073  4.031668  3.902604   4.718430
'''

IV. Handling Missing Values

1. Detecting Missing Values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]),
                    columns=["A", "B", "C"])
print(data)
'''
     A    B     C
0    1  NaN     2
1  NaN    3     4
2    5    6  None
'''

Note: when the data contains None, strings, and the like, every column's dtype becomes object, which consumes more resources than int or float.

print(data.dtypes)
'''
A    object
B    object
C    object
dtype: object
'''
print(data.isnull())
'''
       A      B      C
0  False   True  False
1   True  False  False
2  False  False   True
'''
print(data.notnull())
'''
       A      B      C
0   True  False   True
1  False   True   True
2   True   True  False
'''

2. Dropping Missing Values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 3, 4, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]),
                    columns=["A", "B", "C", "D"])
print(data)
'''
      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   3.0   4.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0
'''

Note: np.nan is a special floating-point value.

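A quick standalone check:

import numpy as np

print(type(np.nan))        # <class 'float'>
print(np.nan == np.nan)    # False: NaN compares unequal even to itself
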
print(data.dtypes)
'''
A    float64
B    float64
C    float64
D    float64
dtype: object
'''

(1) Dropping whole rows

print(data.dropna())
'''
      A     B     C     D
3  10.0  11.0  12.0  13.0
'''

(2) Dropping whole columns

print(data.dropna(axis=1))
'''
      D
0   3.0
1   6.0
2   9.0
3  13.0
'''
data["D"] = np.nan
print(data)
'''
      A     B     C   D
0   1.0   NaN   2.0 NaN
1   NaN   3.0   4.0 NaN
2   7.0   8.0   NaN NaN
3  10.0  11.0  12.0 NaN
'''
print(data.dropna(axis=1, how="all"))
'''
      A     B     C
0   1.0   NaN   2.0
1   NaN   3.0   4.0
2   7.0   8.0   NaN
3  10.0  11.0  12.0
'''
data.loc[3] = np.nan
print(data)
'''
     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  3.0  4.0 NaN
2  7.0  8.0  NaN NaN
3  NaN  NaN  NaN NaN
'''
print(data.dropna(how="all"))
'''
     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  3.0  4.0 NaN
2  7.0  8.0  NaN NaN
'''

3. Filling Missing Values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 3, 4, 6],
                              [7, 8, np.nan, 9],
                              [10, 11, 12, 13]]),
                    columns=["A", "B", "C", "D"])
print(data)
'''
      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   3.0   4.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0
'''

print(data.fillna(value=5))
'''
      A     B     C     D
0   1.0   5.0   2.0   3.0
1   5.0   3.0   4.0   6.0
2   7.0   8.0   5.0   9.0
3  10.0  11.0  12.0  13.0
'''

Filling with the mean:

print(data.fillna(value=data.mean()))   # fill each column with that column's mean
'''
      A          B     C     D
0   1.0   7.333333   2.0   3.0
1   6.0   3.000000   4.0   6.0
2   7.0   8.000000   6.0   9.0
3  10.0  11.000000  12.0  13.0
'''
print(data.fillna(value=data.stack().mean()))   # fill with the mean of all non-null values in the DataFrame
'''
           A          B          C     D
0   1.000000   6.846154   2.000000   3.0
1   6.846154   3.000000   4.000000   6.0
2   7.000000   8.000000   6.846154   9.0
3  10.000000  11.000000  12.000000  13.0
'''

V. Merging Data

First, a helper function that generates a DataFrame:

import pandas as pd

def make_df(cols, ind):
    data = {c: [str(c)+str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

print(make_df("ABC", range(3)))
'''
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
'''

1. Vertical Concatenation

df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
'''
    A   B
1  A1  B1
2  A2  B2
'''
print(df_2)
'''
    A   B
3  A3  B3
4  A4  B4
'''
print(pd.concat([df_1, df_2]))
'''
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
'''

2. Horizontal Concatenation

df_3 = make_df("AB", [0,1])
df_4 = make_df("CD", [0,1])
print(df_3)
'''
    A   B
0  A0  B0
1  A1  B1
'''
print(df_4)
'''
    C   D
0  C0  D0
1  C1  D1
'''
print(pd.concat([df_3, df_4], axis=1))
'''
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
'''

3. Overlapping Indexes

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
'''
    A   B
1  A1  B1
2  A2  B2
'''
print(df_6)
'''
    A   B
1  A1  B1
2  A2  B2
'''
print(pd.concat([df_5, df_6]))
'''
    A   B
1  A1  B1
2  A2  B2
1  A1  B1
2  A2  B2
'''
print(pd.concat([df_5, df_6], ignore_index=True))
'''
    A   B
0  A1  B1
1  A2  B2
2  A1  B1
3  A2  B2
'''

4. Aligned Merging with merge()

df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
'''
    A   B
1  A1  B1
2  A2  B2
'''
print(df_10)
'''
    B   C
1  B1  C1
2  B2  C2
'''
print(pd.merge(df_9, df_10))
'''
    A   B   C
0  A1  B1  C1
1  A2  B2  C2
'''

5. Example: Merging City Information

import pandas as pd

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981,1303)}
population = pd.DataFrame(population_dict)
print(population)
'''
       city   pop
0   BeiJing  2154
1  HangZhou   981
2  ShenZhen  1303
'''
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
print(GDP)
'''
       city    GDP
0   BeiJing  30320
1  ShangHai  32680
2  HangZhou  13468
'''
city_info = pd.merge(population, GDP)
print(city_info)
'''
       city   pop    GDP
0   BeiJing  2154  30320
1  HangZhou   981  13468
'''
city_info = pd.merge(population, GDP, how="outer")  # outer join keeps the union of keys; the default ("inner") keeps the intersection
print(city_info)
'''
       city     pop      GDP
0   BeiJing  2154.0  30320.0
1  HangZhou   981.0  13468.0
2  ShenZhen  1303.0      NaN
3  ShangHai     NaN  32680.0
'''

VI. Grouping and Pivot Tables

import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.DataFrame({"key":["A", "B", "C", "A", "B", "C"],
                   "data1":range(6),
                   "data2":np.random.randint(0, 10, size=6)})
print(df)
'''
  key  data1  data2
0   A      0      9
1   B      1      4
2   C      2      0
3   A      3      1
4   B      4      9
5   C      5      0
'''

1. Grouping

(1) Lazy evaluation

print(df.groupby("key"))
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E5DB95F610>

print(df.groupby("key").sum())
'''
     data1  data2
key              
A        3     10
B        5     13
C        7      0
'''

(2) Selecting a column

print(df.groupby("key")["data2"].sum())
'''
key
A    10
B    13
C     0
Name: data2, dtype: int32
'''

(3) Iterating over groups

for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
'''
A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)
'''

(4) Calling methods

print(df.groupby("key")["data1"].describe())
'''
     count  mean      std  min   25%  50%   75%  max
key                                                 
A      2.0   1.5  2.12132  0.0  0.75  1.5  2.25  3.0
B      2.0   2.5  2.12132  1.0  1.75  2.5  3.25  4.0
C      2.0   3.5  2.12132  2.0  2.75  3.5  4.25  5.0
'''

(5) More complex aggregations

print(df.groupby("key").aggregate(["min", "median", "max"]))
'''
    data1            data2           
      min median max   min median max
key                                  
A       0    1.5   3     1    5.0   9
B       1    2.5   4     4    6.5   9
C       2    3.5   5     0    0.0   0
'''

(6) Filtering

def filter_func(x):
    return x["data2"].std() > 3

print(df.groupby("key")["data2"].std())
'''
key
A    5.656854
B    3.535534
C    0.000000
Name: data2, dtype: float64
'''
print(df.groupby("key").filter(filter_func))
'''
  key  data1  data2
0   A      0      9
1   B      1      4
3   A      3      1
4   B      4      9
'''

(7) Transformation

print(df.groupby("key").transform(lambda x: x-x.mean()))
'''
   data1  data2
0   -1.5    4.0
1   -1.5   -2.5
2   -1.5    0.0
3    1.5   -4.0
4    1.5    2.5
5    1.5    0.0
'''

(8) The apply() method

def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x

print(
    df.groupby("key").apply(norm_by_data2)
)
'''
  key     data1  data2
0   A  0.000000      9
1   B  0.076923      4
2   C       inf      0
3   A  0.300000      1
4   B  0.307692      9
5   C       inf      0
'''

(9) Using a list or array as the grouping key

L = [0, 1, 0, 1, 2, 0]
print(df.groupby(L).sum())
'''
   data1  data2
0      7      9
1      4      5
2      4      9
'''

(10) Mapping the index to groups with a dict

df2 = df.set_index("key")
print(df2)
'''
     data1  data2
key              
A        0      9
B        1      4
C        2      0
A        3      1
B        4      9
C        5      0
'''
mapping = {"A": "first", "B": "constant", "C": "constant"}

print(df2.groupby(mapping).sum())
'''
          data1  data2
key                   
constant     12     13
first         3     10
'''

(11) Any Python function

print(
    df2.groupby(str.lower).mean()
)
'''
     data1  data2
key              
a      1.5    5.0
b      2.5    6.5
c      3.5    0.0
'''

(12) A list combining any of the above

mapping = {"A": "first", "B": "constant", "C": "constant"}

print(
    df2.groupby([str.lower, mapping]).mean()
)
'''
              data1  data2
key key                   
a   first       1.5    5.0
b   constant    2.5    6.5
c   constant    3.5    0.0
'''

(13) Example: processing planet-observation data

import seaborn as sns

planets = sns.load_dataset("planets")

# print(planets)
# print(planets.shape)
# print(planets.head())
# print(planets.describe())

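# bucket each discovery year into its decade and turn it into a string label, e.g. 1995 -> "1990s"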
decade = 10*(planets["year"]//10)
decade = decade.astype(str) + "s"
decade.name = "decade"
print(decade.head())

# print(planets.groupby(["method", decade]).sum())
print(planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0))

2. Pivot Tables

import seaborn as sns

titanic = sns.load_dataset("titanic")
# print(titanic.head())
# print(titanic.describe())
# print(titanic.groupby("sex")[["survived"]].mean())
'''
        survived
sex             
female  0.742038
male    0.188908
'''
# print(titanic.groupby("sex")["survived"].mean())
'''
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
'''
# print(
#     titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
# )
'''
class      First    Second     Third
sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
'''
# pivot table
# print(
#     titanic.pivot_table("survived", index="sex", columns="class",
#                         aggfunc="mean", margins=True)
# )
'''
class      First    Second     Third       All
sex                                           
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
All     0.629630  0.472826  0.242363  0.383838
'''
print(
    titanic.pivot_table(index="sex", columns="class",
                        aggfunc={"survived":sum, "fare":"mean"})
)
'''
              fare                       survived             
class        First     Second      Third    First Second Third
sex                                                           
female  106.125798  21.970121  16.118810       91     70    72
male     67.226127  19.741782  12.661633       45     17    47
'''

VII. Multi-level Indexes (often used for higher-dimensional data)

import pandas as pd
import numpy as np

base_data = np.array([
    [1771, 11115],
    [2154, 30320],
    [2141, 14070],
    [2424, 32680],
    [1077, 7806],
    [1303, 24222],
    [798, 4789],
    [981, 13468]
])

data = pd.DataFrame(base_data, index=[["BeiJing", "BeiJing", "ShangHai", "ShangHai","ShenZhen", "ShenZhen", "HangZhou", "HangZhou"],
                                      [2008, 2018] * 4], columns=["population", "GDP"])
data.index.names = ["city", "year"]
print(data)
'''
               population    GDP
city     year                   
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
'''
print(data["GDP"])
'''
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
'''
print(data.loc["ShangHai", "GDP"])
'''
year
2008    14070
2018    32680
Name: GDP, dtype: int32
'''
print(data.loc["ShangHai", 2018]["GDP"])    # 32680

VIII. High-Performance pandas

1. Using eval() and query()

They avoid allocating memory for the intermediate results of compound algebraic expressions.

import pandas as pd
import numpy as np

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
print(np.allclose((df1+df2)/(df3+df4),pd.eval("(df1+df2)/(df3+df4)")))  # True

query() is used like eval(), but it filters a DataFrame's rows with a boolean expression; see the sketch below.
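A small sketch of the DataFrame-level forms, DataFrame.eval() and DataFrame.query(), with made-up column names:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((5, 3)), columns=["A", "B", "C"])

print(df.eval("A + B - C"))   # evaluate an expression over the columns by name
print(df.query("A < B"))      # keep only the rows where the boolean expression is True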

2. When to Use eval() and query()

With small arrays, ordinary expressions are faster; eval() and query() pay off on large arrays.

# memory occupied by all of df1's elements, in bytes
print(df1.values.nbytes)    # 8000000