[Python Core Techniques in Practice] An introduction to the pandas.DataFrame() function - CFANZ Programming Community

An introduction to the pandas.DataFrame() function!

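Contents

1. Creating a DataFrame

1.1. Creating from NumPy
1.2. Creating directly
1.3. Creating from a dict

2. DataFrame attributes

2.1. Checking column dtypes
2.2. Viewing the head and tail of a DataFrame
2.3. Viewing row and column labels
2.4. Viewing the data with .values
2.5. Viewing the number of rows and columns
2.6. Slicing and indexing

3. DataFrame operations

3.1. Transpose
3.2. Descriptive statistics
3.3. Arithmetic: sum, scalar multiplication, and squaring
3.4. Adding a column
3.5. Combining two DataFrames
3.6. Dropping duplicate rows
3.7. Dropping missing values with dropna()

3. Time series handling in pandas
4. The Python assert keyword
5. Feature extraction with pd.get_dummies()

5.1. one-hot encoding
5.2. dummy encoding
5.3. One-hot encoding in pandas

6. References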

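A DataFrame is a data structure in the pandas library for Python. Like an Excel sheet, it is a two-dimensional table whose cells can hold numbers, strings, and other values. A DataFrame can also carry column labels (columns) and row labels (index).

1. Creating a DataFrame

1.1. Creating from NumPy

Create a DataFrame from a NumPy array. The index and columns parameters are both optional, so you can leave them unset, and the two lists are allowed to be identical.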

import pandas as pd
import numpy as np

print(list("abc"))
df1 = pd.DataFrame(np.random.randn(3, 3), index=list("abc"), columns=list("ABC"))
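
As a small sketch of the point that index and columns are optional: when they are omitted, pandas falls back to integer labels starting at 0. The names df_default, df_same, and labels are illustrative only and do not appear elsewhere in this article.

```python
import numpy as np
import pandas as pd

values = np.random.randn(3, 3)

# No index/columns given: pandas uses the default integer labels 0, 1, 2.
df_default = pd.DataFrame(values)

# The same list may be used for both the row labels and the column labels.
labels = list("abc")
df_same = pd.DataFrame(values, index=labels, columns=labels)

print(df_default.index.tolist(), df_default.columns.tolist())
print(df_same.index.tolist(), df_same.columns.tolist())
```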

1.2. Creating directly

Create one directly from a nested list:

df2 = pd.DataFrame([[1, 2, 3],
                    [2, 3, 4],
                    [3, 4, 5]], index=list("abc"), columns=list("ABC"))


1.3. Creating from a dict

Create one from a dict, where each key becomes a column name:

import pandas as pd
import numpy as np

dict1 = {"name":["张三", "李四", "王二"],
         "age":[22, 44, 35],
         "gender":["男", "女", "男"]}
df3 = pd.DataFrame(dict1)


2. DataFrame attributes

2.1. Checking column dtypes

df3.dtypes


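2.2. Viewing the head and tail of a DataFrame

head shows the first few rows; the default is 5, and you can pass a different number.
tail shows the last few rows; the default is also 5, and it takes the same argument.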

import numpy as np
import pandas as pd
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=list("abcdef"), columns=list("ABCD"))
# df.head()
df.head(2)
# df.tail() 
df.tail(2)


2.3. Viewing row and column labels

df.index
df.columns


2.4. Viewing the data with .values

The values attribute returns the data in a DataFrame as a NumPy ndarray.

df.values

For example, to view all the values in one column:

df['B'].values

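To view all the values in one row, use iloc, which indexes by integer position (the row number) rather than by label. It is not limited to rows: df.iloc[:, 1] selects the second column.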

df.iloc[0]
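
A minimal sketch of position-based access with iloc, for rows, columns, and single cells. It builds its own frame, df_demo (a name made up for this example), from np.arange so that the values are predictable.

```python
import numpy as np
import pandas as pd

# 6 rows x 4 columns filled with 0..23, so every value is easy to check by eye.
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                       index=list("abcdef"), columns=list("ABCD"))

first_row = df_demo.iloc[0]       # first row, as a Series indexed by column name
second_col = df_demo.iloc[:, 1]   # second column, the same data as df_demo["B"]
one_cell = df_demo.iloc[2, 3]     # row position 2, column position 3
row_by_label = df_demo.loc["a"]   # label-based counterpart of iloc[0]

print(first_row.tolist())
print(second_col.tolist())
print(one_cell)
```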


2.5. Viewing the number of rows and columns

df.shape[0]  # number of rows
df.shape[1]  # number of columns
df.shape     # (rows, columns) tuple

2.6. Slicing and indexing

Use a colon to slice.
With the [] operator, a slice selects rows.
With the [] operator, a single label (or a list of labels) selects columns.
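
The rows-versus-columns behaviour of [] can be sketched as follows. The frame df_demo and the result names are assumptions made for this example, not part of the original article.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                       index=list("abcdef"), columns=list("ABCD"))

rows_by_position = df_demo[0:2]     # integer slice: rows at positions 0 and 1
rows_by_label = df_demo["a":"c"]    # label slice: rows a to c, end label included
one_column = df_demo["A"]           # single label: one column, returned as a Series
two_columns = df_demo[["A", "C"]]   # list of labels: several columns, as a DataFrame

print(rows_by_position.index.tolist())
print(rows_by_label.index.tolist())
print(two_columns.columns.tolist())
```

For anything more involved than this, loc (labels) and iloc (positions) make the intent explicit.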

3. DataFrame operations

3.1. Transpose

Just use the .T attribute, exactly as in linear algebra: rows become columns and columns become rows.
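
A short sketch of the transpose on a 2 x 3 frame; df_demo and df_t are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3],
                        [4, 5, 6]], index=["r1", "r2"], columns=["A", "B", "C"])

df_t = df_demo.T   # row labels and column labels swap places

print(df_demo.shape, "->", df_t.shape)
print(df_t)
```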

3.2. Descriptive statistics

df.describe()


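When a DataFrame mixes numeric and non-numeric columns, describe() skips the non-numeric ones by default.
To get descriptive statistics per row, transpose first and then call describe().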
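
Both remarks can be tried on a small mixed frame. This is only a sketch: df_mixed and its columns are invented for the example, and include="all" is the describe() option that also reports non-numeric columns.

```python
import pandas as pd

df_mixed = pd.DataFrame({"name": ["x", "y", "z"],
                         "age": [22, 44, 35],
                         "score": [1.5, 2.5, 3.5]})

numeric_stats = df_mixed.describe()           # only "age" and "score" are summarised
all_stats = df_mixed.describe(include="all")  # "name" is reported too (count, unique, top, freq)

# Per-row statistics: keep the numeric columns, transpose, then describe.
row_stats = df_mixed[["age", "score"]].T.describe()

print(numeric_stats.columns.tolist())
print(all_stats.columns.tolist())
print(row_stats.loc["mean"].tolist())
```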

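3.3. Arithmetic: sum, scalar multiplication, and squaring

sum() adds up each column by default; sum(1), equivalent to sum(axis=1), adds up each row.

df.sum()   # sum of each column
df.sum(1)  # sum of each row (same as df.sum(axis=1))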


Scalar multiplication can be done with apply (df * 2 gives the same result more directly).

df.apply(lambda x: x*2)


Squaring is element-wise and uses the ** operator directly, much like MATLAB's .^ operator.

df**2


3.4. Adding a column

You can add a column the same way you add a key to a dict: assign a list to a new column name. The list must have the same length as the index.
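
A minimal sketch of dict-style column assignment, including what happens when the list length does not match. The frame and column names are made up for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3]}, index=list("abc"))

df_demo["B"] = [10, 20, 30]       # list length equals the number of rows
df_demo["C"] = df_demo["A"] * 2   # a column derived from an existing one

length_error = None
try:
    df_demo["D"] = [1, 2]         # only 2 values for 3 rows
except ValueError as exc:
    length_error = exc            # pandas refuses the assignment

print(df_demo)
print("ValueError:", length_error)
```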

3.5. Combining two DataFrames

join combines two DataFrames by aligning their row labels (the index). By default it is a left join: the result keeps the index of the DataFrame on which join is called.

join also takes a how parameter that selects the intersection or the union of the two indexes: 'inner' gives the intersection and 'outer' gives the union.
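
The three behaviours can be compared on two frames whose indexes only partly overlap. This is a sketch; left, right, and the result names are assumptions made for illustration.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]}, index=list("abc"))
right = pd.DataFrame({"y": [10, 20, 30]}, index=list("bcd"))

default_join = left.join(right)              # left join: rows a, b, c; y is NaN for a
inner_join = left.join(right, how="inner")   # intersection of the indexes: b, c
outer_join = left.join(right, how="outer")   # union of the indexes: a, b, c, d

print(default_join)
print(inner_join)
print(outer_join)
```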

To combine more than two DataFrames, put them in a list and pass the list to concat, which returns a new DataFrame.

df10 = pd.DataFrame([1, 2, 3, 4, 5, 6], 
          index=list('ABCDEF'), columns=['a'])
df11 = pd.DataFrame([10, 20, 30, 40, 50, 60],
                    index=list('ABCDEF'), columns=['b'])
df12 = pd.DataFrame([100, 200, 300, 400, 500, 600],
                    index=list('ABCDEF'), columns=['c'])
list1 = [df10.T, df11.T, df12.T]
df13 = pd.concat(list1)
df13


3.6. Dropping duplicate rows

df.drop_duplicates(subset=None,
                   keep='first',
                   inplace=False
                   )

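subset: the columns to consider when looking for duplicates; the default None uses all columns.
keep: which of the duplicated rows to keep, one of {'first', 'last', False}, default 'first'. With False, every row that has a duplicate is dropped.
inplace: whether to modify the original df instead of returning a new one.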

df14 = pd.DataFrame(data=[[1, 2, 3],
                          [1, 2, 4],
                          [1, 2, 4],
                          [1, 2, 3],
                          [1, 2, 5],
                          [1, 2, 5]],
                    index=list('ABCDEF'),
                    columns=['a', 'b', 'c'])

Drop duplicate rows, keeping the last row of each group of duplicates:

df14.drop_duplicates(keep='last')

Drop rows whose value in column 'c' has already appeared, keeping the first occurrence:

df14.drop_duplicates(subset=('c',))
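
To see which rows survive each call, the sketch below rebuilds the same six-row data under the assumed name df_dup and prints the remaining row labels.

```python
import pandas as pd

df_dup = pd.DataFrame([[1, 2, 3], [1, 2, 4], [1, 2, 4],
                       [1, 2, 3], [1, 2, 5], [1, 2, 5]],
                      index=list("ABCDEF"), columns=["a", "b", "c"])

keep_first = df_dup.drop_duplicates()               # default keep='first'
keep_last = df_dup.drop_duplicates(keep="last")     # last row of each duplicate group
keep_none = df_dup.drop_duplicates(keep=False)      # every row here has a duplicate
by_column_c = df_dup.drop_duplicates(subset=["c"])  # compare on column 'c' only

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
print(by_column_c.index.tolist())
```

Because columns 'a' and 'b' are constant in this data, comparing on 'c' alone gives the same rows as comparing on all columns.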


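3.7. Dropping missing values with dropna()

The dropna() method finds the null (missing) values in a DataFrame, removes the rows or columns that contain them, and returns the result as a new DataFrame.

Signature: dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters:
axis: 0 or 'index' drops rows; 1 or 'columns' drops columns.
how: the filter rule. 'any' drops a row/column that contains at least one null value; 'all' drops a row/column only when every value in it is null.
thresh: minimum number of non-null values, an int, default None. A row/column with fewer non-null values than this is dropped. Recent pandas versions do not allow how and thresh to be set in the same call.
subset: a list of labels that limits where nulls are looked for. With axis=0 or 'index', the entries are column labels; with axis=1 or 'columns', they are row labels. Only the region selected by subset is checked when deciding whether to drop a row/column.
inplace: boolean, default False. If True, the operation is applied to the original DataFrame and the return value is None.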
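
Before the larger example, here is a small sketch of how the parameters interact on a three-row frame. The frame df_na and the result names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 misses one value, row 2 is entirely empty.
df_na = pd.DataFrame({"A": [1.0, np.nan, np.nan],
                      "B": [2.0, 5.0, np.nan],
                      "C": [3.0, 6.0, np.nan]})

any_rows = df_na.dropna()                  # how='any' (default): only row 0 survives
all_rows = df_na.dropna(how="all")         # only the fully empty row 2 is dropped
thresh_rows = df_na.dropna(thresh=2)       # keep rows with at least 2 non-null values
subset_rows = df_na.dropna(subset=["B"])   # look for nulls in column B only
any_cols = df_na.dropna(axis=1)            # every column contains a null, so none is left

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(thresh_rows.index.tolist())
print(subset_rows.index.tolist())
print(any_cols.shape)
```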

# !/usr/bin/env python
# -*- encoding: utf-8 -*-
"""=====================================
@author : kaifang zhang
@time   : 2021/12/28 11:45 AM
@contact: kaifang.zkf@dtwave-inc.com
====================================="""
import pandas as pd

data = [[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, '欢迎使用', None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, '薪酬绩效数据自助查询系统', None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, '最新薪资月的薪酬绩效数据', None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, '薪酬绩效明细', None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, '当前查询月份：', None, 44197, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, '所在大区：', None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, '请输入：', '系统号', None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, '请输入：', '身份证后六位', None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None],
        [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]]
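df_data = pd.DataFrame(data)
print(df_data.shape)  # (30, 16): the raw sheet

# Drop the rows in which every cell is empty.
df_data.dropna(axis=0, how='all', inplace=True)
print(df_data.shape)  # (8, 16)

# Drop the columns in which every cell is empty.
df_data.dropna(axis=1, how='all', inplace=True)
print(df_data.shape)  # (8, 4)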


3. Time series handling in pandas

A complete guide to time series handling in pandas: https://www.cnhackhy.com/27337.htm
Preprocessing time series data in Python, explained in one article: https://z.itpub.net/article/detail/DC361B898CC85AF1172D9BD09D4236FB

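4. The Python assert keyword

Test whether a condition evaluates to True:

x = "hello"

# If the condition is True, nothing happens:
assert x == "hello"

# If the condition is False, an AssertionError is raised:
assert x == "goodbye"

The assert keyword is used when debugging code. It tests whether a condition is True; if it is not, the program raises an AssertionError. You can also supply a message to show when the condition is False, as in this example:

x = "hello"

# If the condition is False, an AssertionError is raised with the message:
assert x == "goodbye", "x should be 'hello'"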
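
To watch the message arrive without stopping the script, the failing assert can be wrapped in try/except. This is only a sketch, and check_greeting is a hypothetical helper written for this example.

```python
def check_greeting(x):
    """Return "ok" if the assertion passes, otherwise the text of the error."""
    try:
        # The second operand becomes the message of the AssertionError.
        assert x == "hello", "x should be 'hello'"
    except AssertionError as exc:
        return f"AssertionError: {exc}"
    return "ok"


print(check_greeting("hello"))
print(check_greeting("goodbye"))
```

Note that assert statements are skipped entirely when Python runs with the -O option, so they suit debugging checks rather than validation of user input.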


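5. Feature extraction with pd.get_dummies()

5.1. one-hot encoding

The basic idea of one-hot encoding is to treat every value of a discrete feature as a state. If the feature has N distinct values, it is represented by N states, and the encoding guarantees that each value activates exactly one of them: one state bit is 1 and all the others are 0. Take education level as an example, with five categories: primary school, secondary school, university, master's, and PhD. One-hot encoding gives:

primary school -> [1,0,0,0,0]
secondary school -> [0,1,0,0,0]
university -> [0,0,1,0,0]
master's -> [0,0,0,1,0]
PhD -> [0,0,0,0,1]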
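
The education example can be reproduced with pd.get_dummies. This is a sketch under a few assumptions: the English labels and the three sample values are invented, the column is declared as a Categorical so that all five levels get a column in the stated order, and dtype=int is passed so the result holds 0/1 rather than booleans.

```python
import pandas as pd

levels = ["primary", "secondary", "university", "master", "phd"]

# Three sample people; the Categorical carries the full list of possible levels.
education = pd.Series(pd.Categorical(["university", "primary", "phd"],
                                     categories=levels))

one_hot = pd.get_dummies(education, dtype=int)

print(one_hot)
print(one_hot.sum(axis=1).tolist())   # exactly one active state per row
```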

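5.2. dummy encoding

Dummy encoding can be understood as dropping one of the state bits, chosen arbitrarily. In the example above, four state bits are enough to carry the information of all five categories: the first four bits [0,0,0,0] alone represent PhD. A sample that is not primary school, not secondary school, not university, and not master's must be PhD by elimination. (In real life the person could of course be in kindergarten, but not within our sample.) So dummy encoding represents the five categories as:

primary school -> [1,0,0,0]
secondary school -> [0,1,0,0]
university -> [0,0,1,0]
master's -> [0,0,0,1]
PhD -> [0,0,0,0]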
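
A sketch of the same idea in pandas, with invented English labels. Dropping the last one-hot column reproduces the encoding described above, where PhD becomes all zeros. The built-in drop_first=True option of pd.get_dummies is the ready-made variant, but it removes the first level instead, so primary school becomes the all-zero row.

```python
import pandas as pd

levels = ["primary", "secondary", "university", "master", "phd"]
education = pd.Series(pd.Categorical(levels, categories=levels))  # one sample per level

one_hot = pd.get_dummies(education, dtype=int)

# Drop the last state bit ("phd"), as in the text.
dummy = one_hot.iloc[:, :-1]

# pandas' built-in variant drops the first level ("primary") instead.
dummy_first = pd.get_dummies(education, drop_first=True, dtype=int)

print(dummy)
print(dummy_first)
```

Either choice carries the same information; which level is dropped only decides which category serves as the all-zero baseline.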