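Chapter 1 Setting Up the Lab Environment
This chapter introduces Anaconda and Jupyter Notebook: how to install Anaconda on Windows, Mac, and Linux, and how to start and use Jupyter Notebook.
1-1 Introductory video
Data science and machine learning
Data science workflow
Course outline:
- Chapter 1: Setting Up the Lab Environment
- Chapter 2: Getting Started with Numpy
- Chapter 3: Getting Started with Pandas
- Chapter 4: Working with Data in Pandas
- Chapter 5: Plotting and Visualization with Matplotlib
- Chapter 6: Plotting and Visualization with Seaborn
- Chapter 7: Hands-on Data Analysis Project
- Chapter 8: Summary
Intended audience:
- Able to study independently and work hands-on
- Has basic Python knowledge
- Wants to work in data analysis or machine learning in the future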
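1-2 Introduction to Anaconda and Jupyter Notebook
Anaconda/Jupyter Notebook: Open Data Science Platform
What is Anaconda?
- The best-known Python data science platform
- 750+ popular Python and R packages
- Cross-platform: Windows, Mac, Linux
- conda: an extensible package manager
- Free to distribute
- A very active community
Installing Anaconda
Download links
- Current: https://www.anaconda.com/products/individual
- Previous: https://www.anaconda.com/download/
Check that the installation is correct: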
cd ~/anaconda
bin/conda --version
conda 4.3.21
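Conda: package and environment management
- Install packages
- Update packages
- Create sandboxes: conda environments
Managing environments with conda
Create a new environment
conda create --name python34 python=3.4
Activate an environment
activate python34 # for Windows
source activate python34 # for Linux & Mac
# conda 4.4 and later: conda activate python34 (all platforms)
Deactivate the current environment (no environment name is needed)
deactivate # for Windows
source deactivate # for Linux & Mac
# conda 4.4 and later: conda deactivate (all platforms)
Remove an environment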
conda remove --name python34 --all
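After activating an environment it can be useful to confirm, from inside Python, which interpreter is actually running. The following is a minimal sketch using only the standard library; it assumes that conda exports the CONDA_DEFAULT_ENV variable on activation and falls back to a placeholder when the variable is absent.

```python
import os
import sys

# sys.prefix is the root directory of the running interpreter;
# inside an activated conda environment it points into that environment.
env_root = sys.prefix

# Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated.
env_name = os.environ.get("CONDA_DEFAULT_ENV", "(not set)")

python_version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)

print("environment root:", env_root)
print("conda environment name:", env_name)
print("Python version:", python_version)
```

Running this in a notebook cell shows whether the notebook kernel uses the environment that was just created.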
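Managing packages with conda
Conda's package management is similar to pip
Install a Python package
conda install numpy
List the installed Python packages
conda list
conda list -n python34 # list the packages installed in a specific environment
Remove a Python package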
conda remove --name python34 numpy
Data Science IDE vs Developer IDE
Data Science IDEs in Anaconda
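From IPython to Jupyter
What is IPython?
- A powerful interactive shell
- The kernel used by Jupyter
- Supports interactive data analysis and visualization
IPython kernel
- Mainly responsible for running user code
- Communicates with the IPython shell through stdin/stdout
- Communicates with the notebook through JSON messages over ZeroMQ
What is Jupyter Notebook?
- Formerly IPython Notebook
- An open-source web application
- Creates and shares documents that contain code, visualizations, and notes
- Useful for statistics, data analysis, modeling, machine learning, and related fields
Interaction between the notebook and the kernel
- The Notebook server is at the core
- The Notebook server loads and saves notebooks
Notebook file format (.ipynb)
- A JSON-based format defined by IPython Notebook
- Can read online data and CSV/XLS files
- Can be converted to other formats (py, html, pdf, md, etc.)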
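Because an .ipynb file is plain JSON, it can be inspected with the standard json module. The sketch below builds a small notebook-like dictionary by hand; the two cells and the version numbers are illustrative assumptions rather than content from the course, and real notebooks carry more metadata.

```python
import json

# A hand-written, minimal notebook in the version-4 layout (illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

text = json.dumps(notebook, indent=1)   # the text that would be stored in a .ipynb file
loaded = json.loads(text)               # reading it back is ordinary JSON parsing

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

The same json.load call works on a notebook saved by Jupyter, which makes it easy to list or extract cells without opening the browser.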
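NBViewer
- An online tool for displaying notebooks in ipynb format
- Notebooks can be shared by URL
- GitHub has NBViewer integrated
- Converters make it easy to embed notebooks in blogs, emails, wikis, and books
Lab environment for this course
- Install Anaconda on Windows/Mac/Linux
- Use Python 3.6 as the base environment
- Use Jupyter Notebook as the programming IDE
1-3 Installing Anaconda on Mac (demo)
Download the macOS installer, Python 3.6+, 64-bit (as of 2022/2/15 it ships Python 3.9)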
Anaconda3-2021.11-MacOSX-x86_64.pkg
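Choose "Install for me only" and keep the defaults for everything else
Changing the install directory is not recommended (the installation needs 1.44 GB)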
~] ls
~] pwd
~] cd anaconda/
anaconda] ls
anaconda] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./conda list
bin] ./jupyter notebook # opens the browser
1-4 Installing Anaconda on Windows (demo)
Download the Windows installer, Python 3.6+, 64-bit (as of 2022/2/15 it ships Python 3.9)
Anaconda3-2021.11-Windows-x86_64.exe
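Choose "Just Me (recommended)" and keep the defaults for everything else
The installed Anaconda3 appears in the Start menu
Open Jupyter Notebook
1-5 Installing Anaconda on Linux (demo)
Download the Linux installer, Python 3.6+, 64-bit (as of 2022/2/15 it ships Python 3.9)
Copy the installer link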
~] wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
~] ls
Anaconda3-2021.11-Linux-x86_64.sh
~] ls -lh
~] sh Anaconda3-2021.11-Linux-x86_64.sh # accept the default options
~] pwd
/home/centos
~] cd anaconda3
anaconda3] ls
anaconda3] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./jupyter notebook --no-browser # copy the printed link
Local terminal
~ ssh -N -f -L localhost:8888:localhost:8888 gitlab-demo-ci
~ ssh -N -f -L localhost:8888:localhost:8888 root@gitlab-demo-ci
Open a browser and paste the copied link
!ifconfig # in a notebook cell, runs the Linux ifconfig command
1-6 Using Jupyter Notebook (demo)
cd anaconda3
cd jupyter-notebook/python-data-science
python-data-science git:(master) ls
README.md demo.ipynb
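python-data-science git:(master) xx/bin/jupyter notebook # starts Jupyter Notebook
Chapter 2 Getting Started with Numpy
This chapter introduces Numpy, the most fundamental library in Python data science. It reviews the basics of matrix operations, introduces the key data structure, the array, and shows how to perform array and matrix operations with Numpy.
2-1 Five commonly used Python libraries for data science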
- Numpy
- Scipy
- Pandas
- Matplotlib
- Scikit-learn
Numpy
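- N-dimensional arrays (matrices); fast and efficient, with vectorized operations
- Efficient indexing without explicit loops
- Open source, free, and cross-platform; performance comparable to C/Matlab
Scipy
- Depends on Numpy
- Designed for science and engineering
- Implements many common scientific computations: linear algebra, Fourier transforms, signal and image processing
Pandas
- A powerful tool for structured data analysis (depends on Numpy)
- Provides several high-level data structures: Time-Series, DataFrame, Panel (Panel was removed in pandas 0.25)
- Powerful data indexing and processing
Matplotlib
- The most widely used 2D plotting package for Python
- Can largely replace Matlab's plotting features (scatter, line, bar, etc.)
- mplot3d can draw polished 3D plots
Scikit-learn
- A Python module for machine learning
- Built on Scipy; provides common machine learning algorithms such as clustering and regression
- A simple, easy-to-learn API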
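2-2 Math review: matrix operations
Basic concepts
- Matrix: a rectangular block of data, i.e. a two-dimensional array. Vectors and scalars are special cases of matrices
- Vector: a 1xn or nx1 matrix
- Scalar: a 1x1 matrix
- Array: an N-dimensional array, a generalization of the matrix
Special matrices
- All-zeros and all-ones matrices
- Identity matrix
Matrix addition and subtraction
- The two matrices being added or subtracted must have the same number of rows and columns
- Corresponding elements are added or subtracted
Array multiplication (element-wise)
- Array multiplication (element-wise) multiplies corresponding elements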
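The element-wise rules above map directly onto Numpy's + , - and * operators. The sketch below uses small hand-picked arrays (an assumption for illustration) and also shows what happens when the shapes cannot be matched.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])        # 2 x 3
b = np.array([[10, 20, 30], [40, 50, 60]])  # 2 x 3

total = a + b      # element-wise sum
product = a * b    # element-wise product, not the matrix product

print(total)
print(product)

# A 2 x 3 array and a 3 x 2 array have incompatible shapes,
# so element-wise addition raises ValueError.
c = np.array([[1, 2], [3, 4], [5, 6]])
try:
    a + c
    shapes_ok = True
except ValueError:
    shapes_ok = False
print(shapes_ok)
```

Note that Numpy also allows some unequal shapes through broadcasting (for example an array plus a scalar); the strict same-shape rule is the mathematical one for matrices.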
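Matrix multiplication
Let A be an m x p matrix and B a p x n matrix. The m x n matrix C is the product of A and B, written C = AB. With a_{ik} and b_{kj} denoting the elements of A and B, the element in row i and column j of C is:
$$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$$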
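The definition of C = AB can be checked numerically. The following sketch implements the sum over k with explicit loops and compares the result with Numpy's @ operator; the function name matmul_by_definition and the sample matrices are hypothetical, chosen only for illustration.

```python
import numpy as np

def matmul_by_definition(A, B):
    """Compute C = AB with c[i, j] = sum over k of A[i, k] * B[k, j]."""
    m, p = A.shape
    p2, n = B.shape
    if p != p2:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2, 3], [4, 5, 6]])        # 2 x 3
B = np.array([[7, 8], [9, 10], [11, 12]])   # 3 x 2
C = matmul_by_definition(A, B)
print(C)
# [[ 58  64]
#  [139 154]]
print(np.array_equal(C, A @ B))
```

The loop version is only meant to mirror the formula; in practice A @ B is far faster.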
Further linear algebra resources
- Linear Algebra, published by Tsinghua University Press
- http://bs.szu.edu.cn/sljr/Up/day_110824/201108240409437708.pdf
2-3 Creating and accessing arrays
In Jupyter Notebook, create a new file Array.ipynb
# Creating and accessing arrays
import numpy as np
# create from python list
list_1 = [1, 2, 3, 4]
list_1 # [1, 2, 3, 4]
array_1 = np.array(list_1)
array_1 # array([1, 2, 3, 4])
list_2 = [5, 6, 7, 8]
array_2 = np.array([list_1,list_2])
array_2
# array([[1, 2, 3, 4],
[5, 6, 7, 8]])
array_2.shape # (2, 4)
array_2.size # 8
array_2.dtype # dtype('int32') or dtype('int64'), depending on the platform
array_3 = np.array([[1.0,2,3],[4.0,5,6]])
array_3.dtype # dtype('float64')
array_4 = np.arange(1,10)
array_4 # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array_4 = np.arange(1, 10, 2)
array_4 # array([1, 3, 5, 7, 9])
np.zeros(5) # array([0., 0., 0., 0., 0.]), a zero vector
np.zeros([2,3]) # a two-dimensional zero matrix with 2 rows and 3 columns
# array([[0., 0., 0.],
[0., 0., 0.]])
np.eye(5) # the 5 x 5 identity matrix
# array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]])
np.eye(5).dtype # dtype('float64')
a = np.arange(1,10)
a # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a[1] # 2 (the second element of the array)
a[1:5] # array([2, 3, 4, 5]), the second through fifth elements
b = np.array([[1,2,3],[4,5,6]])
b
# array([[1, 2, 3],
[4, 5, 6]])
b[1][0] # 4
b[1,0] # 4
c = np.array([[1,2,3],[4,5,6],[7,8,9]])
c
# array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
c[:2,1:]
# array([[2, 3],
[5, 6]])
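2-4 Array and matrix operations
In Jupyter Notebook, create a new file 数组与矩阵运算.ipynb (array and matrix operations)
# Quick ways to create arrays
import numpy as np
np.random.randn(10) # a 1-D array of 10 floats drawn from the standard normal distribution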
# array([ 0.26674666, -0.91111093, 0.30684449, -0.80206634, -0.89176532,
0.7950014 , -0.53259808, -0.09981816, 1.2960139 , -0.9668373 ])
np.random.randint(10) # a single random integer in [0, 9], e.g. 0
np.random.randint(10,size=(2,3)) # a 2x3 two-dimensional array with elements in [0, 9]
# array([[7, 5, 8],
[1, 5, 8]])
np.random.randint(10,size=20) # a 1-D array of 20 elements in [0, 9]
# array([5, 6, 4, 8, 0, 9, 6, 2, 2, 9, 2, 1, 4, 6, 1, 5, 8, 2, 3, 4])
np.random.randint(10,size=20).reshape(4,5) # reshape the 20-element 1-D array into a 4x5 array, elements in [0, 9]
# array([[7, 1, 0, 5, 7],
[8, 0, 3, 7, 9],
[9, 0, 7, 3, 2],
[9, 1, 5, 8, 7]])
# Array operations
a = np.random.randint(10,size=20).reshape(4,5)
b = np.random.randint(10,size=20).reshape(4,5)
a
# array([[2, 3, 8, 4, 8],
[0, 7, 9, 9, 9],
[1, 8, 1, 8, 6],
[3, 4, 7, 5, 1]])
b
# array([[8, 4, 3, 1, 6],
[4, 4, 6, 2, 9],
[9, 4, 8, 5, 8],
[6, 2, 5, 5, 8]])
a + b
# array([[10, 7, 11, 5, 14],
[ 4, 11, 15, 11, 18],
[10, 12, 9, 13, 14],
[ 9, 6, 12, 10, 9]])
a - b
# array([[-6, -1, 5, 3, 2],
[-4, 3, 3, 7, 0],
[-8, 4, -7, 3, -2],
[-3, 2, 2, 0, -7]])
a * b
# array([[16, 12, 24, 4, 48],
[ 0, 28, 54, 18, 81],
[ 9, 32, 8, 40, 48],
[18, 8, 35, 25, 8]])
a / b
# a zero in b does not raise an exception: NumPy emits a RuntimeWarning and the result contains inf or nan
# array([[0.25 , 0.75 , 2.66666667, 4. , 1.33333333],
[0. , 1.75 , 1.5 , 4.5 , 1. ],
[0.11111111, 2. , 0.125 , 1.6 , 0.75 ],
[0.5 , 2. , 1.4 , 1. , 0.125 ]])
np.mat([[1,2,3],[4,5,6]])
# matrix([[1, 2, 3],
[4, 5, 6]])
a
# array([[2, 3, 8, 4, 8],
[0, 7, 9, 9, 9],
[1, 8, 1, 8, 6],
[3, 4, 7, 5, 1]])
np.mat(a)
# matrix([[2, 3, 8, 4, 8],
[0, 7, 9, 9, 9],
[1, 8, 1, 8, 6],
[3, 4, 7, 5, 1]])
# Matrix operations
A = np.mat(a)
B = np.mat(b)
A
# matrix([[2, 3, 8, 4, 8],
[0, 7, 9, 9, 9],
[1, 8, 1, 8, 6],
[3, 4, 7, 5, 1]])
B
# matrix([[8, 4, 3, 1, 6],
[4, 4, 6, 2, 9],
[9, 4, 8, 5, 8],
[6, 2, 5, 5, 8]])
A + B
# matrix([[10, 7, 11, 5, 14],
[ 4, 11, 15, 11, 18],
[10, 12, 9, 13, 14],
[ 9, 6, 12, 10, 9]])
A - B
# matrix([[-6, -1, 5, 3, 2],
[-4, 3, 3, 7, 0],
[-8, 4, -7, 3, -2],
[-3, 2, 2, 0, -7]])
A * B # raises ValueError: both are 4x5, so the number of columns of A (5) differs from the number of rows of B (4)
a = np.mat(np.random.randint(10,size=20).reshape(4,5))
b = np.mat(np.random.randint(10,size=20).reshape(5,4))
a
# matrix([[9, 9, 3, 0, 5],
[9, 4, 6, 4, 5],
[9, 0, 7, 0, 9],
[7, 2, 6, 0, 6]])
b
# matrix([[2, 2, 6, 4],
[8, 9, 8, 0],
[2, 1, 3, 9],
[3, 1, 0, 2],
[9, 3, 1, 4]])
a * b
# matrix([[141, 117, 140, 83],
[119, 79, 109, 118],
[113, 52, 84, 135],
[ 96, 56, 82, 106]])
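The same matrix product can be computed on plain arrays with the @ operator, which avoids the matrix class altogether (the NumPy documentation no longer recommends np.matrix for new code). The sketch below is one way to do it; the seeded generator is an assumption added so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded generator, for repeatable values
a = rng.integers(0, 10, size=(4, 5))      # 4 x 5 array, elements in [0, 9]
b = rng.integers(0, 10, size=(5, 4))      # 5 x 4 array, elements in [0, 9]

c = a @ b                                 # matrix product on ordinary arrays
print(c.shape)                            # (4, 4)

# a * b would be element-wise and fails here because the shapes differ;
# np.dot gives the same result as @ for 2-D arrays.
same_as_dot = np.array_equal(c, np.dot(a, b))
print(same_as_dot)
```

With arrays, * always means element-wise multiplication and @ always means the matrix product, so there is no need to remember which type a variable holds.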
# Common array functions
a = np.random.randint(10,size=20).reshape(4,5)
np.unique(a) # the distinct elements of a, sorted
# array([0, 1, 2, 3, 4, 5, 6, 8, 9])
a
# array([[4, 2, 8, 4, 2],
[6, 9, 6, 4, 0],
[9, 2, 6, 9, 0],
[1, 3, 8, 5, 9]])
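sum(a) # the built-in sum adds the rows together, giving the column sums (same as a.sum(axis=0))
# array([20, 16, 28, 22, 11])
sum(a[0]) # sum of the first row
# 20
sum(a[:,0]) # sum of the first column
# 20
a.max() # largest value in a
# 9
max(a[0]) # largest value in the first row
# 8
max(a[:,0]) # largest value in the first column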
# 9
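Numpy's own sum and max methods take an axis argument, which is usually clearer than combining the built-in functions with slicing. The sketch below reuses the array a printed above and shows the three common cases.

```python
import numpy as np

a = np.array([[4, 2, 8, 4, 2],
              [6, 9, 6, 4, 0],
              [9, 2, 6, 9, 0],
              [1, 3, 8, 5, 9]])

total = a.sum()            # sum of every element: 97
col_sums = a.sum(axis=0)   # one value per column: [20 16 28 22 11]
row_sums = a.sum(axis=1)   # one value per row:    [20 25 26 26]

col_max = a.max(axis=0)    # largest value in each column: [9 9 8 9 9]
row_max = a.max(axis=1)    # largest value in each row:    [8 9 9 9]

print(total, col_sums, row_sums, col_max, row_max)
```

axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row).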
2-5 Array input and output
In Jupyter Notebook, create a new file Array的input和output.ipynb (array input and output)
# Serialize a numpy array with pickle
import pickle
import numpy as np
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
with open('x.pk1','wb') as f: # the with block closes the file, so the data is flushed to disk
    pickle.dump(x, f)
!ls # on Windows, use !dir
# Array.ipynb Array的input和output.ipynb
x.pk1 数组与矩阵运算.ipynb
with open('x.pk1','rb') as f:
    x_loaded = pickle.load(f)
x_loaded
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.save('one_array', x)
!ls
# Array.ipynb Array的input和output.ipynb
x.pk1 one_array.npy
数组与矩阵运算.ipynb
np.load('one_array.npy')
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.arange(20)
y
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
np.savez('two_array.npz', a=x, b=y)
!ls
# Array.ipynb two_array.npz
Array的input和output.ipynb x.pk1
one_array.npy 数组与矩阵运算.ipynb
np.load('two_array.npz')
# <numpy.lib.npyio.NpzFile at 0x17033c77df0>
c = np.load('two_array.npz')
c['a']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
c['b']
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
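The object returned by np.load for an .npz file keeps the archive open, so it is tidy to use it as a context manager. The sketch below repeats the savez round trip in a temporary directory (an assumption made so that no files are left behind) and lists the stored names through the files attribute.

```python
import os
import tempfile

import numpy as np

x = np.arange(10)
y = np.arange(20)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "two_array.npz")
    np.savez(path, a=x, b=y)              # keyword names become the keys in the archive

    with np.load(path) as archive:        # closes the file when the block ends
        names = sorted(archive.files)     # names of the stored arrays
        a_loaded = archive["a"]
        b_loaded = archive["b"]

print(names)
print(a_loaded)
print(b_loaded)
```

Each array is only read from disk when its key is accessed, which is why the values are pulled out inside the with block.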
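SciPy documentation
- Current: https://docs.scipy.org/doc/scipy/getting_started.html
- Previous: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
Chapter 3 Getting Started with Pandas
This chapter introduces Pandas, the most important Python data science library for data analysis. Starting from its two central data structures, Series and DataFrame, it covers how to create them and their basic operations, and uses hands-on examples to show how Series and DataFrame relate.
3-1 Pandas Series
In Jupyter Notebook, create a new file Series.ipynb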
import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1
# 0 1
1 2
2 3
3 4
dtype: int64
s1.values
# array([1, 2, 3, 4], dtype=int64)
s1.index
# RangeIndex(start=0, stop=4, step=1)
s2 = pd.Series(np.arange(10))
s2 # on some machines the result is dtype: int64
# 0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int32
s3 = pd.Series({'1':1, '2':2, '3':3})
s3
# 1 1
2 2
3 3
dtype: int64
s3.values
# array([1, 2, 3], dtype=int64)
s3.index
# Index(['1', '2', '3'], dtype='object')
s4 = pd.Series([1,2,3,4],index=['A','B','C','D'])
s4
# A 1
B 2
C 3
D 4
dtype: int64
s4.values
# array([1, 2, 3, 4], dtype=int64)
s4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
s4['A']
# 1
s4[s4>2]
# C 3
D 4
dtype: int64
s4
# A 1
B 2
C 3
D 4
dtype: int64
s4.to_dict()
# {'A': 1, 'B': 2, 'C': 3, 'D': 4}
s5 = pd.Series(s4.to_dict())
s5
# A 1
B 2
C 3
D 4
dtype: int64
index_1 = ['A', 'B', 'C', 'D','E']
s6 = pd.Series(s5,index=index_1)
s6
# A 1.0
B 2.0
C 3.0
D 4.0
E NaN
dtype: float64
pd.isnull(s6)
# A False
B False
C False
D False
E True
dtype: bool
pd.notnull(s6)
# A True
B True
C True
D True
E False
dtype: bool
s6
# A 1.0
B 2.0
C 3.0
D 4.0
E NaN
dtype: float64
s6.name = 'demo'
s6
# A 1.0
B 2.0
C 3.0
D 4.0
E NaN
Name: demo, dtype: float64
s6.index.name = 'demo index'
s6
# demo index
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
Name: demo, dtype: float64
s6.index
# Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
s6.values
# array([ 1., 2., 3., 4., nan])
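The s6 example relies on label alignment: when a Series is built from another Series with a new index, values are looked up by label, not by position. The following sketch, with illustrative labels chosen here, makes that explicit and shows one possible way to handle the resulting NaN.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Labels are matched by name: "D" and "A" are found, "E" is missing.
t = pd.Series(s, index=["D", "A", "E"])
print(t)

missing = t.isnull().tolist()            # which labels had no value
filled = t.fillna(0).astype("int64")     # replace NaN with 0 and restore an integer dtype

print(missing)
print(filled.tolist())
```

The dtype changes to float64 in t because NaN is a floating-point value; filling the gap allows a conversion back to integers.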
3-2 Pandas DataFrame
In Jupyter Notebook, create a new file DataFrame.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import webbrowser
link = 'https://www.tiobe.com/tiobe-index/'
webbrowser.open(link) # opens the link in the browser
# True
df = pd.read_clipboard() # first copy 10 rows of the table on the page, including the header row
df
# Output
Position Programming Language Ratings
0 21 SAS 0.66% None
1 22 Scratch 0.64% None
2 23 Fortran 0.58% None
3 24 Rust 0.54% None
4 25 (Visual) FoxPro 0.52%
5 26 COBOL 0.42% None
6 27 Dart 0.42% None
7 28 Kotlin 0.41% None
8 29 Lua 0.40% None
9 30 Julia 0.40% None
type(df)
# pandas.core.frame.DataFrame
df.columns
# Index(['Position', 'Programming', 'Language', 'Ratings'], dtype='object')
df.Ratings
#
0 None
1 None
2 None
3 None
4 0.52%
5 None
6 None
7 None
8 None
9 None
Name: Ratings, dtype: object
df_new = DataFrame(df,columns=['Programming','Language'])
df_new
# Output
Programming Language
0 SAS 0.66%
1 Scratch 0.64%
2 Fortran 0.58%
3 Rust 0.54%
4 (Visual) FoxPro
5 COBOL 0.42%
6 Dart 0.42%
7 Kotlin 0.41%
8 Lua 0.40%
9 Julia 0.40%
df['Position']
#
0 21
1 22
2 23
3 24
4 25
5 26
6 27
7 28
8 29
9 30
Name: Position, dtype: int64
type(df['Position'])
# pandas.core.series.Series
df_new = DataFrame(df,columns=['Programming','Language','Language1'])
df_new
# Output
Programming Language Language1
0 SAS 0.66% NaN
1 Scratch 0.64% NaN
2 Fortran 0.58% NaN
3 Rust 0.54% NaN
4 (Visual) FoxPro NaN
5 COBOL 0.42% NaN
6 Dart 0.42% NaN
7 Kotlin 0.41% NaN
8 Lua 0.40% NaN
9 Julia 0.40% NaN
# Three ways to fill the new column
df_new['Language1'] = range(0,10)
# df_new['Language1'] = np.arange(0,10)
# df_new['Language1'] = pd.Series(np.arange(0,10))
df_new
# Output
Programming Language Language1
0 SAS 0.66% 0
1 Scratch 0.64% 1
2 Fortran 0.58% 2
3 Rust 0.54% 3
4 (Visual) FoxPro 4
5 COBOL 0.42% 5
6 Dart 0.42% 6
7 Kotlin 0.41% 7
8 Lua 0.40% 8
9 Julia 0.40% 9
df_new['Language1'] = pd.Series([100,200], index=[1,2])
df_new
# Output
Programming Language Language1
0 SAS 0.66% NaN
1 Scratch 0.64% 100.0
2 Fortran 0.58% 200.0
3 Rust 0.54% NaN
4 (Visual) FoxPro NaN
5 COBOL 0.42% NaN
6 Dart 0.42% NaN
7 Kotlin 0.41% NaN
8 Lua 0.40% NaN
9 Julia 0.40% NaN
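The last assignment shows that a Series assigned to a column is aligned on the row index, whereas a list or range is assigned by position. The sketch below reproduces the effect on a small hand-made frame; the column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"lang": ["SAS", "Scratch", "Fortran", "Rust"]})

# A list is assigned by position and must match the number of rows.
df["from_list"] = [10, 20, 30, 40]

# A Series is aligned by index label: rows 1 and 2 get values, the rest become NaN.
df["from_series"] = pd.Series([100, 200], index=[1, 2])

print(df)
is_missing = df["from_series"].isnull().tolist()
print(is_missing)
```

Because of the NaN entries, the from_series column is stored as float64 even though the assigned values were integers.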
3-3 A closer look at Series and DataFrame
In Jupyter Notebook, create a new file 深入理解Series和Dataframe.ipynb (a closer look at Series and DataFrame)
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = {'Country':['Belgium', 'India', 'Brazil'],
'Capital':['Brussels','New Delhi', 'Brasilia'],
'Population':[11190846, 1303171035, 207847528]}
#Series
s1 = pd.Series(data['Country'])
s1
# Output
0 Belgium
1 India
2 Brazil
dtype: object
s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# RangeIndex(start=0, stop=3, step=1)
s1 = pd.Series(data['Country'],index=['A','B','C'])
s1
# Output
A Belgium
B India
C Brazil
dtype: object
s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# Index(['A', 'B', 'C'], dtype='object')
#Dataframe
df1 = pd.DataFrame(data)
df1
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
df1['Country']
# Output
0 Belgium
1 India
2 Brazil
Name: Country, dtype: object
cou = df1['Country']
type(cou)
# pandas.core.series.Series
df1.iterrows()
# <generator object DataFrame.iterrows at 0x0000018DD44C59E0>
for row in df1.iterrows():
    print(row)
    print(type(row))
    print(len(row))
# Output
(0, Country Belgium
Capital Brussels
Population 11190846
Name: 0, dtype: object)
<class 'tuple'>
2
(1, Country India
Capital New Delhi
Population 1303171035
Name: 1, dtype: object)
<class 'tuple'>
2
(2, Country Brazil
Capital Brasilia
Population 207847528
Name: 2, dtype: object)
<class 'tuple'>
2
for row in df1.iterrows():
    print(type(row[0]),row[0],row[1])
    break
# Output
<class 'int'> 0 Country Belgium
Capital Brussels
Population 11190846
Name: 0, dtype: object
# Note: <class 'int'> was printed here, although <class 'numpy.int64'> was expected
for row in df1.iterrows():
    print(type(row[0]),type(row[1]))
    break
# Output
<class 'int'> <class 'pandas.core.series.Series'>
# Note: <class 'int'> was printed here, although <class 'numpy.int64'> was expected
df1
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
data
# Output
{'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
'Population': [11190846, 1303171035, 207847528]}
s1 = pd.Series(data['Country'])
s2 = pd.Series(data['Capital'])
s3 = pd.Series(data['Population'])
df_new = pd.DataFrame([s1,s2,s3])
df_new
# Output
0 1 2
0 Belgium India Brazil
1 Brussels New Delhi Brasilia
2 11190846 1303171035 207847528
df1
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
df_new = df_new.T
df_new
# Output
0 1 2
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
df_new = pd.DataFrame([s1,s2,s3], index=['Country','Capital','Population'])
df_new
# Output
0 1 2
Country Belgium India Brazil
Capital Brussels New Delhi Brasilia
Population 11190846 1303171035 207847528
df_new = df_new.T
df_new
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
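Building the frame from a list of Series and then transposing treats each Series as a row first. An alternative, sketched below as one option rather than the course's method, is to name each Series and place them side by side with pd.concat, so each Series becomes a column directly and keeps its own dtype.

```python
import pandas as pd

data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}

# Give each Series a name; concat uses the names as column labels.
s1 = pd.Series(data['Country'], name='Country')
s2 = pd.Series(data['Capital'], name='Capital')
s3 = pd.Series(data['Population'], name='Population')

df_concat = pd.concat([s1, s2, s3], axis=1)   # each Series becomes a column
df_rows = pd.DataFrame([s1, s2, s3]).T        # rows first, then transpose

print(df_concat.dtypes)   # Population keeps an integer dtype
print(df_rows.dtypes)     # every column is object after the transpose
```

Both frames display the same values, but the transposed one stores Population as generic Python objects, which matters for later numeric work.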
3-4 Pandas DataFrame I/O operations
In Jupyter Notebook, create a new file DataFrame IO.ipynb
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import webbrowser
link = 'http://pandas.pydata.org/pandas-docs/version/0.20/io.html'
webbrowser.open(link) # opens the browser and returns True; copy the table from the page
# True
df1 = pd.read_clipboard()
df1
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_clipboard()
df1
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_csv('df1.csv')
!ls # on Windows, use !dir
# DataFrame IO.ipynb df1.csv
!more df1.csv
# Output
,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Msgpack,read_msgpack,to_msgpack
8,binary,Stata,read_stata,to_stata
9,binary,SAS,read_sas,
10,binary,Python Pickle Format,read_pickle,to_pickle
11,SQL,SQL,read_sql,to_sql
12,SQL,Google Big Query,read_gbq,to_gbq
df1.to_csv('df1.csv',index=False) # omit the index column
!ls
# DataFrame IO.ipynb df1.csv
!more df1.csv
# Output
Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata
binary,SAS,read_sas,
binary,Python Pickle Format,read_pickle,to_pickle
SQL,SQL,read_sql,to_sql
SQL,Google Big Query,read_gbq,to_gbq
df2 = pd.read_csv('df1.csv')
df2
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_json()
# Output
'{"Format Type":{"0":"text","1":"text","2":"text","3":"text","4":"binary","5":"binary","6":"binary","7":"binary","8":"binary","9":"binary","10":"binary","11":"SQL","12":"SQL"},"Data Description":{"0":"CSV","1":"JSON","2":"HTML","3":"Local clipboard","4":"MS Excel","5":"HDF5 Format","6":"Feather Format","7":"Msgpack","8":"Stata","9":"SAS","10":"Python Pickle Format","11":"SQL","12":"Google Big Query"},"Reader":{"0":"read_csv","1":"read_json","2":"read_html","3":"read_clipboard","4":"read_excel","5":"read_hdf","6":"read_feather","7":"read_msgpack","8":"read_stata","9":"read_sas","10":"read_pickle","11":"read_sql","12":"read_gbq"},"Writer":{"0":"to_csv","1":"to_json","2":"to_html","3":"to_clipboard","4":"to_excel","5":"to_hdf","6":"to_feather","7":"to_msgpack","8":"to_stata","9":" ","10":"to_pickle","11":"to_sql","12":"to_gbq"}}'
pd.read_json(df1.to_json())
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_html()
# Output
'<table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th></th>\n <th>Format Type</th>\n <th>Data Description</th>\n <th>Reader</th>\n <th>Writer</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>text</td>\n <td>CSV</td>\n <td>read_csv</td>\n <td>to_csv</td>\n </tr>\n <tr>\n <th>1</th>\n <td>text</td>\n <td>JSON</td>\n <td>read_json</td>\n <td>to_json</td>\n </tr>\n <tr>\n <th>2</th>\n <td>text</td>\n <td>HTML</td>\n <td>read_html</td>\n <td>to_html</td>\n </tr>\n <tr>\n <th>3</th>\n <td>text</td>\n <td>Local clipboard</td>\n <td>read_clipboard</td>\n <td>to_clipboard</td>\n </tr>\n <tr>\n <th>4</th>\n <td>binary</td>\n <td>MS Excel</td>\n <td>read_excel</td>\n <td>to_excel</td>\n </tr>\n <tr>\n <th>5</th>\n <td>binary</td>\n <td>HDF5 Format</td>\n <td>read_hdf</td>\n <td>to_hdf</td>\n </tr>\n <tr>\n <th>6</th>\n <td>binary</td>\n <td>Feather Format</td>\n <td>read_feather</td>\n <td>to_feather</td>\n </tr>\n <tr>\n <th>7</th>\n <td>binary</td>\n <td>Msgpack</td>\n <td>read_msgpack</td>\n <td>to_msgpack</td>\n </tr>\n <tr>\n <th>8</th>\n <td>binary</td>\n <td>Stata</td>\n <td>read_stata</td>\n <td>to_stata</td>\n </tr>\n <tr>\n <th>9</th>\n <td>binary</td>\n <td>SAS</td>\n <td>read_sas</td>\n <td></td>\n </tr>\n <tr>\n <th>10</th>\n <td>binary</td>\n <td>Python Pickle Format</td>\n <td>read_pickle</td>\n <td>to_pickle</td>\n </tr>\n <tr>\n <th>11</th>\n <td>SQL</td>\n <td>SQL</td>\n <td>read_sql</td>\n <td>to_sql</td>\n </tr>\n <tr>\n <th>12</th>\n <td>SQL</td>\n <td>Google Big Query</td>\n <td>read_gbq</td>\n <td>to_gbq</td>\n </tr>\n </tbody>\n</table>'
df1.to_html('df1.html')
!ls
# DataFrame IO.ipynb df1.csv df1.html
df1.to_excel('df1.xlsx')
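The CSV round trip can also be tried without touching the file system, which is handy for quick experiments. The sketch below uses a small hand-made frame (an assumption, standing in for the clipboard table) and an in-memory text buffer.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Format Type": ["text", "binary"],
    "Reader": ["read_csv", "read_excel"],
    "Writer": ["to_csv", "to_excel"],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
header = csv_text.splitlines()[0]
print(header)

# read_csv accepts any file-like object, here an in-memory buffer.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```

Passing index=False on the way out avoids an extra unnamed column appearing when the text is read back.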
3-5 Selecting and indexing a DataFrame
In Jupyter Notebook, create a new file Selecting and indexing.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
!pwd # on Windows, use !cd (chdir)
# /Users/xxx/xx
!ls /Users/xxx/xx/homework # on Windows, use !dir
# movie_metadata.csv
imdb = pd.read_csv('/Users/xxx/xx/homework/movie_metadata.csv')
imdb
# Output
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5038 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
5039 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
5040 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
5041 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
5042 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456
5043 rows × 28 columns
imdb.shape
# (5043, 28)
imdb.head()
# Output
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
5 rows × 28 columns
imdb.tail(10)
# Output
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
5033 Color Shane Carruth 143.0 77.0 291.0 8.0 David Sullivan 291.0 424760.0 Drama|Sci-Fi|Thriller ... 371.0 English USA PG-13 7000.0 2004.0 45.0 7.0 1.85 19000
5034 Color Neill Dela Llana 35.0 80.0 0.0 0.0 Edgar Tancangco 0.0 70071.0 Thriller ... 35.0 English Philippines Not Rated 7000.0 2005.0 0.0 6.3 NaN 74
5035 Color Robert Rodriguez 56.0 81.0 0.0 6.0 Peter Marquardt 121.0 2040920.0 Action|Crime|Drama|Romance|Thriller ... 130.0 Spanish USA R 7000.0 1992.0 20.0 6.9 1.37 0
5036 Color Anthony Vallone NaN 84.0 2.0 2.0 John Considine 45.0 NaN Crime|Drama ... 1.0 English USA PG-13 3250.0 2005.0 44.0 7.8 NaN 4
5037 Color Edward Burns 14.0 95.0 0.0 133.0 Caitlin FitzGerald 296.0 4584.0 Comedy|Drama ... 14.0 English USA Not Rated 9000.0 2011.0 205.0 6.4 NaN 413
5038 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
5039 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
5040 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
5041 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
5042 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456
10 rows × 28 columns
imdb['color']
# Output
0 Color
1 Color
2 Color
3 Color
4 NaN
...
5038 Color
5039 Color
5040 Color
5041 Color
5042 Color
Name: color, Length: 5043, dtype: object
imdb['color'][0]
# 'Color'
imdb['color'][1]
# 'Color'
imdb[['color','director_name']]
# Output
color director_name
0 Color James Cameron
1 Color Gore Verbinski
2 Color Sam Mendes
3 Color Christopher Nolan
4 NaN Doug Walker
... ... ...
5038 Color Scott Smith
5039 Color NaN
5040 Color Benjamin Roberds
5041 Color Daniel Hsia
5042 Color Jon Gunn
5043 rows × 2 columns
sub_df = imdb[['director_name','movie_title','imdb_score']]
sub_df
# Output
director_name movie_title imdb_score
0 James Cameron Avatar 7.9
1 Gore Verbinski Pirates of the Caribbean: At World's End 7.1
2 Sam Mendes Spectre 6.8
3 Christopher Nolan The Dark Knight Rises 8.5
4 Doug Walker Star Wars: Episode VII - The Force Awakens ... 7.1
... ... ... ...
5038 Scott Smith Signed Sealed Delivered 7.7
5039 NaN The Following 7.5
5040 Benjamin Roberds A Plague So Pleasant 6.3
5041 Daniel Hsia Shanghai Calling 6.3
5042 Jon Gunn My Date with Drew 6.6
5043 rows × 3 columns
sub_df.head()
# Output
director_name movie_title imdb_score
0 James Cameron Avatar 7.9
1 Gore Verbinski Pirates of the Caribbean: At World's End 7.1
2 Sam Mendes Spectre 6.8
3 Christopher Nolan The Dark Knight Rises 8.5
4 Doug Walker Star Wars: Episode VII - The Force Awakens ... 7.1
sub_df.head(5)
# Output
director_name movie_title imdb_score
0 James Cameron Avatar 7.9
1 Gore Verbinski Pirates of the Caribbean: At World's End 7.1
2 Sam Mendes Spectre 6.8
3 Christopher Nolan The Dark Knight Rises 8.5
4 Doug Walker Star Wars: Episode VII - The Force Awakens ... 7.1
sub_df.iloc[10:20,:]
# Output
director_name movie_title imdb_score
10 Zack Snyder Batman v Superman: Dawn of Justice 6.9
11 Bryan Singer Superman Returns 6.1
12 Marc Forster Quantum of Solace 6.7
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest 7.3
14 Gore Verbinski The Lone Ranger 6.5
15 Zack Snyder Man of Steel 7.2
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian 6.6
17 Joss Whedon The Avengers 8.1
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides 6.7
19 Barry Sonnenfeld Men in Black 3 6.8
sub_df.iloc[10:20,0:2]
# Output
director_name movie_title
10 Zack Snyder Batman v Superman: Dawn of Justice
11 Bryan Singer Superman Returns
12 Marc Forster Quantum of Solace
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest
14 Gore Verbinski The Lone Ranger
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides
19 Barry Sonnenfeld Men in Black 3
tmp_df = sub_df.iloc[10:20,0:2]
tmp_df
# Output
director_name movie_title
10 Zack Snyder Batman v Superman: Dawn of Justice
11 Bryan Singer Superman Returns
12 Marc Forster Quantum of Solace
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest
14 Gore Verbinski The Lone Ranger
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides
19 Barry Sonnenfeld Men in Black 3
tmp_df.iloc[2:4,:]
# Output
director_name movie_title
12 Marc Forster Quantum of Solace
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest
tmp_df.loc[15:17,:]
# Output
director_name movie_title
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
tmp_df.loc[15:17,:'movie_title']
# Output
director_name movie_title
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
tmp_df.loc[15:17,:'director_name']
# Output
director_name
15 Zack Snyder
16 Andrew Adamson
17 Joss Whedon
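The difference between iloc and loc in the outputs above comes down to positions versus labels. The sketch below rebuilds a four-row slice by hand (the values are copied from the table above, the frame itself is an illustrative assumption) to show that iloc excludes the end position while loc includes the end label.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "director_name": ["Zack Snyder", "Bryan Singer", "Marc Forster", "Gore Verbinski"],
        "movie_title": ["Batman v Superman: Dawn of Justice", "Superman Returns",
                        "Quantum of Solace", "Pirates of the Caribbean: Dead Man's Chest"],
    },
    index=[10, 11, 12, 13],   # labels no longer start at 0, as in tmp_df
)

by_position = df.iloc[1:3, :]                 # positions 1 and 2, end position excluded
by_label = df.loc[11:13, :]                   # labels 11, 12 and 13, end label included
one_column = df.loc[11:12, :"director_name"]  # column slices by label are inclusive too

print(by_position.index.tolist())   # [11, 12]
print(by_label.index.tolist())      # [11, 12, 13]
print(one_column.columns.tolist())  # ['director_name']
```

Once a frame has been sliced, positions and labels diverge, so it is worth being explicit about which one a selection uses.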
3-6 Reindexing Series and DataFrame
In Jupyter Notebook, create a new file Reindexing Series and DataFrame.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# series reindex
s1 = Series([1,2,3,4], index=['A','B','C','D'])
s1
# Output
A 1
B 2
C 3
D 4
dtype: int64
# s1.reindex() # place the cursor on the method and press Shift+Tab to open its documentation; press repeatedly for more detail
s1.reindex(index=['A','B','C','D','E'])
# Output
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
dtype: float64
s1.reindex(index=['A','B','C','D','E'],fill_value=0)
# Output
A 1
B 2
C 3
D 4
E 0
dtype: int64
s1.reindex(index=['A','B','C','D','E'],fill_value=10)
# Output
A 1
B 2
C 3
D 4
E 10
dtype: int64
s2 = Series(['A','B','C'], index=[1,5,10])
s2
# Output
1 A
5 B
10 C
dtype: object
s2.reindex(index=range(15))
# Output
0 NaN
1 A
2 NaN
3 NaN
4 NaN
5 B
6 NaN
7 NaN
8 NaN
9 NaN
10 C
11 NaN
12 NaN
13 NaN
14 NaN
dtype: object
s2.reindex(index=range(15),method='ffill')
# Output
0 NaN
1 A
2 A
3 A
4 A
5 B
6 B
7 B
8 B
9 B
10 C
11 C
12 C
13 C
14 C
dtype: object
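With method='ffill', each new label takes the value of the closest existing label before it, which is why label 0 stays NaN in the output above. The opposite direction is method='bfill', which takes the value of the closest existing label after it. A small sketch contrasting the two on the same data follows. Filling by method assumes the original index is sorted (monotonic).

```python
import pandas as pd

s2 = pd.Series(["A", "B", "C"], index=[1, 5, 10])

forward = s2.reindex(range(12), method="ffill")   # copy the previous known value
backward = s2.reindex(range(12), method="bfill")  # copy the next known value

# Label 0 has no earlier value to copy, label 11 has no later one.
print(forward.loc[0], forward.loc[4], forward.loc[11])     # nan A C
print(backward.loc[0], backward.loc[4], backward.loc[11])  # A B nan
```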
# reindex dataframe
df1 = DataFrame(np.random.rand(25).reshape([5,5]))
df1
# Output
0 1 2 3 4
0 0.255424 0.315708 0.951327 0.423676 0.975377
1 0.087594 0.192460 0.502268 0.534926 0.423024
2 0.817002 0.113410 0.468270 0.410297 0.278942
3 0.315239 0.018933 0.133764 0.240001 0.910754
4 0.267342 0.451077 0.282865 0.170235 0.898429
df1 = DataFrame(np.random.rand(25).reshape([5,5]), index=['A','B','D','E','F'], columns=['c1','c2','c3','c4','c5'])
df1
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.reindex(index=['A','B','C','D','E','F'])
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
C NaN NaN NaN NaN NaN
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.reindex(columns=['c1','c2','c3','c4','c5','c6'])
# Output
c1 c2 c3 c4 c5 c6
A 0.278063 0.894546 0.932129 0.178442 0.303684 NaN
B 0.186239 0.260677 0.708358 0.275914 0.369878 NaN
D 0.786987 0.125907 0.191987 0.338194 0.009877 NaN
E 0.192269 0.909661 0.227301 0.343989 0.610203 NaN
F 0.503267 0.306472 0.197467 0.063800 0.813786 NaN
df1.reindex(index=['A','B','C','D','E','F'],columns=['c1','c2','c3','c4','c5','c6'])
# Output
c1 c2 c3 c4 c5 c6
A 0.278063 0.894546 0.932129 0.178442 0.303684 NaN
B 0.186239 0.260677 0.708358 0.275914 0.369878 NaN
C NaN NaN NaN NaN NaN NaN
D 0.786987 0.125907 0.191987 0.338194 0.009877 NaN
E 0.192269 0.909661 0.227301 0.343989 0.610203 NaN
F 0.503267 0.306472 0.197467 0.063800 0.813786 NaN
s1
# Output
A 1
B 2
C 3
D 4
dtype: int64
s1.reindex(index=['A','B'])
# Output
A 1
B 2
dtype: int64
df1
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.reindex(index=['A','B'])
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
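Taken together, the cells above show that reindex can add labels (filled with NaN), drop labels that are not listed, and, as a consequence, reorder them. The following sketch does all three in one call. It uses np.arange instead of random numbers so the result is reproducible; the frame df is a hypothetical stand-in for df1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=["A", "B", "D", "E", "F"],
    columns=["c1", "c2", "c3", "c4", "c5"],
)

# Keep B and A (in that order), add the missing label C, keep only two columns.
out = df.reindex(index=["B", "A", "C"], columns=["c5", "c1"])
print(out)

# The original frame is untouched; reindex always returns a new object.
print(df.shape)  # (5, 5)
```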
s1
# Output
A 1
B 2
C 3
D 4
dtype: int64
s1.drop('A')
# Output
B 2
C 3
D 4
dtype: int64
df1
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.drop('A',axis=0)
# Output
c1 c2 c3 c4 c5
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.drop('c1',axis=0)
# Raises an error: 'c1' is a column label, so it is not found among the row labels (axis=0)
df1.drop('c1',axis=1)
# Output
c2 c3 c4 c5
A 0.894546 0.932129 0.178442 0.303684
B 0.260677 0.708358 0.275914 0.369878
D 0.125907 0.191987 0.338194 0.009877
E 0.909661 0.227301 0.343989 0.610203
F 0.306472 0.197467 0.063800 0.813786
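The drop calls above can also be written with the index= and columns= keywords, which avoids mixing up the axis. The sketch below shows both spellings, the error raised when a label is looked up on the wrong axis (a KeyError in current pandas releases; older releases may raise a different exception type), and the errors="ignore" option. The small frame df is a hypothetical example, not the df1 from the notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=["A", "B"], columns=["c1", "c2", "c3"]
)

no_c1 = df.drop(columns="c1")  # same result as df.drop("c1", axis=1)
no_a = df.drop(index="A")      # same result as df.drop("A", axis=0)

try:
    df.drop("c1", axis=0)      # "c1" is a column, not a row label
except KeyError as exc:
    print("KeyError:", exc)

# errors="ignore" skips labels that are not found instead of raising.
unchanged = df.drop("c1", axis=0, errors="ignore")

# drop returns a new object; df still has both rows and all three columns.
print(df.shape)  # (2, 3)
```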
3-7 A Word About NaN
In Jupyter Notebook, create a new file: 谈一谈NaN.ipynb
# NaN - means Not a Number
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
n = np.nan
type(n)
# float
m = 1
m + n
# nan
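Two properties of NaN are worth checking directly: it propagates through arithmetic, and it never compares equal to anything, including itself. That is why missing values are tested with np.isnan or isnull rather than with ==. A minimal sketch:

```python
import math
import numpy as np

n = np.nan

print(n == n)                # False: NaN is not equal to anything, itself included
print(np.isnan(n))           # True: the reliable way to test for NaN
print(math.isnan(1 + n))     # True: arithmetic with NaN yields NaN
print(np.nansum([1, 2, n]))  # 3.0: nan-aware reductions skip NaN
```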
# NaN in Series
s1 = Series([1, 2, np.nan, 3, 4], index=['A','B','C','D','E'])
s1
# Output
A 1.0
B 2.0
C NaN
D 3.0
E 4.0
dtype: float64
s1.isnull()
# Output
A False
B False
C True
D False
E False
dtype: bool
s1.notnull()
# Output
A True
B True
C False
D True
E True
dtype: bool
s1
# Output
A 1.0
B 2.0
C NaN
D 3.0
E 4.0
dtype: float64
s1.dropna()
# Output
A 1.0
B 2.0
D 3.0
E 4.0
dtype: float64
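Because isnull returns a boolean Series, summing it counts the missing values. Reductions such as sum and mean skip NaN by default, which can be switched off with skipna=False. A short sketch on the same data as s1 (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 3, 4], index=["A", "B", "C", "D", "E"])

n_missing = s.isnull().sum()  # True counts as 1, so this is the number of NaN
print(n_missing)              # 1

print(s.sum())                # 10.0: NaN is skipped by default
print(s.mean())               # 2.5: the mean of the four non-missing values
print(s.sum(skipna=False))    # nan: NaN propagates when it is not skipped
```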
# NaN in DataFrame
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])
dframe
# Output
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
3 NaN NaN NaN
dframe.isnull()
# Output
0 1 2
0 False False False
1 True False False
2 False True False
3 True True True
dframe.notnull()
# Output
0 1 2
0 True True True
1 False True True
2 True False True
3 False False False
df1 = dframe.dropna(axis=0)
df1
# Output
0 1 2
0 1.0 2.0 3.0
df1 = dframe.dropna(axis=1)
df1
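# Output (every column contains at least one NaN, so all columns are dropped and only the row index remains)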
0
1
2
3
df1 = dframe.dropna(axis=1,how='any')
df1
# Output (same as above: how='any' is the default)
0
1
2
3
df1 = dframe.dropna(axis=0,how='any')
df1
# Output
0 1 2
0 1.0 2.0 3.0
df1 = dframe.dropna(axis=0,how='all')
df1
# Output
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])
dframe2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
3 1.0 NaN NaN NaN
df2 = dframe2.dropna(thresh=None)
df2
# Output (every row contains at least one NaN, so no rows remain)
0 1 2 3
df2 = dframe2.dropna(thresh=2)
df2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
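The thresh argument keeps a row only if it has at least that many non-NaN values. Row 3 above has a single non-NaN value, so thresh=2 removes it. The sketch below reproduces this by counting non-null values per row, and also shows the subset argument, which restricts the NaN check to the listed columns. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1, 2, 3, np.nan],
        [2, np.nan, 5, 6],
        [np.nan, 7, np.nan, 9],
        [1, np.nan, np.nan, np.nan],
    ]
)

non_null_per_row = df.notnull().sum(axis=1)
print(non_null_per_row.tolist())  # [3, 3, 2, 1]

kept = df.dropna(thresh=2)        # keep rows with at least 2 non-NaN values
same = df[non_null_per_row >= 2]  # the equivalent boolean selection
print(kept.equals(same))          # True

# subset limits the check to column 0, so only row 2 is dropped.
by_col0 = df.dropna(subset=[0])
print(by_col0.index.tolist())     # [0, 1, 3]
```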
dframe2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
3 1.0 NaN NaN NaN
dframe2.fillna(value=1)
# Output
0 1 2 3
0 1.0 2.0 3.0 1.0
1 2.0 1.0 5.0 6.0
2 1.0 7.0 1.0 9.0
3 1.0 1.0 1.0 1.0
dframe2.fillna(value={0:0,1:1,2:2,3:3}) # per-column fill: each key is a column label, each value fills the NaN in that column
# Output
0 1 2 3
0 1.0 2.0 3.0 3.0
1 2.0 1.0 5.0 6.0
2 0.0 7.0 2.0 9.0
3 1.0 1.0 2.0 3.0
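A common use of per-column filling is to replace each NaN with the mean of its own column. Passing the Series returned by mean() to fillna works like the dictionary above, since its index holds the column labels. This is a sketch of one possible strategy, not a recommendation for every dataset; like dropna, fillna returns a new object and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1, 2, 3, np.nan],
        [2, np.nan, 5, 6],
        [np.nan, 7, np.nan, 9],
        [1, np.nan, np.nan, np.nan],
    ]
)

col_means = df.mean()          # column means, computed with NaN skipped
filled = df.fillna(col_means)  # each NaN gets the mean of its own column

print(filled.loc[1, 1])        # 4.5, the mean of 2 and 7
print(filled.loc[0, 3])        # 7.5, the mean of 6 and 9

# The original still contains its seven NaN values.
print(df.isnull().sum().sum())  # 7
```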
df1
# Output
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
df2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
df1.dropna()
# Output
0 1 2
0 1.0 2.0 3.0
df1.fillna(1)
# Output
0 1 2
0 1.0 2.0 3.0
1 1.0 5.0 6.0
2 7.0 1.0 9.0