Chapter 1: Setting Up the Lab Environment

This chapter introduces Anaconda and Jupyter Notebook: how to install Anaconda on Windows, Mac, and Linux, and how to start and use Jupyter Notebook.

1-1 Introductory Video

Data science and machine learning

Data science workflow

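Course outline:

Chapter 1: Setting up the lab environment
Chapter 2: Getting started with Numpy
Chapter 3: Getting started with Pandas
Chapter 4: Working with data in Pandas
Chapter 5: Plotting and visualization with Matplotlib
Chapter 6: Plotting and visualization with Seaborn
Chapter 7: Hands-on data analysis project
Chapter 8: Summary

Who this course is for:

Learners who can study on their own and are comfortable with hands-on work
Learners with basic Python knowledge
Learners who want to work in data analysis or machine learning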

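1-2 Introduction to Anaconda and Jupyter Notebook

Anaconda/Jupyter Notebook: Open Data Science Platform

What is Anaconda?

The best-known Python data science platform
750+ popular Python and R packages
Cross-platform: Windows, Mac, Linux
conda: an extensible package manager
Free to distribute
A very active community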

Installing Anaconda

Download links

Current: https://www.anaconda.com/products/individual
Previous: https://www.anaconda.com/download/

Check that the installation works:

cd ~/anaconda
bin/conda --version
conda 4.3.21

Conda: Package and Environment Management

Install packages
Update packages
Create sandboxes: conda environments

Environment management with Conda
Create a new environment

conda create --name python34 python=3.4

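Activate an environment

activate python34 # for Windows
source activate python34 # for Linux & Mac
# conda 4.4 and later: conda activate python34 (all platforms)

Leave the current environment

deactivate # for Windows
source deactivate # for Linux & Mac
# conda 4.4 and later: conda deactivate (all platforms)

Remove an environment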

conda remove --name python34 --all

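Package management with Conda
Conda's package management is similar to pip
Install a Python package

conda install numpy

List installed Python packages

conda list
conda list -n python34 # list the packages installed in a specific environment

Remove a Python package

conda remove --name python34 numpy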
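
After switching environments or installing packages, you can confirm from inside Python which interpreter is running and which package versions it sees. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of conda.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# sys.prefix points at the active environment's directory.
print("interpreter:", sys.executable)
print("environment prefix:", sys.prefix)
print("numpy:", installed_version("numpy"))
```

If `numpy` prints as `None`, the package is not installed in the environment this interpreter belongs to.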

Data Science IDE vs Developer IDE

Data Science IDEs in Anaconda

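From IPython to Jupyter

What is IPython?

A powerful interactive shell
The kernel behind Jupyter
Supports interactive data analysis and visualization

IPython Kernel

Mainly responsible for running user code
Talks to the IPython shell over stdin/stdout
Talks to the notebook with JSON messages over ZeroMQ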
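
To make the "JSON message" idea concrete, here is a rough sketch of the dictionary a front end might build to ask a kernel to run code. The four top-level keys follow the Jupyter messaging layout; `make_execute_request`, the user name, and the protocol version string are illustrative assumptions. On the real wire, the parts travel as separate, signed ZeroMQ frames, which this sketch does not attempt.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id):
    """Build a dict shaped like a Jupyter 'execute_request' message (illustration only)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id,
            "username": "demo",  # hypothetical user name
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",  # assumed protocol version
        },
        "parent_header": {},
        "metadata": {},
        "content": {"code": code, "silent": False},
    }


message = make_execute_request("1 + 1", session_id=uuid.uuid4().hex)
wire_text = json.dumps(message)
print(wire_text)
```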

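What is Jupyter Notebook?

Formerly known as IPython Notebook
An open-source web application
Creates and shares documents that contain code, visualizations, and notes
Used for statistics, data analysis, modeling, machine learning, and related fields

How the notebook and the kernel interact

The Notebook server sits at the center
The Notebook server loads and saves notebooks

Notebook file format (.ipynb)

A JSON-based format defined by IPython Notebook
Can read online data and CSV/XLS files
Can be converted to other formats (py, html, pdf, md, etc.)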
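
Because an .ipynb file is plain JSON, the standard `json` module can read it. The sketch below builds a minimal notebook-shaped dictionary by hand (two hypothetical cells), serializes it, and pulls the code cells back out. It is a simplified illustration rather than a complete, validated notebook.

```python
import json

# A hand-written, minimal notebook structure (nbformat 4).
minimal_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

text = json.dumps(minimal_notebook, indent=1)  # what would be written to disk
loaded = json.loads(text)                      # what a reader would parse

# Keep only the code cells and print their source text.
code_cells = [cell for cell in loaded["cells"] if cell["cell_type"] == "code"]
for cell in code_cells:
    print("".join(cell["source"]))
```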

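NBViewer

An online viewer for notebooks in ipynb format
Notebooks can be shared by URL
GitHub integrates NBViewer
Converters make it easy to embed notebooks in blogs, emails, wikis, and books

Lab environment for this course

Install Anaconda on Windows/Mac/Linux
Use Python 3.6 as the base environment
Use Jupyter Notebook as the programming IDE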

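1-3 Installing Anaconda on Mac (Demo)

Download the macOS installer, Python 3.6+ 64-bit (Python 3.9 as of 2022/2/15)
Anaconda3-2021.11-MacOSX-x86_64.pkg
Choose "Install for me only" and keep the defaults for everything else
Changing the install directory is not recommended (the install needs 1.44 GB)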

~] ls
~] pwd
~] cd anaconda/
anaconda] ls
anaconda] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./conda list
bin] ./jupyter notebook # opens the browser

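1-4 Installing Anaconda on Windows (Demo)

Download the Windows installer, Python 3.6+ 64-bit (Python 3.9 as of 2022/2/15)
Anaconda3-2021.11-Windows-x86_64.exe
Choose "Just Me (recommended)" and keep the defaults for everything else
The installed Anaconda3 appears in the Start menu
Open Jupyter Notebook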

1-5 Installing Anaconda on Linux (Demo)

Download the Linux installer, Python 3.6+ 64-bit (Python 3.9 as of 2022/2/15)
Copy the installer link

~] wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
~] ls
Anaconda3-2021.11-Linux-x86_64.sh
~] ls -lh
~] sh Anaconda3-2021.11-Linux-x86_64.sh # accept the default options
~] pwd
/home/centos
~] cd anaconda3
anaconda3] ls
anaconda3] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./jupyter notebook --no-browser # copy the link it prints

Local terminal

~ ssh -N -f -L localhost:888:localhost:8888 gitlab-demo-ci
~ ssh -N -f -L localhost:888:localhost:8888 root@gitlab-demo-ci

Open a browser and paste the link

!ifconfig  # in a notebook cell, runs the Linux ifconfig command

1-6 Using Jupyter Notebook (Demo)

cd anaconda3
cd jupyter-notebook/python-data-science
python-data-science git:(master) ls
README.md    demo.ipynb
python-data-science git:(master) xx/bin/jupyter notebook # starts the notebook

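Chapter 2: Getting Started with Numpy

This chapter introduces Numpy, the most fundamental library in the Python data science ecosystem. It reviews the basics of matrix operations, introduces the key data structure Array, and shows how to perform array and matrix operations with Numpy.

2-1 Five Commonly Used Python Libraries for Data Science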

Numpy
Scipy
Pandas
Matplotlib
Scikit-learn

Numpy

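N-dimensional arrays (matrices): fast, efficient, vectorized operations
Efficient indexing with no explicit loops
Open source, free, and cross-platform, with performance comparable to C/Matlab

Scipy

Depends on Numpy
Designed for science and engineering
Implements many common scientific computations: linear algebra, Fourier transforms, signal and image processing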

Pandas

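A powerful tool for structured data analysis (depends on Numpy)
Provides several high-level data structures: Time-Series, DataFrame, Panel (Panel was removed in pandas 0.25)
Strong data indexing and processing capabilities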

Matplotlib

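The most widely used package for 2D plotting in Python
Covers most of Matlab's plotting features (scatter, line, bar, etc.)
mplot3d can draw polished 3D plots

Scikit-learn

A Python module for machine learning
Built on Scipy; provides common machine learning algorithms such as clustering and regression
A simple, easy-to-learn API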

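2-2 Math Review: Matrix Operations

Basic concepts

Matrix: rectangular data, that is, a two-dimensional array. Vectors and scalars are special cases of matrices
Vector: a 1xn or nx1 matrix
Scalar: a 1x1 matrix
Array: an N-dimensional array, a generalization of the matrix

Special matrices

All-zeros and all-ones matrices

Identity matrix

Matrix addition and subtraction

The two matrices must have the same number of rows and columns
Corresponding elements are added or subtracted

Array multiplication (element-wise)

Array multiplication (element-wise) multiplies corresponding elements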

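Matrix multiplication

Let A be an mxp matrix and B a pxn matrix. Their product is the mxn matrix C, written C=AB, where the element in row i and column j of C is: $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$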
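
The difference between element-wise multiplication and matrix multiplication can be sketched in plain Python with nested lists. The function names `elementwise_product` and `matrix_product` are hypothetical helpers written for this illustration.

```python
def elementwise_product(A, B):
    """Multiply corresponding elements; A and B must have the same shape."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def matrix_product(A, B):
    """Return C = AB, where C[i][j] is the sum over k of A[i][k] * B[k][j]."""
    m, p, n = len(A), len(B), len(B[0])
    if any(len(row) != p for row in A):
        raise ValueError("columns of A must equal rows of B")
    return [
        [sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)]
        for i in range(m)
    ]


A = [[1, 2, 3], [4, 5, 6]]    # 2x3
B = [[1, 0], [0, 1], [2, 2]]  # 3x2

print(elementwise_product(A, A))  # [[1, 4, 9], [16, 25, 36]]
print(matrix_product(A, B))       # [[7, 8], [16, 17]]
```

A 2x3 matrix times a 3x2 matrix gives a 2x2 result, while the element-wise product keeps the shape of its inputs.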

Other linear algebra topics

Linear Algebra, published by Tsinghua University Press
http://bs.szu.edu.cn/sljr/Up/day_110824/201108240409437708.pdf

2-3 Creating and Accessing Arrays

In Jupyter Notebook, create a new file Array.ipynb

# Creating and accessing arrays
import numpy as np
# create from python list
list_1 = [1, 2, 3, 4]
list_1        #  [1, 2, 3, 4]
array_1 = np.array(list_1)
array_1        # array([1, 2, 3, 4])
list_2 = [5, 6, 7, 8]
array_2 = np.array([list_1,list_2])
array_2
# array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
array_2.shape    # (2, 4)
array_2.size    # 8
array_2.dtype    # dtype('int32') on some machines, dtype('int64') on others
array_3 = np.array([[1.0,2,3],[4.0,5,6]])
array_3.dtype        # dtype('float64')
array_4 = np.arange(1,10)
array_4        # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array_4 = np.arange(1, 10, 2)
array_4        # array([1, 3, 5, 7, 9])
np.zeros(5)        # array([0., 0., 0., 0., 0.])    # zero vector
np.zeros([2,3])        # 2x3 two-dimensional zero matrix
# array([[0., 0., 0.],
       [0., 0., 0.]])
np.eye(5)    # 5x5 identity matrix
# array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
np.eye(5).dtype        # dtype('float64')
a = np.arange(1,10)
a        # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a[1]    # 2 (the second element)
a[1:5]    # array([2, 3, 4, 5]), the second through fifth elements
b = np.array([[1,2,3],[4,5,6]])
b
# array([[1, 2, 3],
       [4, 5, 6]])
b[1][0]        # 4
b[1,0]        # 4
c = np.array([[1,2,3],[4,5,6],[7,8,9]])
c
# array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
c[:2,1:]
# array([[2, 3],
       [5, 6]])

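2-4 Array and Matrix Operations

In Jupyter Notebook, create a new file 数组与矩阵运算.ipynb

# Quickly creating arrays
import numpy as np
np.random.randn(10)        # 1-D array of 10 floats drawn from the standard normal distribution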
# array([ 0.26674666, -0.91111093,  0.30684449, -0.80206634, -0.89176532,
        0.7950014 , -0.53259808, -0.09981816,  1.2960139 , -0.9668373 ])
np.random.randint(10)    # 0
np.random.randint(10,size=(2,3))    # 2x3 two-dimensional array with elements in [0, 9]
# array([[7, 5, 8],
       [1, 5, 8]])
np.random.randint(10,size=20)        # 1-D array of 20 elements in [0, 9]
# array([5, 6, 4, 8, 0, 9, 6, 2, 2, 9, 2, 1, 4, 6, 1, 5, 8, 2, 3, 4])
np.random.randint(10,size=20).reshape(4,5)    # reshape the 20-element 1-D array into a 4x5 two-dimensional array with elements in [0, 9]
# array([[7, 1, 0, 5, 7],
       [8, 0, 3, 7, 9],
       [9, 0, 7, 3, 2],
       [9, 1, 5, 8, 7]])

# Array operations
a = np.random.randint(10,size=20).reshape(4,5)
b = np.random.randint(10,size=20).reshape(4,5)
a
# array([[2, 3, 8, 4, 8],
       [0, 7, 9, 9, 9],
       [1, 8, 1, 8, 6],
       [3, 4, 7, 5, 1]])
b
# array([[8, 4, 3, 1, 6],
       [4, 4, 6, 2, 9],
       [9, 4, 8, 5, 8],
       [6, 2, 5, 5, 8]])
a + b
# array([[10,  7, 11,  5, 14],
       [ 4, 11, 15, 11, 18],
       [10, 12,  9, 13, 14],
       [ 9,  6, 12, 10,  9]])
a - b
# array([[-6, -1,  5,  3,  2],
       [-4,  3,  3,  7,  0],
       [-8,  4, -7,  3, -2],
       [-3,  2,  2,  0, -7]])
a * b
# array([[16, 12, 24,  4, 48],
       [ 0, 28, 54, 18, 81],
       [ 9, 32,  8, 40, 48],
       [18,  8, 35, 25,  8]])
a / b
# No exception is raised if b contains 0: Numpy emits a RuntimeWarning and the result holds inf (or nan for 0/0)
# array([[0.25      , 0.75      , 2.66666667, 4.        , 1.33333333],
       [0.        , 1.75      , 1.5       , 4.5       , 1.        ],
       [0.11111111, 2.        , 0.125     , 1.6       , 0.75      ],
       [0.5       , 2.        , 1.4       , 1.        , 0.125     ]])
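
What happens when the divisor holds zeros can be checked on a small hand-picked array. This sketch uses `np.errstate` to silence the warnings so that only the resulting values are shown. The array names are illustrative.

```python
import numpy as np

numerator = np.array([1, 0, 4])
denominator = np.array([0, 0, 2])

# Suppress the divide-by-zero and invalid-value warnings for this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    quotient = numerator / denominator

print(quotient)  # 1/0 gives inf, 0/0 gives nan, 4/2 gives 2.0
```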
np.mat([[1,2,3],[4,5,6]])
# matrix([[1, 2, 3],
        [4, 5, 6]])
a
# array([[2, 3, 8, 4, 8],
       [0, 7, 9, 9, 9],
       [1, 8, 1, 8, 6],
       [3, 4, 7, 5, 1]])
np.mat(a)
# matrix([[2, 3, 8, 4, 8],
        [0, 7, 9, 9, 9],
        [1, 8, 1, 8, 6],
        [3, 4, 7, 5, 1]])

# Matrix operations
A = np.mat(a)
B = np.mat(b)
A
# matrix([[2, 3, 8, 4, 8],
        [0, 7, 9, 9, 9],
        [1, 8, 1, 8, 6],
        [3, 4, 7, 5, 1]])
B
# matrix([[8, 4, 3, 1, 6],
        [4, 4, 6, 2, 9],
        [9, 4, 8, 5, 8],
        [6, 2, 5, 5, 8]])
A + B
# matrix([[10,  7, 11,  5, 14],
        [ 4, 11, 15, 11, 18],
        [10, 12,  9, 13, 14],
        [ 9,  6, 12, 10,  9]])
A - B
# matrix([[-6, -1,  5,  3,  2],
        [-4,  3,  3,  7,  0],
        [-8,  4, -7,  3, -2],
        [-3,  2,  2,  0, -7]])
A * B    # raises ValueError: A and B are both 4x5, so A's column count (5) does not match B's row count (4)

a = np.mat(np.random.randint(10,size=20).reshape(4,5))
b = np.mat(np.random.randint(10,size=20).reshape(5,4))
a
# matrix([[9, 9, 3, 0, 5],
        [9, 4, 6, 4, 5],
        [9, 0, 7, 0, 9],
        [7, 2, 6, 0, 6]])
b
# matrix([[2, 2, 6, 4],
        [8, 9, 8, 0],
        [2, 1, 3, 9],
        [3, 1, 0, 2],
        [9, 3, 1, 4]])    
a * b
# matrix([[141, 117, 140,  83],
        [119,  79, 109, 118],
        [113,  52,  84, 135],
        [ 96,  56,  82, 106]])
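
The same matrix product can be computed on ordinary arrays with the `@` operator, without converting to a matrix type first. The sketch below compares the two routes on small hand-picked inputs. It calls `np.asmatrix` rather than `np.mat` on the assumption that the shorter alias may not be available in newer Numpy releases.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])    # 2x3
b = np.array([[1, 0], [0, 1], [2, 2]])  # 3x2

product = a @ b                                    # matrix product on plain arrays
via_matrix = np.asmatrix(a) * np.asmatrix(b)       # matrix product on matrix objects

print(product)
print(np.array_equal(product, np.asarray(via_matrix)))
```

On plain arrays, `a * b` stays element-wise, so `@` keeps the two operations visibly distinct.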

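# Common Array functions
a = np.random.randint(10,size=20).reshape(4,5)
np.unique(a)    # distinct elements of a, sorted
# array([0, 1, 2, 3, 4, 5, 6, 8, 9])
a
# array([[4, 2, 8, 4, 2],
       [6, 9, 6, 4, 0],
       [9, 2, 6, 9, 0],
       [1, 3, 8, 5, 9]])
sum(a)        # Python's built-in sum adds the rows together, giving the sum of each column
# array([20, 16, 28, 22, 11])
sum(a[0])    # sum of the first row
# 20
sum(a[:,0])    # sum of the first column
# 20
a.max()        # largest value in a
# 9
max(a[0])    # largest value in the first row
# 8
max(a[:,0])    # largest value in the first column
# 9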
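
The built-in `sum` and `max` work row by row, which is easy to misread. Numpy's own methods take an `axis` argument that states the direction explicitly. A short sketch using the same 4x5 values as above:

```python
import numpy as np

a = np.array([[4, 2, 8, 4, 2],
              [6, 9, 6, 4, 0],
              [9, 2, 6, 9, 0],
              [1, 3, 8, 5, 9]])

total = a.sum()              # sum of every element
column_sums = a.sum(axis=0)  # collapse the rows: one sum per column
row_sums = a.sum(axis=1)     # collapse the columns: one sum per row
row_maxima = a.max(axis=1)   # largest value in each row

print(total)        # 97
print(column_sums)  # [20 16 28 22 11]
print(row_sums)     # [20 25 26 26]
print(row_maxima)   # [8 9 9 9]
```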

2-5 Array Input and Output

In Jupyter Notebook, create a new file Array的input和output.ipynb

# Serializing a numpy array with pickle
import pickle
import numpy as np
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f = open('x.pk1','wb')
pickle.dump(x, f)
f.close()    # flush and close the file before reading it back
!ls        # on Windows use !dir
# Array.ipynb            Array的input和output.ipynb
  x.pk1                    数组与矩阵运算.ipynb
f = open('x.pk1','rb')
pickle.load(f)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.save('one_array', x)
!ls
# Array.ipynb            Array的input和output.ipynb
  x.pk1                    one_array.npy
  数组与矩阵运算.ipynb
np.load('one_array.npy')
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.arange(20)
y
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
np.savez('two_array.npz', a=x, b=y)
!ls
# Array.ipynb                        two_array.npz
  Array的input和output.ipynb        x.pk1
  one_array.npy                        数组与矩阵运算.ipynb
np.load('two_array.npz')
# <numpy.lib.npyio.NpzFile at 0x17033c77df0>
c = np.load('two_array.npz')
c['a']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
c['b']
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
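
The same round trips can be written with `with` blocks so that every file is closed automatically. This sketch writes into a temporary directory so it leaves no files behind. The file names inside it are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np

x = np.arange(10)
y = np.arange(20)

with tempfile.TemporaryDirectory() as workdir:
    # pickle round trip: the file is closed when each with-block ends
    pickle_path = os.path.join(workdir, "x.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(x, f)
    with open(pickle_path, "rb") as f:
        x_from_pickle = pickle.load(f)

    # npz round trip: several named arrays in one archive
    npz_path = os.path.join(workdir, "two_array.npz")
    np.savez(npz_path, a=x, b=y)
    with np.load(npz_path) as archive:
        restored = {name: archive[name] for name in archive.files}

print(x_from_pickle)
print(sorted(restored))
```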

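Scipy documentation

Current: https://docs.scipy.org/doc/scipy/getting_started.html
Previous: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

Chapter 3: Getting Started with Pandas

This chapter introduces Pandas, the most important data analysis library in the Python data science ecosystem. It starts with the two key pandas data structures, Series and DataFrame, covers how to create them and their basic operations, and uses hands-on examples to show how Series and DataFrame relate.

3-1 Pandas Series

In Jupyter Notebook, create a new file Series.ipynb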

import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1
# 0    1
  1    2
  2    3
  3    4
  dtype: int64
s1.values
# array([1, 2, 3, 4], dtype=int64)
s1.index
# RangeIndex(start=0, stop=4, step=1)
s2 = pd.Series(np.arange(10))
s2            # dtype: int64 on some machines
# 0    0
  1    1
  2    2
  3    3
  4    4
  5    5
  6    6
  7    7
  8    8
  9    9
  dtype: int32
s3 = pd.Series({'1':1, '2':2, '3':3})
s3
# 1    1
  2    2
  3    3
  dtype: int64
s3.values
# array([1, 2, 3], dtype=int64)
s3.index
# Index(['1', '2', '3'], dtype='object')
s4 = pd.Series([1,2,3,4],index=['A','B','C','D'])
s4
# A    1
  B    2
  C    3
  D    4
  dtype: int64
s4.values
# array([1, 2, 3, 4], dtype=int64)
s4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
s4['A']
# 1
s4[s4>2]
# C    3
  D    4
  dtype: int64
s4
# A    1
  B    2
  C    3
  D    4
  dtype: int64
s4.to_dict()
# {'A': 1, 'B': 2, 'C': 3, 'D': 4} 
s5 = pd.Series(s4.to_dict())
s5
# A    1
  B    2
  C    3
  D    4
  dtype: int64
index_1 = ['A', 'B', 'C', 'D','E']
s6 = pd.Series(s5,index=index_1)
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  dtype: float64
pd.isnull(s6)
# A    False
  B    False
  C    False
  D    False
  E     True
  dtype: bool
pd.notnull(s6)
# A     True
  B     True
  C     True
  D     True
  E    False
  dtype: bool
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  dtype: float64
s6.name = 'demo'
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  Name: demo, dtype: float64
s6.index.name = 'demo index'
s6
# demo index
  A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  Name: demo, dtype: float64
s6.index
# Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
s6.values
# array([ 1.,  2.,  3.,  4., nan])
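
Building a Series against a longer index, as with `s6` above, is essentially a reindex: labels with no matching value become NaN, and the integer data is converted to float to hold them. A minimal sketch of that behaviour, with variable names chosen for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Ask for one extra label; "E" has no value, so it becomes NaN.
extended = s.reindex(["A", "B", "C", "D", "E"])

# Replace the missing value and convert back to integers.
filled = extended.fillna(0).astype("int64")

print(extended)
print(filled)
```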

3-2 Pandas DataFrame

In Jupyter Notebook, create a new file DataFrame.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import webbrowser
link = 'https://www.tiobe.com/tiobe-index/'
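webbrowser.open(link)        # open the link in a browser
# True
df = pd.read_clipboard()    # first copy 10 rows of the table on the page, header included
df
# Output (the header "Programming Language" was split on whitespace, so the language names landed under Programming and the ratings under Language)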
Position    Programming    Language    Ratings
0    21    SAS    0.66%    None
1    22    Scratch    0.64%    None
2    23    Fortran    0.58%    None
3    24    Rust    0.54%    None
4    25    (Visual)    FoxPro    0.52%
5    26    COBOL    0.42%    None
6    27    Dart    0.42%    None
7    28    Kotlin    0.41%    None
8    29    Lua    0.40%    None
9    30    Julia    0.40%    None

type(df)
# pandas.core.frame.DataFrame
df.columns
# Index(['Position', 'Programming', 'Language', 'Ratings'], dtype='object')
df.Ratings
#
0     None
1     None
2     None
3     None
4    0.52%
5     None
6     None
7     None
8     None
9     None
Name: Ratings, dtype: object

df_new = DataFrame(df,columns=['Programming','Language'])
df_new
# Output
Programming    Language
0    SAS    0.66%
1    Scratch    0.64%
2    Fortran    0.58%
3    Rust    0.54%
4    (Visual)    FoxPro
5    COBOL    0.42%
6    Dart    0.42%
7    Kotlin    0.41%
8    Lua    0.40%
9    Julia    0.40%

df['Position']
#
0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: Position, dtype: int64

type(df['Position'])
# pandas.core.series.Series
df_new = DataFrame(df,columns=['Programming','Language','Language1'])
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    NaN
2    Fortran    0.58%    NaN
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN

# Three ways to fill the column
df_new['Language1'] = range(0,10)
# df_new['Language1'] = np.arange(0,10)
# df_new['Language1'] = pd.Series(np.arange(0,10))
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    0
1    Scratch    0.64%    1
2    Fortran    0.58%    2
3    Rust    0.54%    3
4    (Visual)    FoxPro    4
5    COBOL    0.42%    5
6    Dart    0.42%    6
7    Kotlin    0.41%    7
8    Lua    0.40%    8
9    Julia    0.40%    9

df_new['Language1'] = pd.Series([100,200], index=[1,2])
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    100.0
2    Fortran    0.58%    200.0
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN
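
The NaN values above come from label alignment: a `range` or array is assigned by position, while a Series is matched to the DataFrame's index label by label. The sketch below shows both on a small hypothetical frame.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# Assigned by position: the length must match the number of rows.
df["by_position"] = range(4)

# Assigned by label: only index labels 1 and 2 receive values, the rest become NaN.
df["by_label"] = pd.Series([100, 200], index=[1, 2])

print(df)
```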

3-3 A Closer Look at Series and DataFrame

In Jupyter Notebook, create a new file 深入理解Series和Dataframe.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = {'Country':['Belgium', 'India', 'Brazil'],
       'Capital':['Brussels','New Delhi', 'Brasilia'],
       'Population':[11190846, 1303171035, 207847528]}

#Series
s1 = pd.Series(data['Country'])
s1
# Output
0    Belgium
1      India
2     Brazil
dtype: object

s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# RangeIndex(start=0, stop=3, step=1)
s1 = pd.Series(data['Country'],index=['A','B','C'])
s1
# Output
A    Belgium
B      India
C     Brazil
dtype: object

s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# Index(['A', 'B', 'C'], dtype='object')

#Dataframe
df1 = pd.DataFrame(data)
df1
# Output
    Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df1['Country']
# Output
0    Belgium
1      India
2     Brazil
Name: Country, dtype: object

cou = df1['Country']
type(cou)
# pandas.core.series.Series
df1.iterrows()
# <generator object DataFrame.iterrows at 0x0000018DD44C59E0>

for row in df1.iterrows():
    print(row)
    print(type(row))
    print(len(row))
# Output
(0, Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object)
<class 'tuple'>
2
(1, Country            India
Capital        New Delhi
Population    1303171035
Name: 1, dtype: object)
<class 'tuple'>
2
(2, Country          Brazil
Capital        Brasilia
Population    207847528
Name: 2, dtype: object)
<class 'tuple'>
2

for row in df1.iterrows():
    print(type(row[0]),row[0],row[1])
    break
# Output
<class 'int'> 0 Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object

# Note: in some environments the index label type prints as <class 'numpy.int64'> instead of <class 'int'>


for row in df1.iterrows():
    print(type(row[0]),type(row[1]))
    break
# Output
<class 'int'> <class 'pandas.core.series.Series'>

# Note: in some environments the index label type prints as <class 'numpy.int64'> instead of <class 'int'>


df1
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

data
# Output
{'Country': ['Belgium', 'India', 'Brazil'],
 'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
 'Population': [11190846, 1303171035, 207847528]}


s1 = pd.Series(data['Country'])
s2 = pd.Series(data['Capital'])
s3 = pd.Series(data['Population'])
df_new = pd.DataFrame([s1,s2,s3])
df_new
# Output
    0    1    2
0    Belgium    India    Brazil
1    Brussels    New Delhi    Brasilia
2    11190846    1303171035    207847528

df1
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df_new = df_new.T
df_new
# Output
    0    1    2
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df_new = pd.DataFrame([s1,s2,s3], index=['Country','Capital','Population'])
df_new
# Output
        0    1    2
Country    Belgium    India    Brazil
Capital    Brussels    New Delhi    Brasilia
Population    11190846    1303171035    207847528

df_new = df_new.T
df_new
# Output
    Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
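
A list of Series becomes the rows of a DataFrame, which is why the transpose was needed above. Passing a dict of Series makes each Series a column directly. One practical difference, sketched below with the same three countries, is the column dtype: the transposed frame stores Population as generic objects, while the dict route keeps it as integers.

```python
import pandas as pd

s1 = pd.Series(["Belgium", "India", "Brazil"])
s2 = pd.Series(["Brussels", "New Delhi", "Brasilia"])
s3 = pd.Series([11190846, 1303171035, 207847528])

# Route 1: each Series is a row, then transpose.
by_rows = pd.DataFrame([s1, s2, s3], index=["Country", "Capital", "Population"]).T

# Route 2: each Series is a column.
by_columns = pd.DataFrame({"Country": s1, "Capital": s2, "Population": s3})

print(by_rows.dtypes)
print(by_columns.dtypes)
```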

3-4 Pandas DataFrame I/O

In Jupyter Notebook, create a new file DataFrame IO.ipynb

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

import webbrowser

link = 'http://pandas.pydata.org/pandas-docs/version/0.20/io.html'
webbrowser.open(link)    # opens the browser and returns True; copy the table on the page
# True

df1 = pd.read_clipboard()
df1
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_clipboard()
df1
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_csv('df1.csv')
!ls   # on Windows use !dir
# DataFrame IO.ipynb    df1.csv

!more df1.csv
# Output
,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Msgpack,read_msgpack,to_msgpack
8,binary,Stata,read_stata,to_stata
9,binary,SAS,read_sas, 
10,binary,Python Pickle Format,read_pickle,to_pickle
11,SQL,SQL,read_sql,to_sql
12,SQL,Google Big Query,read_gbq,to_gbq

df1.to_csv('df1.csv',index=False)    # omit the index column
!ls
# DataFrame IO.ipynb    df1.csv

!more df1.csv
# Output
Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata
binary,SAS,read_sas, 
binary,Python Pickle Format,read_pickle,to_pickle
SQL,SQL,read_sql,to_sql
SQL,Google Big Query,read_gbq,to_gbq

df2 = pd.read_csv('df1.csv')
df2
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_json()
# Output
'{"Format Type":{"0":"text","1":"text","2":"text","3":"text","4":"binary","5":"binary","6":"binary","7":"binary","8":"binary","9":"binary","10":"binary","11":"SQL","12":"SQL"},"Data Description":{"0":"CSV","1":"JSON","2":"HTML","3":"Local clipboard","4":"MS Excel","5":"HDF5 Format","6":"Feather Format","7":"Msgpack","8":"Stata","9":"SAS","10":"Python Pickle Format","11":"SQL","12":"Google Big Query"},"Reader":{"0":"read_csv","1":"read_json","2":"read_html","3":"read_clipboard","4":"read_excel","5":"read_hdf","6":"read_feather","7":"read_msgpack","8":"read_stata","9":"read_sas","10":"read_pickle","11":"read_sql","12":"read_gbq"},"Writer":{"0":"to_csv","1":"to_json","2":"to_html","3":"to_clipboard","4":"to_excel","5":"to_hdf","6":"to_feather","7":"to_msgpack","8":"to_stata","9":" ","10":"to_pickle","11":"to_sql","12":"to_gbq"}}'

pd.read_json(df1.to_json())
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq


df1.to_html()
# Output
'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Format Type</th>\n      <th>Data Description</th>\n      <th>Reader</th>\n      <th>Writer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>text</td>\n      <td>CSV</td>\n      <td>read_csv</td>\n      <td>to_csv</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>text</td>\n      <td>JSON</td>\n      <td>read_json</td>\n      <td>to_json</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>text</td>\n      <td>HTML</td>\n      <td>read_html</td>\n      <td>to_html</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>text</td>\n      <td>Local clipboard</td>\n      <td>read_clipboard</td>\n      <td>to_clipboard</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>binary</td>\n      <td>MS Excel</td>\n      <td>read_excel</td>\n      <td>to_excel</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>binary</td>\n      <td>HDF5 Format</td>\n      <td>read_hdf</td>\n      <td>to_hdf</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>binary</td>\n      <td>Feather Format</td>\n      <td>read_feather</td>\n      <td>to_feather</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>binary</td>\n      <td>Msgpack</td>\n      <td>read_msgpack</td>\n      <td>to_msgpack</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>binary</td>\n      <td>Stata</td>\n      <td>read_stata</td>\n      <td>to_stata</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>binary</td>\n      <td>SAS</td>\n      <td>read_sas</td>\n      <td></td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>binary</td>\n      <td>Python Pickle Format</td>\n      <td>read_pickle</td>\n      <td>to_pickle</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>SQL</td>\n      <td>SQL</td>\n      <td>read_sql</td>\n      <td>to_sql</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>SQL</td>\n      <td>Google Big Query</td>\n      
<td>read_gbq</td>\n      <td>to_gbq</td>\n    </tr>\n  </tbody>\n</table>'

df1.to_html('df1.html')
!ls
# DataFrame IO.ipynb    df1.csv        df1.html
df1.to_excel('df1.xlsx')
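
The CSV and JSON round trips can also be tried without touching the disk or the clipboard. When `to_csv` and `to_json` are called without a path they return text, and `io.StringIO` lets the readers consume that text as if it were a file. The two-row frame below is a hypothetical stand-in for the copied table.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Format Type": ["text", "binary"],
    "Reader": ["read_csv", "read_excel"],
    "Writer": ["to_csv", "to_excel"],
})

with_index = df.to_csv()                # first column holds the row index
without_index = df.to_csv(index=False)  # index column omitted

restored_csv = pd.read_csv(io.StringIO(without_index))
restored_json = pd.read_json(io.StringIO(df.to_json()))

print(with_index.splitlines()[0])     # ,Format Type,Reader,Writer
print(without_index.splitlines()[0])  # Format Type,Reader,Writer
print(restored_csv)
```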

3-5 Selecting and Indexing in a DataFrame

In Jupyter Notebook, create a new file Selecting and indexing.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

!pwd    # on Windows use !cd (or !chdir)
# /Users/xxx/xx

!ls /Users/xxx/xx/homework    # on Windows use !dir
# movie_metadata.csv

imdb = pd.read_csv('/Users/xxx/xx/homework/movie_metadata.csv')
imdb
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
5043 rows × 28 columns

imdb.shape
# (5043, 28)

imdb.head()
# Output
    color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
5 rows × 28 columns

imdb.tail(10)
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
5033    Color    Shane Carruth    143.0    77.0    291.0    8.0    David Sullivan    291.0    424760.0    Drama|Sci-Fi|Thriller    ...    371.0    English    USA    PG-13    7000.0    2004.0    45.0    7.0    1.85    19000
5034    Color    Neill Dela Llana    35.0    80.0    0.0    0.0    Edgar Tancangco    0.0    70071.0    Thriller    ...    35.0    English    Philippines    Not Rated    7000.0    2005.0    0.0    6.3    NaN    74
5035    Color    Robert Rodriguez    56.0    81.0    0.0    6.0    Peter Marquardt    121.0    2040920.0    Action|Crime|Drama|Romance|Thriller    ...    130.0    Spanish    USA    R    7000.0    1992.0    20.0    6.9    1.37    0
5036    Color    Anthony Vallone    NaN    84.0    2.0    2.0    John Considine    45.0    NaN    Crime|Drama    ...    1.0    English    USA    PG-13    3250.0    2005.0    44.0    7.8    NaN    4
5037    Color    Edward Burns    14.0    95.0    0.0    133.0    Caitlin FitzGerald    296.0    4584.0    Comedy|Drama    ...    14.0    English    USA    Not Rated    9000.0    2011.0    205.0    6.4    NaN    413
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
10 rows × 28 columns

imdb['color']
# Output
0       Color
1       Color
2       Color
3       Color
4         NaN
        ...  
5038    Color
5039    Color
5040    Color
5041    Color
5042    Color
Name: color, Length: 5043, dtype: object

imdb['color'][0]
# 'Color'
imdb['color'][1]
# 'Color'

imdb[['color','director_name']]
# Output
    color    director_name
0    Color    James Cameron
1    Color    Gore Verbinski
2    Color    Sam Mendes
3    Color    Christopher Nolan
4    NaN    Doug Walker
...    ...    ...
5038    Color    Scott Smith
5039    Color    NaN
5040    Color    Benjamin Roberds
5041    Color    Daniel Hsia
5042    Color    Jon Gunn
5043 rows × 2 columns
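
The two selections above return different types: a single label gives a Series, while a list of labels gives a DataFrame. A minimal sketch of that difference, using a small made-up frame (`movies` is a hypothetical stand-in, not the real imdb data):

```python
import pandas as pd

# Hypothetical three-row stand-in for the imdb table used above.
movies = pd.DataFrame({
    'color': ['Color', 'Color', None],
    'director_name': ['James Cameron', 'Gore Verbinski', 'Doug Walker'],
})

col = movies['color']      # single label -> Series
sub = movies[['color']]    # list of labels -> DataFrame, even with one column

print(type(col).__name__)  # Series
print(type(sub).__name__)  # DataFrame
print(col.loc[0])          # Color
```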

sub_df = imdb[['director_name','movie_title','imdb_score']]
sub_df
# Output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1
...    ...    ...    ...
5038    Scott Smith    Signed Sealed Delivered    7.7
5039    NaN    The Following    7.5
5040    Benjamin Roberds    A Plague So Pleasant    6.3
5041    Daniel Hsia    Shanghai Calling    6.3
5042    Jon Gunn    My Date with Drew    6.6
5043 rows × 3 columns

sub_df.head()
# Output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.head(5)
# Output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.iloc[10:20,:]
# Output
    director_name    movie_title    imdb_score
10    Zack Snyder    Batman v Superman: Dawn of Justice    6.9
11    Bryan Singer    Superman Returns    6.1
12    Marc Forster    Quantum of Solace    6.7
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest    7.3
14    Gore Verbinski    The Lone Ranger    6.5
15    Zack Snyder    Man of Steel    7.2
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian    6.6
17    Joss Whedon    The Avengers    8.1
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides    6.7
19    Barry Sonnenfeld    Men in Black 3    6.8

sub_df.iloc[10:20,0:2]
# Output
    director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df = sub_df.iloc[10:20,0:2]
tmp_df
# Output
    director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df.iloc[2:4,:]
# Output
    director_name    movie_title
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest

tmp_df.loc[15:17,:]
# Output
    director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'movie_title']
# Output
    director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'director_name']
# Output
    director_name
15    Zack Snyder
16    Andrew Adamson
17    Joss Whedon
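
The outputs above show two different slicing rules: `iloc` counts row positions and excludes the end, while `loc` matches index labels and includes the end. A small sketch, assuming a hypothetical frame whose labels (10..14) differ from its positions (0..4), much like `tmp_df`:

```python
import pandas as pd

# Hypothetical frame: index labels 10..14, row positions 0..4.
df = pd.DataFrame(
    {'director_name': list('abcde'), 'movie_title': list('vwxyz')},
    index=range(10, 15),
)

by_position = df.iloc[2:4, :]   # positions 2 and 3; the end is exclusive
by_label = df.loc[12:13, :]     # labels 12 and 13; the end is inclusive

print(by_position.index.tolist())                     # [12, 13]
print(by_label.index.tolist())                        # [12, 13]
print(df.loc[:, :'director_name'].columns.tolist())   # ['director_name']
```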

3-6 Reindexing Series and DataFrame

In Jupyter Notebook, create a new file: Reindexing Series and DataFrame.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# series reindex
s1 = Series([1,2,3,4], index=['A','B','C','D'])
s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

# s1.reindex()    # Place the cursor on the method and press Shift+Tab to open its documentation; press repeatedly for more detail
s1.reindex(index=['A','B','C','D','E'])
# Output
A    1.0
B    2.0
C    3.0
D    4.0
E    NaN
dtype: float64

s1.reindex(index=['A','B','C','D','E'],fill_value=0)
# Output
A    1
B    2
C    3
D    4
E    0
dtype: int64

s1.reindex(index=['A','B','C','D','E'],fill_value=10)
# Output
A     1
B     2
C     3
D     4
E    10
dtype: int64
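
Note the dtype change in the outputs above: a plain integer Series cannot hold NaN, so reindexing with a missing label converts it to float64, whereas supplying an integer `fill_value` keeps it int64. A short sketch of that behaviour (variable names here are illustrative only):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])

with_nan = s.reindex(['A', 'B', 'C', 'D', 'E'])                 # 'E' becomes NaN
with_zero = s.reindex(['A', 'B', 'C', 'D', 'E'], fill_value=0)  # 'E' becomes 0

print(with_nan.dtype)    # float64
print(with_zero.dtype)   # int64
print(s.index.tolist())  # ['A', 'B', 'C', 'D'] -- reindex returns a new Series
```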

s2 = Series(['A','B','C'], index=[1,5,10])
s2
# Output
1     A
5     B
10    C
dtype: object

s2.reindex(index=range(15))
# Output
0     NaN
1       A
2     NaN
3     NaN
4     NaN
5       B
6     NaN
7     NaN
8     NaN
9     NaN
10      C
11    NaN
12    NaN
13    NaN
14    NaN
dtype: object

s2.reindex(index=range(15),method='ffill')
# Output
0     NaN
1       A
2       A
3       A
4       A
5       B
6       B
7       B
8       B
9       B
10      C
11      C
12      C
13      C
14      C
dtype: object
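
With `method='ffill'`, each new label takes the value of the nearest existing label before it, which is why label 0 stays NaN. As a comparison, the sketch below runs the same reindex with `method='bfill'`, which looks forward instead. It assumes the original index is sorted, as it is here:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C'], index=[1, 5, 10])

forward = s.reindex(range(12), method='ffill')    # copy from the previous label
backward = s.reindex(range(12), method='bfill')   # copy from the next label

print(forward.loc[3])            # A
print(backward.loc[3])           # B
print(pd.isna(forward.loc[0]))   # True: nothing precedes label 0
print(pd.isna(backward.loc[11])) # True: nothing follows label 11
```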

# reindex dataframe
df1 = DataFrame(np.random.rand(25).reshape([5,5]))
df1
# Output
    0    1    2    3    4
0    0.255424    0.315708    0.951327    0.423676    0.975377
1    0.087594    0.192460    0.502268    0.534926    0.423024
2    0.817002    0.113410    0.468270    0.410297    0.278942
3    0.315239    0.018933    0.133764    0.240001    0.910754
4    0.267342    0.451077    0.282865    0.170235    0.898429


df1 = DataFrame(np.random.rand(25).reshape([5,5]), index=['A','B','D','E','F'], columns=['c1','c2','c3','c4','c5'])
df1
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B','C','D','E','F'])
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
C    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(columns=['c1','c2','c3','c4','c5','c6'])
# Output
    c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN

df1.reindex(index=['A','B','C','D','E','F'],columns=['c1','c2','c3','c4','c5','c6'])
# Output
    c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
C    NaN    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN


s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

s1.reindex(index=['A','B'])
# Output
A    1
B    2
dtype: int64


df1
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B'])
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878

s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

s1.drop('A')
# Output
B    2
C    3
D    4
dtype: int64

df1
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('A',axis=0)
# Output
    c1    c2    c3    c4    c5
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('c1',axis=0)
# Raises KeyError: 'c1' is a column label, so it is not found among the row labels (axis=0)

df1.drop('c1',axis=1)
# Output
    c2    c3    c4    c5
A    0.894546    0.932129    0.178442    0.303684
B    0.260677    0.708358    0.275914    0.369878
D    0.125907    0.191987    0.338194    0.009877
E    0.909661    0.227301    0.343989    0.610203
F    0.306472    0.197467    0.063800    0.813786
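
The `drop` calls above depend on `axis` matching where the label lives. The sketch below reproduces the KeyError on a small made-up frame and shows two alternatives that `DataFrame.drop` also accepts: the `columns=` keyword and `errors='ignore'`. The frame and variable names are illustrative, not from the course notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['A', 'B'], columns=['c1', 'c2', 'c3'])

try:
    df.drop('c1', axis=0)      # 'c1' is a column label, not a row label
except KeyError as exc:
    print('KeyError:', exc)

no_c1 = df.drop(columns='c1')                       # same as df.drop('c1', axis=1)
unchanged = df.drop('c1', axis=0, errors='ignore')  # missing labels are skipped

print(no_c1.columns.tolist())  # ['c2', 'c3']
print(df.columns.tolist())     # ['c1', 'c2', 'c3'] -- drop returns a new frame
```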

3-7 A Closer Look at NaN

In Jupyter Notebook, create a new file: 谈一谈NaN.ipynb

# NaN - means Not a Number
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

n = np.nan
type(n)
# float

m = 1
m + n
# nan
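
Because NaN is a float that propagates through arithmetic, it also behaves unusually under comparison: it is not equal to anything, including itself. A brief sketch of why `np.isnan` or `pd.isna` should be used instead of `==` when testing for it:

```python
import numpy as np
import pandas as pd

n = np.nan

print(n == n)        # False: NaN never compares equal, even to itself
print(np.isnan(n))   # True
print(pd.isna(n))    # True
print(1 + n)         # nan: arithmetic with NaN yields NaN
```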


# NaN in Series
s1 = Series([1, 2, np.nan, 3, 4], index=['A','B','C','D','E'])
s1
# Output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.isnull()
# Output
A    False
B    False
C     True
D    False
E    False
dtype: bool

s1.notnull()
# Output
A     True
B     True
C    False
D     True
E     True
dtype: bool

s1
# Output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.dropna()
# Output
A    1.0
B    2.0
D    3.0
E    4.0
dtype: float64
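
Unlike the scalar `m + n` case earlier, pandas aggregations skip NaN by default, so dropping it first is often unnecessary. A small sketch on the same values as `s1` (the name `s` is used here to avoid overwriting the notebook variable):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 3, 4], index=['A', 'B', 'C', 'D', 'E'])

print(s.sum())               # 10.0: NaN is skipped by default
print(s.sum(skipna=False))   # nan: NaN propagates when skipping is turned off
print(s.count())             # 4: count() only counts non-NaN values

# Boolean indexing with notnull() selects the same rows as dropna().
print(s[s.notnull()].equals(s.dropna()))   # True
```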

# NaN in DataFrame
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])
dframe
# Output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0
3    NaN    NaN    NaN

dframe.isnull()
# Output
    0    1    2
0    False    False    False
1    True    False    False
2    False    True    False
3    True    True    True

dframe.notnull()
# Output
    0    1    2
0    True    True    True
1    False    True    True
2    True    False    True
3    False    False    False

df1 = dframe.dropna(axis=0)
df1
# Output
    0    1    2
0    1.0    2.0    3.0


df1 = dframe.dropna(axis=1)
df1
# Output: every column contains at least one NaN, so all columns are dropped and only the index remains
0
1
2
3

df1 = dframe.dropna(axis=1,how='any')
df1
# Output: how='any' is the default, so the result is the same as above (only the index remains)
0
1
2
3

df1 = dframe.dropna(axis=0,how='any')
df1
# Output
    0    1    2
0    1.0    2.0    3.0

df1 = dframe.dropna(axis=0,how='all')
df1
# Output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0
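
The `how` argument decides whether one NaN is enough to drop a row or column (`'any'`) or whether every value must be NaN (`'all'`). The sketch below applies both along `axis=1` to the same data as `dframe`, then chains two calls to show that the order of cleaning matters. The name `cleaned` is illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[1, 2, 3], [nan, 5, 6], [7, nan, 9], [nan, nan, nan]])

print(df.dropna(axis=1, how='any').shape)   # (4, 0): every column has a NaN
print(df.dropna(axis=1, how='all').shape)   # (4, 3): no column is entirely NaN

# Drop the all-NaN row first, then drop columns that still contain a NaN.
cleaned = df.dropna(axis=0, how='all').dropna(axis=1, how='any')
print(cleaned.shape)               # (3, 1)
print(cleaned.columns.tolist())    # [2]
```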

dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])
dframe2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

df2 = dframe2.dropna(thresh=None)
df2
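# Output: every row contains at least one NaN, so no rows remain (thresh=None was the old default and behaves like plain dropna(); recent pandas versions may reject an explicit None)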
0    1    2    3

df2 = dframe2.dropna(thresh=2)
df2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
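
`thresh=2` keeps a row only if it has at least two non-NaN values, which is why row 3 (one valid value) disappears. A sketch that makes the rule explicit by counting valid values per row, using the same data as `dframe2`:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[1, 2, 3, nan], [2, nan, 5, 6],
                   [nan, 7, nan, 9], [1, nan, nan, nan]])

non_null = df.count(axis=1)            # number of non-NaN values in each row
print(non_null.tolist())               # [3, 3, 2, 1]

kept = df.dropna(thresh=2)             # keep rows with at least 2 valid values
print(kept.index.tolist())             # [0, 1, 2]
print(kept.equals(df[non_null >= 2]))  # True: same result as the manual filter
```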

dframe2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

dframe2.fillna(value=1)
# Output
    0    1    2    3
0    1.0    2.0    3.0    1.0
1    2.0    1.0    5.0    6.0
2    1.0    7.0    1.0    9.0
3    1.0    1.0    1.0    1.0

dframe2.fillna(value={0:0,1:1,2:2,3:3})    # fill each column with its own value (dict keys are column labels)
# Output
    0    1    2    3
0    1.0    2.0    3.0    3.0
1    2.0    1.0    5.0    6.0
2    0.0    7.0    2.0    9.0
3    1.0    1.0    2.0    3.0
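
The per-column dict above can be replaced by any label-indexed Series. One common choice, shown here only as an illustration and not as part of the course notebook, is to fill each column with its own mean:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[1, 2, 3, nan], [2, nan, 5, 6],
                   [nan, 7, nan, 9], [1, nan, nan, nan]])

col_means = df.mean()           # NaN is skipped when averaging each column
filled = df.fillna(col_means)   # the Series is matched to columns by label

print(col_means.loc[1])         # 4.5
print(filled.loc[0, 3])         # 7.5
print(filled.loc[3, 1])         # 4.5
print(df.isnull().sum().sum())  # 7 -- fillna returns a new frame; df is unchanged
```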

df1
# Output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

df2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

df1.dropna()
# Output
    0    1    2
0    1.0    2.0    3.0

df1.fillna(1)
# Output
    0    1    2
0    1.0    2.0    3.0
1    1.0    5.0    6.0
2    7.0    1.0    9.0

Python Data Analysis: Introduction and Practice - Notes

Chapter 1: Setting Up the Lab Environment