Pandas Data Cleaning: Handling Missing Values

2022-10-08


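In Pandas, the dropna method filters out missing values according to a condition, isnull marks which values are missing, notnull marks which values are not missing, and fillna fills in missing values.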

import pandas as pd

frame = pd.DataFrame([[1,2,3,None], [4,7,None,3], [None, None, None, None]])
frame.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 4 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   0       2 non-null      float64
#  1   1       2 non-null      float64
#  2   2       1 non-null      float64
#  3   3       1 non-null      float64
# dtypes: float64(4)
# memory usage: 224.0 bytes

# Deletion
# Drop a row if any of its elements is missing
frame.dropna(how='any')
# Empty DataFrame
# Columns: [0, 1, 2, 3]
# Index: []

# Drop a row only if all of its elements are NaN
frame.dropna(how='all')
#      0    1    2    3
# 0  1.0  2.0  3.0  NaN
# 1  4.0  7.0  NaN  3.0
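
Beyond how, dropna also accepts axis, thresh, and subset for finer-grained conditions. The following is a minimal sketch rather than part of the original walkthrough; it rebuilds the same frame so that it runs on its own, and the result variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, None], [4, 7, None, 3], [None, None, None, None]])

# Keep only rows that have at least 3 non-missing values
at_least_three = frame.dropna(thresh=3)

# Drop rows where column 0 is missing, ignoring the other columns
col0_present = frame.dropna(subset=[0])

# axis=1 applies the rule to columns: drop columns that are entirely missing
no_empty_cols = frame.dropna(axis=1, how='all')

print(at_least_three)
print(col0_present)
print(no_empty_cols)
```

With this data, thresh=3 and subset=[0] both keep rows 0 and 1, and no column is dropped because every column holds at least one value.
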

# Which values are missing
frame.isnull()
#        0      1      2      3
# 0  False  False  False   True
# 1  False  False   True  False
# 2   True   True   True   True
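
The notnull method mentioned at the top is the element-wise inverse of isnull. As a small sketch (the frame is rebuilt so the snippet is self-contained, and the variable names are illustrative), the boolean masks can also be used for counting and for row selection:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, None], [4, 7, None, 3], [None, None, None, None]])

# notnull marks the values that are present (True) versus missing (False)
present = frame.notnull()

# Summing the isnull mask counts the missing values in each column
missing_per_column = frame.isnull().sum()

# Boolean indexing: keep only the rows where column 0 is present
rows_with_col0 = frame[frame[0].notnull()]

print(present)
print(missing_per_column)
print(rows_with_col0)
```
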

# Fill with a constant
frame.fillna(1)
#      0    1    2    3
# 0  1.0  2.0  3.0  1.0
# 1  4.0  7.0  1.0  3.0
# 2  1.0  1.0  1.0  1.0

# Fill with a dictionary
# Column 0 is filled with 1, column 1 with 5, column 2 with 10, column 3 with 20
frame.fillna({0:1, 1:5, 2:10, 3:20})
#      0    1     2     3
# 0  1.0  2.0   3.0  20.0
# 1  4.0  7.0  10.0   3.0
# 2  1.0  5.0  10.0  20.0

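# Forward fill: within each column, fill a missing value with the last value seen above it
# (fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent)
frame.ffill()
#      0    1    2    3
# 0  1.0  2.0  3.0  NaN
# 1  4.0  7.0  3.0  3.0
# 2  4.0  7.0  3.0  3.0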

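# Fill each missing value with the previous non-missing value
# ('pad' is an alias of 'ffill', so this gives the same result as a forward fill)
frame.fillna(method='pad')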
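
Filling can also run in the opposite direction, or be capped. The sketch below is an addition to the original examples, assuming the same data; it uses the bfill method and the limit parameter of ffill, and the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, None], [4, 7, None, 3], [None, None, None, None]])

# Backward fill: each missing value takes the next non-missing value below it
back_filled = frame.bfill()

# limit caps how many consecutive missing values are filled in each column
forward_limited = frame.ffill(limit=1)

print(back_filled)
print(forward_limited)
```

The last row stays entirely missing after bfill because nothing follows it, and with limit=1 only the first of the two consecutive gaps in column 2 is filled.
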

# Modify the original data in place
frame.fillna(1, inplace=True)
frame
#      0    1    2    3
# 0  1.0  2.0  3.0  1.0
# 1  4.0  7.0  1.0  3.0
# 2  1.0  1.0  1.0  1.0

Reference: https://www.pypandas.cn/docs/user_guide/missing_data.html#values-considered-missing

