PyPackage01---Pandas06_Taking subsets (subset) - CFANZ编程社区

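Subsetting falls roughly into two categories: plain row and column selection (by index), and filtering out the subset that meets a condition. Plain row/column subsetting is covered first, then conditional filtering.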

import pandas as pd
x = pd.DataFrame({'x1':[1,2,3],'x2':[4,5,6],'x3':[7,8,9]})

1.1 Selecting a single column

Note that the type of the returned object depends on how the column is selected.

## Returns a Series
print(x.x1);type(x.x1)

0    1
1    2
2    3
Name: x1, dtype: int64

pandas.core.series.Series

## Also returns a Series
print(x["x1"]);type(x["x1"])

0    1
1    2
2    3
Name: x1, dtype: int64

pandas.core.series.Series

## Only this form (a list of column names) returns a DataFrame
print(x[["x1"]]);type(x[["x1"]])

   x1
0   1
1   2
2   3

pandas.core.frame.DataFrame
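The difference between the two return types can be checked from their shapes. The snippet below is a minimal sketch that redefines x so it runs on its own; `to_frame()` is shown as one way to turn the Series back into a one-column DataFrame.

```python
import pandas as pd

x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'x3': [7, 8, 9]})

s = x["x1"]      # Series: one-dimensional
df = x[["x1"]]   # DataFrame: two-dimensional, with a single column

print(s.shape, df.shape)

# Converting the Series gives the same one-column DataFrame
print(s.to_frame().equals(df))
```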

## By column name
print(x[["x1","x2"]])

# Using loc
x.loc[:,["x1","x2"]]

# Using iloc
x.iloc[:,[0,1]]
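The three spellings above should select the same two columns. A quick sketch to check this, with x redefined so the snippet is self-contained:

```python
import pandas as pd

x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'x3': [7, 8, 9]})

by_name = x[["x1", "x2"]]          # list of column names
by_loc = x.loc[:, ["x1", "x2"]]    # label-based, all rows
by_iloc = x.iloc[:, [0, 1]]        # position-based, all rows

print(by_name.equals(by_loc))
print(by_name.equals(by_iloc))
```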

## As with columns, a single row comes back as a Series
print(x.loc[0]);type(x.loc[0])

x1    1
x2    4
x3    7
Name: 0, dtype: int64

pandas.core.series.Series

## Written this way (label wrapped in a list), a single row comes back as a DataFrame
print(x.loc[[0]]);type(x.loc[[0]])

   x1  x2  x3
0   1   4   7

pandas.core.frame.DataFrame

## Selecting multiple rows returns a DataFrame
print(x.loc[[0,2]]);type(x.loc[[0,2]])

   x1  x2  x3
0   1   4   7
2   3   6   9

pandas.core.frame.DataFrame

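To select rows and columns at the same time, there are two main indexers: loc and iloc. The difference is that loc selects by label (row index labels and column names), whereas iloc selects by integer position, counting from 0.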

x.loc[[0,2],['x1','x2']]

## To select all rows, use : in place of the row labels
x.iloc[:,[1,2]]
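With the default integer index, labels and positions coincide, so loc and iloc can look interchangeable. The sketch below uses a hypothetical string index (a, b, c, not part of the original example) to make the difference visible.

```python
import pandas as pd

# Same data as x, but with a hypothetical string index so labels and positions differ
y = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'x3': [7, 8, 9]},
                 index=['a', 'b', 'c'])

by_label = y.loc[['a', 'c'], ['x1', 'x2']]   # row labels and column names
by_position = y.iloc[[0, 2], [0, 1]]         # integer positions, counting from 0

print(by_label.equals(by_position))

# loc slices include the end label; iloc slices exclude the end position
print(y.loc['a':'b'].shape)
print(y.iloc[0:1].shape)
```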

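Conditional filtering picks out the data that satisfies a given condition. It is not clear whether pandas offers a function directly comparable to subset in R.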

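## Operator precedence matters: "&" binds more tightly than comparisons such as ">" and "<", so each comparison must be wrapped in () to change the evaluation order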
x[(x.x2>4) & (x.x3<9)]

	x1	x2	x3
1	2	5	8

x[x.x1.isin([2])]

	x1	x2	x3
1	2	5	8
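Each comparison above produces a boolean Series with one value per row, and indexing with it keeps the rows marked True. A small sketch of the same idea, adding the element-wise OR (`|`) and negation (`~`) operators, which were not used in the examples above:

```python
import pandas as pd

x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'x3': [7, 8, 9]})

# A boolean Series, one value per row
mask = (x.x2 > 4) & (x.x3 < 9)
print(mask.tolist())

either = x[(x.x1 == 1) | (x.x3 == 9)]   # "|" is element-wise OR
not_two = x[~x.x1.isin([2])]            # "~" negates a mask

print(either.index.tolist())
print(not_two.index.tolist())
```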

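The syntax of where is a little odd; it works more like ifelse: values where the condition holds are kept, and all others are replaced by other.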

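# Signature from an older pandas release; try_cast and raise_on_error no longer exist in pandas 2.x
x.where(cond, other=nan, inplace=False, axis=None, level=None, try_cast=False, raise_on_error=True)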

x.where(x.x1 == 2,-x)

x.where(cond = (lambda x:x>2),other=(lambda x:x+10))
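To make the ifelse analogy concrete, the sketch below runs the first where call, shows what happens when other is omitted, and compares the result with `numpy.where`. The numpy comparison is only an illustration of the analogy, not something the original post uses.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'x3': [7, 8, 9]})

# Rows where x1 == 2 are kept; every other row takes its values from -x
kept = x.where(x.x1 == 2, -x)
print(kept)

# Without other, cells that fail the condition become NaN
masked = x.where(x > 2)
print(masked)

# Rough ifelse-style equivalent: the row condition is broadcast across columns
cond = (x.x1 == 2).to_numpy()[:, None]
alt = pd.DataFrame(np.where(cond, x, -x), index=x.index, columns=x.columns)
print((kept.to_numpy() == alt.to_numpy()).all())
```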

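query comes closest to subset in R, but it has no select feature (that is, optionally keeping only the columns you want in the result). Equality must be written as the comparison operator "==".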

## The condition is written as a string, and logical AND is written as and
x.query("x2>4 and x3<9")

	x1	x2	x3
1	2	5	8
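The missing select step can be approximated by indexing the query result with a list of columns, or by doing both steps in one loc call. A minimal sketch follows; the variable `threshold` is made up for illustration and shows how `@` refers to a local Python variable inside the query string.

```python
import pandas as pd

x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'x3': [7, 8, 9]})

# Filter rows with query, then keep only the wanted columns
sub = x.query("x2 > 4 and x3 < 9")[["x1", "x3"]]
print(sub)

# Same idea in a single loc call with a boolean mask
sub_loc = x.loc[(x.x2 > 4) & (x.x3 < 9), ["x1", "x3"]]
print(sub.equals(sub_loc))

# Local Python variables can be referenced inside the string with @
threshold = 4  # hypothetical cutoff, for illustration only
by_var = x.query("x2 > @threshold and x1 == 2")
print(by_var)
```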

filter selects data by column name (column) or row label (index). It looks only at the labels, not at the values in the cells.

DataFrame.filter(items=None, like=None, regex=None, axis=None)

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html

import pandas as pd
x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'y1': [7, 8, 9]})

# Works best when combined with axis
x.filter(items=['x1','x2'],axis = 1)

x.filter(items=[0,2],axis = 0)

# Match columns whose names start with x
x.filter(regex='^x',axis = 1)

# Match rows whose labels start with 0 or 2
x.filter(regex='^[02]',axis = 0)

# Match columns whose names contain y
# Similar to LIKE "%y%" in SQL
x.filter(like='y',axis = 1)
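The calls above can be compared side by side by printing the labels each one keeps. This is a small self-contained sketch using the same x as in this section:

```python
import pandas as pd

x = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'y1': [7, 8, 9]})

by_items = x.filter(items=['x1', 'x2'], axis=1)   # exact column names
by_regex = x.filter(regex='^x', axis=1)           # names starting with x
by_like = x.filter(like='y', axis=1)              # names containing y

print(by_items.columns.tolist())
print(by_regex.columns.tolist())
print(by_like.columns.tolist())

# With axis=0 the same matching is applied to the row labels
rows = x.filter(items=[0, 2], axis=0)
print(rows.index.tolist())
```

Only one of items, like and regex can be passed in a single call.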

2018-10-13, Zidong Venture Park (紫东创业园), Qixia District, Nanjing