Python Library Usage Notes - CFANZ Programming Community

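I. Built-in Python methods

1. Get a dictionary's (key, value) pairs as an iterable

In Python 3 this returns a view object of (key, value) tuples rather than a list; wrap it in list() if a real list is needed.

Core code:

dict.items()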

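2. Show a progress bar (tqdm library)

Core code: tqdm(iterator)

Complete code:

from tqdm import tqdm
import time

# Wrap any iterable in tqdm() to display progress while looping
for i in tqdm(range(100)):
    time.sleep(0.5)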

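3. What x // y does

Floor division: x // y divides x by y and rounds the result down toward negative infinity. For a non-negative quotient this matches int(x / y), but for a negative quotient the two differ, because int() truncates toward zero.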
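
A minimal sketch of the difference between floor division and truncation. The operand pairs are arbitrary sample values chosen for illustration.

```python
# Floor division rounds toward negative infinity; int() truncates toward zero.
pairs = [(7, 2), (-7, 2), (7, -2), (7.5, 2)]   # arbitrary sample operands

results = []
for x, y in pairs:
    floor_q = x // y        # floor division
    trunc_q = int(x / y)    # true division, then truncation
    results.append((x, y, floor_q, trunc_q))
    print(f"{x} // {y} = {floor_q}, int({x} / {y}) = {trunc_q}")
```

The two results agree for (7, 2) and differ by one for the pairs with a negative quotient. With a float operand, // still floors but returns a float.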

4. What enumerate does

Returns an enumerate object that yields (index, item) tuples as it is iterated.
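
A short illustration of the (index, item) pairs that enumerate produces. The list `fruits` is made-up sample data.

```python
fruits = ['apple', 'banana', 'cherry']   # hypothetical sample data

# Each iteration yields an (index, item) tuple, starting from index 0
for index, item in enumerate(fruits):
    print(index, item)

# The optional start argument changes the first index
numbered = list(enumerate(fruits, start=1))
print(numbered)
```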

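5. Using zip

Packs the corresponding elements of the given iterables into tuples. In Python 3, zip returns a lazy iterator over these tuples rather than a list.

Complete code:

us = ['a', 'b', 'c']
res = zip(us, range(3, len(us) + 3))

# Prints ('a', 3), ('b', 4), ('c', 5)
for item in res:
    print(item)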

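6. Notes on dict

Keys in a dictionary must be hashable (in practice, immutable types such as str, int, or tuple), while values can be of any type.

Complete code:

feat_dict = {}
us = ['a', 'b', 'c']
feat_dict['test'] = dict(zip(us, range(3, len(us) + 3)))  # the value is another dict
feat_dict['test2'] = 1                                    # the value is an int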

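7. What the single star (*) and double star (**) mean in parameter passing

def foo(*param): ...

def foo(**param): ...

Both forms let a Python function accept an arbitrary number of arguments. The single star collects the extra positional arguments into a tuple; the double star (**) collects the extra keyword arguments into a dict.

Complete code:

def fooTup(param1, *param2):
    print(param1)
    print(param2)

def fooDict(param1, **param2):
    print(param1)
    print(param2)

fooTup(1, *(2, 3, 4, 5))                              # prints 1, then (2, 3, 4, 5)
fooDict(1, **dict(zip(('a', 'b', 'c'), (1, 2, 3))))   # prints 1, then {'a': 1, 'b': 2, 'c': 3}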

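8. Vectorized operation: add 1 to all values of the 2D array arr at the positions temp along the second dimension

arr[i][temp] += 1

Here arr is taken to be a NumPy array, i a row index, and temp an array of column indices, so the whole update runs without an explicit loop.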
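
A minimal sketch of this update, assuming arr is a NumPy array and temp is an integer index array (the shapes and values below are made up). It also shows one caveat: with repeated indices, += increments each position only once, while np.add.at counts every occurrence.

```python
import numpy as np

arr = np.zeros((2, 5), dtype=int)   # hypothetical 2-D array
i = 1                               # row to update
temp = np.array([0, 2, 2, 4])       # column positions; note the duplicate 2

arr[i][temp] += 1                   # equivalent to arr[i, temp] += 1
print(arr)

# np.add.at accumulates once per occurrence of an index
counts = np.zeros(5, dtype=int)
np.add.at(counts, temp, 1)
print(counts)
```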

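9. Get an object's attribute value

Use getattr, for example:

class A(object):
    bar = 1

a = A()
print(getattr(a, "bar", 3))   # 1: the attribute exists
print(getattr(a, "bar2", 3))  # 3: the third argument is the default returned when the attribute is missing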

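II. pandas methods

1. Get all values of the user column from the train CSV

Core code:

train = pd.read_csv(xx)
train['user'].values

Complete code:

import pandas as pd

cols = ['user', 'item', 'rating', 'timestamp']
# Tab-separated file without a header row, so the column names are passed explicitly
train = pd.read_csv('data/ua.txt', delimiter='\t', names=cols)
users = train['user'].values   # NumPy array of the column's values
print(users)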

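2. Create a 4x3 matrix of random numbers with columns named "A", "B", "C", then drop column "A" from it in place

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])
df.drop(['A'], axis=1, inplace=True)   # axis=1 drops a column; inplace=True modifies df itself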

3. Simple concatenation along a chosen axis

Use pd.concat().

Complete code:

train = pd.DataFrame(np.random.rand(2, 3), columns=['user', 'item', 'rating'])
test = pd.DataFrame(np.random.rand(5, 3), columns=['user', 'item', 'rating'])
total = pd.concat([train, test], axis=0)  # stack rows; axis=1 would join columns side by side

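4. Difference between shifted data and the original

Use diff, which subtracts a shifted copy of the data from the original to give the change between consecutive rows.

df.diff() is equivalent to df - df.shift(): each row minus the previous row.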
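
The equivalence can be checked directly. This is a small sketch on made-up data; the column name `value` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 4, 9, 16]})   # hypothetical sample data

by_diff = df.diff()           # each row minus the previous row; the first row is NaN
by_shift = df - df.shift()    # the same computation written out by hand

print(by_diff)
print(by_diff.equals(by_shift))
```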

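5. dropna

Removes rows that contain NA (missing) values. With inplace=True the operation is applied directly to the original object and returns None; with inplace=False (the default) it returns a new object and leaves the original unchanged.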
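
A brief sketch of the two modes on a made-up DataFrame (the column names `a` and `b` are hypothetical).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})  # hypothetical data

cleaned = df.dropna()              # returns a new object; df itself is unchanged
print(len(df), len(cleaned))

result = df.dropna(inplace=True)   # modifies df directly and returns None
print(result, len(df))
```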

6. Locate rows by integer position

mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
print(df.iloc[0])   # first row, selected by position

Output:

a    1
b    2
c    3
d    4
Name: 0, dtype: int64

7. Add the values at the same positions of two tables

se1 = pd.DataFrame([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
se2 = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], two=[np.nan, 2, np.nan, 2]), index=['a', 'b', 'c', 'd'])
print(se1)
print(se2)
# fill_value=0 replaces a value missing on one side only; if both sides are missing the result stays NaN
se3 = se1.add(se2, fill_value=0)
print(se3)

Output:

   one
a  1.0
b  1.0
c  1.0
d  NaN

   one  two
a  1.0  NaN
b  NaN  2.0
c  1.0  NaN
d  NaN  2.0

   one  two
a  2.0  NaN
b  1.0  2.0
c  2.0  NaN
d  NaN  2.0

8. Generate a range of dates

print(pd.date_range(start='1/1/2018', periods=8, freq='MS'))   # 'MS' = month start

Output:

DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
               '2018-05-01', '2018-06-01', '2018-07-01', '2018-08-01'],
              dtype='datetime64[ns]', freq='MS')

9. reset_index

The old index is added as a column, and a new sequential integer index takes its place.

Example:

df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
print(df)
print(df.reset_index())

Output:

         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN

    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

10. Resampling a time series

Use resample, a convenient method for resampling regular time-series data and converting its frequency.

Example:

# 'T' means minute frequency; pandas 2.2 and later deprecate this alias in favor of 'min'
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
print(series)
print(series.resample('3T').sum())   # sum each 3-minute bin

Output:

2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

11. Grouping with groupby and aggregating with agg

(1) Grouping with groupby

df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
df_gb = df.groupby('Country')  # group by Country

# Iterating over the groupby object yields (group key, sub-DataFrame) pairs
for index, data in df_gb:
    print(index)
    print(data)

Output:

America
   Country  Income  Age
4  America   40000  250
China
  Country  Income   Age
0   China   10000  5000
1   China   10000  4321
6   China    8000  4500
India
  Country  Income   Age
2   India    5000  1234
3   India    5002  4010
7   India    5000  4321
Japan
  Country  Income  Age
5   Japan   50000  250

(2) Aggregating with agg: aggregate the grouped data

df_agg = df.groupby('Country').agg({'Age': 'max', 'Income': 'mean'})  # a different aggregation for each column
print(df_agg)

Output:

          Age        Income
Country
America   250  40000.000000
China    5000   9333.333333
India    4321   5000.666667
Japan     250  50000.000000

12. Find the most frequent values

Use mode, for example:

df = pd.DataFrame([('mammal', 2, 4),
                   ('mammal', 4, 3),
                   ('arthropod', 8, 2),
                   ('bird', 2, 4)],
                  index=('falcon', 'horse', 'spider', 'ostrich'),
                  columns=('species', 'legs', 'wings'))
print(df)
print(df.mode())   # most frequent value of each column

Output:

           species  legs  wings
falcon      mammal     2      4
horse       mammal     4      3
spider   arthropod     8      2
ostrich       bird     2      4

  species  legs  wings
0  mammal     2      4

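13. Convert a categorical variable into indicator (dummy) variables

Use get_dummies, for example:

s = pd.Series(list('abca'))
# The prefix and the category are joined with '_', so prefix='tt_' gives names like tt__a
df_term = pd.get_dummies(s, prefix='tt_')
print(df_term)

Output (pandas 2.0 and later print True/False instead of 1/0 unless dtype=int is passed):

   tt__a  tt__b  tt__c
0      1      0      0
1      0      1      0
2      0      0      1
3      1      0      0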

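III. NumPy methods

1. Repeat each element of an array N times

With no axis given, the input is flattened first and each element is then repeated N times in a row, so a sorted input stays sorted in the result.

Core code:

np.repeat(a, N)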
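
A small sketch of what np.repeat produces, with np.tile shown for contrast. The input array and N are made-up values.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # hypothetical input
N = 2

repeated = np.repeat(a, N)       # flattened, each element repeated N times
tiled = np.tile(a.ravel(), N)    # for contrast: the whole sequence repeated N times

print(repeated)
print(tiled)
```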

2. Using np.where

(1) Get the indices where a condition holds

np.where(condition)

(2) Take a where the condition holds, and b elsewhere

np.where(condition, a, b)
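
Both forms in one short sketch. The array `x` and the threshold 3 are arbitrary illustration values.

```python
import numpy as np

x = np.array([3, 8, 1, 9, 4])     # hypothetical data

idx = np.where(x > 3)             # a tuple of index arrays, one per dimension
print(idx[0])

clipped = np.where(x > 3, x, 0)   # keep values above 3, replace the rest with 0
print(clipped)
```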

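3. What np.empty returns

It returns an uninitialized array: the contents are whatever happened to be in memory (so they can look random), and every element should be assigned before it is read.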
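
A minimal sketch of the usual pattern: allocate with np.empty, then fill every element before using the array. The shape and fill values are made up.

```python
import numpy as np

buf = np.empty((2, 3))         # contents are arbitrary at this point
for row in range(buf.shape[0]):
    buf[row] = row + 1         # assign every element before reading it
print(buf)
```

np.empty skips initialization, which is why it is only safe when the code overwrites the whole array afterwards; np.zeros is the safer choice otherwise.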

4. Random permutation of an array

Complete code:

perm = np.random.permutation(10)   # the integers 0 to 9 in random order
print(perm)

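5. Get the state of the np.random generator

np.random.get_state() is usually paired with np.random.set_state() to put the generator back into the same state, so that later random operations repeat exactly.

Complete code:

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9]

state = np.random.get_state()   # save the generator state
np.random.shuffle(x)
print(x)

np.random.set_state(state)      # restore it, so y is shuffled the same way as x
np.random.shuffle(y)
print(y)

Both print calls output the same order, for example:

[4, 7, 6, 8, 3, 9, 1, 2, 5]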

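6. Join two arrays along an axis into a new array in NumPy

Use np.concatenate. It plays the same role as pd.concat, except that the latter works on pandas objects such as DataFrame.

Complete code:

a = np.array([[1, 2], [3, 4]])
b = np.array([[6, 7]])
c = np.concatenate((a, b), axis=0)   # shape (3, 2): b is appended as a new row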

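7. What np.asfarray() does:

Converts the input, including nested lists and numeric strings, into an array of floats. np.asfarray was removed in NumPy 2.0; np.asarray(a, dtype=float) is the replacement there.

asfarray = as float array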
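
A minimal sketch of the same conversion written with np.asarray, which works on both old and new NumPy versions. The nested list `raw` is made-up sample data.

```python
import numpy as np

raw = [['1', '2.5'], ['3', '4']]          # hypothetical nested list of numeric strings

# Equivalent to np.asfarray(raw) on NumPy versions that still provide it
as_float = np.asarray(raw, dtype=float)
print(as_float)
print(as_float.dtype)
```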

8. Cumulative sum

Use cumsum:

a = np.array([[1, 2, 3], [4, 5, 6]])
print(np.cumsum(a))   # no axis given, so the array is flattened first

Output: [ 1  3  6 10 15 21]

9. Get the indices that would sort the data, leaving the data unchanged

Use np.argsort:

import numpy as np

number = [1, 3, 5, 7, 9, 2, 4, 6, 8]
print(np.argsort(number))

Output: [0 5 1 6 2 7 3 8 4]

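10. Element-wise (vectorized) operations in NumPy (important)

If x and y are arrays of the same length, x + y adds the values position by position.

Likewise, np.maximum takes the larger value at each position, and a scalar argument is broadcast against every element. For example:

res = np.maximum(10, [12, 3, 14, 4])
print(res)

Output: [12 10 14 10]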

12. Evenly spaced values with np.arange()

res = np.arange(0, 10, 2)   # start 0, stop 10 (exclusive), step 2

Result: [0 2 4 6 8]

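13. Generate grid-point coordinate matrices with np.meshgrid()

X = np.arange(0, 4, 2)
Y = np.arange(10, 14, 2)
res = np.meshgrid(X, Y)
print(res)

Output:

[array([[0, 2],
       [0, 2]]), array([[10, 10],
       [12, 12]])]

The return value splits the x coordinates and the y coordinates of the grid points into two separate arrays.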

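14. Flattening with ravel() and flatten()

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(X.ravel())

Output:

[1 2 3 4 5 6 7 8]

Difference between ravel() and flatten():

The two differ mainly in memory use: ravel() returns a view of the original array whenever possible, whereas flatten() always allocates new memory and returns a copy.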
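
The view-versus-copy behavior can be observed with np.shares_memory. This is a small sketch on a contiguous array, where ravel() can return a view.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

r = X.ravel()      # a view here, because X is contiguous
f = X.flatten()    # always a copy

print(np.shares_memory(X, r))   # the view shares X's buffer
print(np.shares_memory(X, f))   # the copy does not

r[0] = 100         # writes through to X
f[1] = -1          # does not affect X
print(X[0])
```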

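15. Add a dimension to a NumPy array

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])[None]   # indexing with None inserts a new leading axis
print(X)
print(X.shape)

Output:

[[[1 2 3 4]
  [5 6 7 8]]]
(1, 2, 4)

As shown above, indexing with [None] turns the original array of shape (2, 4) into one of shape (1, 2, 4).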

IV. SciPy methods

1. Build a CSR (compressed sparse row) sparse matrix

Core code:

sparse.csr_matrix((data, (row_ind, col_ind)), shape)

Complete code:

import numpy as np
from scipy import sparse

row_ind = np.array([0, 0, 1, 1, 2])
col_ind = np.array([0, 2, 0, 1, 1])
data = np.arange(1, 6)
# data[k] is stored at position (row_ind[k], col_ind[k])
matrix = sparse.csr_matrix((data, (row_ind, col_ind)), shape=(3, 3))
print(matrix.todense())

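V. TensorFlow methods

1. Using reduce_sum

See also: "tf.reduce_sum和axis关系详解" by 裴大帅2021 (Sina Blog)

tf.reduce_sum(data, axis, keep_dims)

Complete code:

import tensorflow as tf   # TensorFlow 1.x API (tf.Session)

a = tf.constant([[1, 2, 3], [4, 5, 6], [6, 6, 6]])
b1 = tf.reduce_sum(a)                     # sum of all elements: 39
b2 = tf.reduce_sum(a, 0)                  # sum down each column: [11 13 15]
b3 = tf.reduce_sum(a, 1)                  # sum across each row: [6 15 18]
b4 = tf.reduce_sum(a, 1, keep_dims=True)  # keeps the reduced axis: [[6] [15] [18]]

with tf.Session() as sess:
    print(sess.run([b1, b2, b3, b4]))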

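2. Using truncated_normal

Draws random values from a truncated normal distribution: any sample that lies more than two standard deviations from the mean is discarded and redrawn. Compared with the ordinary normal generator (tf.random_normal), the values from this function therefore never deviate from the mean by more than two standard deviations, whereas the ordinary generator can produce such values.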
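
The redraw rule can be illustrated outside TensorFlow. The following is a NumPy sketch of the idea only, not TensorFlow's implementation; the helper name `truncated_normal_sketch` and its parameters are hypothetical.

```python
import numpy as np

def truncated_normal_sketch(shape, mean=0.0, stddev=1.0, seed=0):
    """Hypothetical helper: redraw samples further than 2*stddev from the mean."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mean, stddev, size=shape)
    outside = np.abs(samples - mean) > 2 * stddev
    while outside.any():
        # Redraw only the rejected positions
        samples[outside] = rng.normal(mean, stddev, size=int(outside.sum()))
        outside = np.abs(samples - mean) > 2 * stddev
    return samples

values = truncated_normal_sketch((1000,), mean=1.0, stddev=0.5)
print(values.min(), values.max())   # both lie within mean ± 2 * stddev
```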

3. Converting between arrays and tensors

Use convert_to_tensor. Complete code:

A = list([[1, 2, 3], [2, 3, 4]])
B = tf.convert_to_tensor(A)

with tf.Session() as sess:
    print(type(A))       # a Python list
    print(type(B))       # a Tensor
    print(sess.run(B))
    C = B.eval()         # must be called inside a session
    print(type(C))       # a NumPy ndarray
    print(C)

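4. Get the elements of params addressed by indices

Use tf.gather_nd(params, indices, ...). Unlike tf.gather, whose indices select along a single axis, each innermost vector of indices in tf.gather_nd is a multi-dimensional coordinate into params.

Complete code:

with tf.Session() as sess:
    thirdWeight = tf.truncated_normal([3, 2, 3])
    # Each row [i, j, k] addresses the single element thirdWeight[i][j][k]
    vectorRight = tf.convert_to_tensor([[0, 0, 0], [0, 0, 1], [0, 0, 2]])
    weightLeft = tf.gather_nd(thirdWeight, vectorRight)
    print(sess.run([thirdWeight, weightLeft]))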

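5. tf.squeeze()

This function removes every axis of size 1 from a tensor's shape.

Complete code:

k = tf.constant([[[[[[1]],[[2]],[[3]]]],[[[[4]],[[5]],[[6]]]]]])

with tf.Session() as sess:
    print(sess.run(tf.shape(k)))     # [1 2 1 3 1 1]
    print(sess.run(tf.squeeze(k)))   # shape (2, 3): [[1 2 3] [4 5 6]]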

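6. Look up the vectors at given rows of an embedding matrix

Use tf.nn.embedding_lookup, which returns the rows of the embedding matrix selected by the given indices.

Complete code:

c = tf.Variable(tf.random_normal([5, 3]))
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    # Prints c, followed by rows 3 and 4 of c
    print(sess.run([c, tf.nn.embedding_lookup(c, [3, 4])]))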

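7. Notes on tf.multiply(a, b):

If a has shape [N, M] and b has shape [N, 1], the result has shape [N, M]: in each of the N rows, the M values of a are multiplied by that row's single value of b (broadcasting). If b has shape [N, M], the M values of a are multiplied by the corresponding M values of b. The last dimension of b must be either 1 or M; with any other size the multiplication fails.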
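
The same broadcasting rule can be tried out in NumPy, whose np.multiply follows the same shape rules; this is a stand-in sketch in NumPy rather than TensorFlow code, and the arrays are made-up values.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])        # shape [N, M] = [2, 3]
b_col = np.array([[10], [100]])             # shape [N, 1]
b_full = np.array([[1, 0, 1], [0, 1, 0]])   # shape [N, M]

by_col = np.multiply(a, b_col)     # each row scaled by its own factor
by_full = np.multiply(a, b_full)   # element by element
print(by_col)
print(by_full)

try:
    np.multiply(a, np.ones((2, 2)))   # last dimension 2 is neither 1 nor M
except ValueError as err:
    print("cannot broadcast:", err)
```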

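8. Notes on tf.Graph():

It creates a dataflow graph and does not itself run any computation, which distinguishes it from tf.Session(). If the graph that tensors and operations belong to is not specified explicitly, they belong to the default graph.

For details, see: http://blog.sina.com.cn/s/blog_628cc2b70102yonj.html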

9. Ready-made L1 and L2 regularization

tf.contrib.layers.l1_regularizer()
tf.contrib.layers.l2_regularizer()

Complete code:

# Negative entries match the worked arithmetic in the comments below
weight = tf.constant([[1.0, -2.0], [-3.0, 4.0]])

with tf.Session() as sess:
    # Output is (|1| + |-2| + |-3| + |4|) * 0.5 = 5
    print(sess.run(tf.contrib.layers.l1_regularizer(0.5)(weight)))
    # Output is (1² + (-2)² + (-3)² + 4²) / 2 * 0.5 = 7.5
    # TensorFlow divides the L2 regularization loss by 2 so that its derivative is simpler
    print(sess.run(tf.contrib.layers.l2_regularizer(0.5)(weight)))

VI. scikit-learn methods

1. K-fold cross-validation

Use sklearn.model_selection.StratifiedKFold, which keeps the class proportions in each training and test split as close as possible to those of the original dataset.

Complete code:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 1, 1, 2, 2, 2])
skf = StratifiedKFold(n_splits=3)
print(skf)

for train_index, test_index in skf.split(X, y):
    print("train index:", train_index, ",test index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

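VII. matplotlib methods

1. Plot several data lines in one figure

import matplotlib.pyplot as plt

# timeseries, rolmean and rolstd are assumed to exist already,
# e.g. the raw series and its rolling mean and rolling standard deviation
fig = plt.figure()
fig.add_subplot()
plt.plot(timeseries, color='blue', label='Original')
plt.plot(rolmean, color='red', label='rolling mean')
plt.plot(rolstd, color='black', label='rolling std')
plt.legend(loc='best')   # let matplotlib pick the legend position
plt.title('Rolling Mean & Standard Deviation')
plt.show()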