Data Preprocessing (2): One-Hot Encoding, using pandas.get_dummies and sklearn.feature_extraction.DictVectorizer

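Encoding a discrete feature falls into one of two cases:

1. The values of the feature have no ordering, e.g. color: [red, blue]. In that case, use one-hot encoding.

2. The values of the feature are ordered, e.g. size: [X, XL, XXL]. In that case, map them to numbers: {X: 1, XL: 2, XXL: 3}.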
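The two cases can be sketched on the toy values from the list above. This is only a minimal illustration, assuming pandas is installed; the names `toy` and `size_order` are made up for this example, and `dtype=int` is passed so the dummy columns show 0/1 on recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data built from the example values in the two cases
toy = pd.DataFrame({"color": ["red", "blue", "red"],
                    "size": ["X", "XL", "XXL"]})

# Case 1: values have no ordering -> one-hot encode
color_onehot = pd.get_dummies(toy["color"], prefix="color", dtype=int)

# Case 2: values are ordered -> map to integers that preserve the order
size_order = {"X": 1, "XL": 2, "XXL": 3}
size_encoded = toy["size"].map(size_order)

print(color_onehot)
print(size_encoded.tolist())
```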

In [90]:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
np.set_printoptions(precision=4)

…

In [91]:

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
df

Out[91]:

	color	size	prize	class label
0	green	M	10.1	class1
1	red	L	13.5	class2
2	blue	XL	15.3	class1

…

In [92]:

# 'size' is ordered, so map it to integers instead of one-hot encoding it
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].map(size_mapping)
df

Out[92]:

	color	size	prize	class label
0	green	1	10.1	class1
1	red	2	13.5	class2
2	blue	3	15.3	class1

…

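In [93]:

# -----------------------------------------------
# Process with pd.get_dummies()
# dtype=int keeps the 0/1 output shown below; pandas 2.0 and later default to boolean columns
pd.get_dummies(df, dtype=int)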

Out[93]:

	size	prize	color_blue	color_green	color_red	class label_class1	class label_class2
0	1	10.1	0	1	0	1	0
1	2	13.5	0	0	1	0	1
2	3	15.3	1	0	0	1	0
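Note that the call above also one-hot encodes `class label`, because `pd.get_dummies` converts every string column by default. If only the features should be encoded, the `columns` argument can restrict the conversion. A minimal sketch, rebuilding the same frame (with `size` already mapped) so that it runs on its own; `dtype=int` is again an assumption for recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    [["green", 1, 10.1, "class1"],
     ["red", 2, 13.5, "class2"],
     ["blue", 3, 15.3, "class1"]],
    columns=["color", "size", "prize", "class label"],
)

# Encode only 'color'; 'class label' is left as a plain string column
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```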

…

In [94]:

df

Out[94]:

	color	size	prize	class label
0	green	1	10.1	class1
1	red	2	13.5	class2
2	blue	3	15.3	class1

…

In [95]:

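# -----------------------------------------------
# Process with sklearn.feature_extraction.DictVectorizer
# df.ix was removed in pandas 1.0; use .loc for label lookup and .iloc for position
feature_list = []
label_list = []
for row in df.index:
    # The last column is the class label
    label_list.append(df.loc[row].iloc[-1])
    # Every other column becomes one entry in this row's feature dict
    rowDict = {}
    for i in range(len(df.columns) - 1):
        rowDict[df.columns[i]] = df.loc[row].iloc[i]
    feature_list.append(rowDict)
feature_list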

Out[95]:

[{'color': 'green', 'prize': 10.1, 'size': 1},
 {'color': 'red', 'prize': 13.5, 'size': 2},
 {'color': 'blue', 'prize': 15.300000000000001, 'size': 3}]

…

In [96]:

label_list

Out[96]:

['class1', 'class2', 'class1']

…

In [97]:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
# DictVectorizer.fit_transform() takes a list of dicts
dummy_x = vec.fit_transform(feature_list).toarray()
dummy_x

Out[97]:

array([[  0. ,   1. ,   0. ,  10.1,   1. ],
       [  0. ,   0. ,   1. ,  13.5,   2. ],
       [  1. ,   0. ,   0. ,  15.3,   3. ]])

…

In [98]:

from sklearn import preprocessing
label_bin = preprocessing.LabelBinarizer()
# preprocessing.LabelBinarizer.fit_transform() takes a list
dummy_y = label_bin.fit_transform(label_list)
dummy_y

Out[98]:

array([[0],
       [1],
       [0]])

…

In [99]:

# Test the behavior when there are more than 2 label classes
# Assign through .loc; chained indexing (df['class label'][2] = ...) triggers SettingWithCopyWarning
df.loc[2, 'class label'] = 'class3'
df

Out[99]:

	color	size	prize	class label
0	green	1	10.1	class1
1	red	2	13.5	class2
2	blue	3	15.3	class3

…

In [100]:

# df.ix was removed in pandas 1.0; use .loc for label lookup and .iloc for position
feature_list = []
label_list = []
for row in df.index:
    label_list.append(df.loc[row].iloc[-1])
    rowDict = {}
    for i in range(len(df.columns) - 1):
        rowDict[df.columns[i]] = df.loc[row].iloc[i]
    feature_list.append(rowDict)
dummy_y = label_bin.fit_transform(label_list)
dummy_y

Out[100]:

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
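The difference between Out[98] and Out[100] comes from how `LabelBinarizer` treats the number of classes: with two classes it returns a single 0/1 column, and with three or more it returns one column per class. A small self-contained sketch that also shows `classes_` and `inverse_transform`, which recover the original labels:

```python
from sklearn.preprocessing import LabelBinarizer

label_bin = LabelBinarizer()

# Two classes -> a single column of 0/1
two = label_bin.fit_transform(["class1", "class2", "class1"])
print(two.shape)

# Three classes -> one column per class
three = label_bin.fit_transform(["class1", "class2", "class3"])
print(three.shape)

print(label_bin.classes_)                  # column order of the encoding
print(label_bin.inverse_transform(three))  # back to the original labels
```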

…

In [ ]:

# Conclusion: the two approaches give essentially the same result, but pd.get_dummies is more convenient to use

…
