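There are two cases when encoding a discrete (categorical) feature:
1. The values have no inherent order, e.g. color: [red, blue]. Use one-hot encoding.
2. The values have a meaningful order, e.g. size: [X, XL, XXL]. Map them to integers, e.g. {X: 1, XL: 2, XXL: 3}.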
In [90]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
np.set_printoptions(precision=4)
In [91]:
df = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
df
Out[91]:
  | color | size | prize | class label |
0 | green | M | 10.1 | class1 |
1 | red | L | 13.5 | class2 |
2 | blue | XL | 15.3 | class1 |
In [92]:
size_mapping = {
'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
df
Out[92]:
  | color | size | prize | class label |
0 | green | 1 | 10.1 | class1 |
1 | red | 2 | 13.5 | class2 |
2 | blue | 3 | 15.3 | class1 |
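The mapping step can be reversed, which is handy for reporting results in the original labels. The following is a minimal sketch, assuming the same `size_mapping` as above; `df_demo`, `inv_size_mapping`, and the extra value `'S'` are illustrative names and data that are not part of the original notebook.

```python
import pandas as pd

# Same size values and mapping as in the cell above (df_demo is illustrative)
df_demo = pd.DataFrame({'size': ['M', 'L', 'XL']})
size_mapping = {'XL': 3, 'L': 2, 'M': 1}

encoded = df_demo['size'].map(size_mapping)

# Invert the dictionary to recover the original labels
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

# A value that is missing from the mapping becomes NaN instead of raising
unknown = pd.Series(['S']).map(size_mapping)
```

Because `Series.map` silently produces NaN for unmapped values, it is worth checking for missing values after the mapping when the data may contain sizes outside the dictionary.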
In [93]:
# -----------------------------------------------
# Use pd.get_dummies() to one-hot encode the remaining string columns
pd.get_dummies(df)
Out[93]:
  | size | prize | color_blue | color_green | color_red | class label_class1 | class label_class2 |
0 | 1 | 10.1 | 0 | 1 | 0 | 1 | 0 |
1 | 2 | 13.5 | 0 | 0 | 1 | 0 | 1 |
2 | 3 | 15.3 | 1 | 0 | 0 | 1 | 0 |
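Note that calling `pd.get_dummies` on the whole frame also one-hot encodes the `class label` column. If only the nominal feature should be encoded, the `columns` argument can restrict the operation. This is a hedged sketch on a copy of the toy data (`df_demo` is an illustrative name); `dtype=int` is passed because newer pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Toy data after the size mapping has been applied (df_demo is illustrative)
df_demo = pd.DataFrame(
    [['green', 1, 10.1, 'class1'],
     ['red', 2, 13.5, 'class2'],
     ['blue', 3, 15.3, 'class1']],
    columns=['color', 'size', 'prize', 'class label'])

# Encode only the nominal feature and leave the label column untouched.
# dtype=int requests 0/1 integers for the dummy columns.
encoded = pd.get_dummies(df_demo, columns=['color'], dtype=int)
```

The untouched columns keep their original order and the new `color_*` columns are appended after them.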
In [94]:
df
Out[94]:
  | color | size | prize | class label |
0 | green | 1 | 10.1 | class1 |
1 | red | 2 | 13.5 | class2 |
2 | blue | 3 | 15.3 | class1 |
In [95]:
# -----------------------------------------------
# Use sklearn.feature_extraction.DictVectorizer
# Every column except the last is a feature; the last column is the class label
feature_list = df.iloc[:, :-1].to_dict(orient='records')
label_list = df.iloc[:, -1].tolist()
feature_list
Out[95]:
[{'color': 'green', 'prize': 10.1, 'size': 1},
{'color': 'red', 'prize': 13.5, 'size': 2},
{'color': 'blue', 'prize': 15.300000000000001, 'size': 3}]
In [96]:
label_list
Out[96]:
['class1', 'class2', 'class1']
In [97]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
# DictVectorizer.fit_transform() takes a list of dicts and returns a sparse matrix
dummy_x = vec.fit_transform(feature_list).toarray()
dummy_x
Out[97]:
array([[ 0. , 1. , 0. , 10.1, 1. ],
[ 0. , 0. , 1. , 13.5, 2. ],
[ 1. , 0. , 0. , 15.3, 3. ]])
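The array alone does not show which column belongs to which feature. A small sketch of how the column names could be inspected, assuming the same feature dicts as above: the fitted `DictVectorizer` exposes `feature_names_`, and `sparse=False` is used here only to skip the `.toarray()` call.

```python
from sklearn.feature_extraction import DictVectorizer

feature_list = [
    {'color': 'green', 'prize': 10.1, 'size': 1},
    {'color': 'red', 'prize': 13.5, 'size': 2},
    {'color': 'blue', 'prize': 15.3, 'size': 3},
]

# sparse=False returns a dense NumPy array directly
vec = DictVectorizer(sparse=False)
dummy_x = vec.fit_transform(feature_list)

# feature_names_ lists the generated column names in column order;
# string features are expanded to "name=value", numeric ones keep their name
column_names = vec.feature_names_
```

String-valued features are one-hot encoded, while numeric features such as `prize` and the already mapped `size` are passed through unchanged.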
In [98]:
from sklearn import preprocessing
label_bin = preprocessing.LabelBinarizer()
# preprocessing.LabelBinarizer.fit_transform() takes a list of labels
dummy_y = label_bin.fit_transform(label_list)
dummy_y
Out[98]:
array([[0],
[1],
[0]])
In [99]:
# Test the behavior when there are more than two label classes
# (use .loc for the assignment to avoid chained indexing)
df.loc[2, 'class label'] = 'class3'
df
Out[99]:
  | color | size | prize | class label |
0 | green | 1 | 10.1 | class1 |
1 | red | 2 | 13.5 | class2 |
2 | blue | 3 | 15.3 | class3 |
In [100]:
# Rebuild the feature dicts and the label list from the updated DataFrame
feature_list = df.iloc[:, :-1].to_dict(orient='records')
label_list = df.iloc[:, -1].tolist()
dummy_y = label_bin.fit_transform(label_list)
dummy_y
Out[100]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
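The two outputs above differ in shape: with two classes `LabelBinarizer` returns a single 0/1 column, and with three or more it returns one column per class. The sketch below reproduces both cases side by side and shows how the column order and the original labels could be recovered; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelBinarizer

label_bin = LabelBinarizer()

# Two classes: a single 0/1 column
two_class = label_bin.fit_transform(['class1', 'class2', 'class1'])

# Three classes: one column per class, ordered as in classes_
three_class = label_bin.fit_transform(['class1', 'class2', 'class3'])
class_order = list(label_bin.classes_)

# inverse_transform maps the indicator matrix back to the original labels
recovered = list(label_bin.inverse_transform(three_class))
```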
In [ ]:
# Conclusion: the two approaches give essentially the same result, but pd.get_dummies is more convenient to use
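One way to check this conclusion on the toy data is to compare the two encodings column by column. This is only a sketch under the assumption that the feature columns are the ones used above; `features`, `via_pandas`, `via_sklearn`, and `same` are illustrative names, and the column renaming is needed only because the two libraries use different separators (`color_red` versus `color=red`).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

features = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'prize': [10.1, 13.5, 15.3],
})

# pandas route: dummy columns are appended after the numeric columns
via_pandas = pd.get_dummies(features, dtype=float)

# scikit-learn route: feature names are sorted and use '=' as the separator
vec = DictVectorizer(sparse=False)
via_sklearn = pd.DataFrame(
    vec.fit_transform(features.to_dict(orient='records')),
    columns=vec.feature_names_)

# Rename and reorder the pandas columns so the two results can be compared
renamed = via_pandas.rename(columns=lambda c: c.replace('color_', 'color='))
same = np.allclose(renamed[via_sklearn.columns].to_numpy(dtype=float),
                   via_sklearn.to_numpy())
```

On this data the values agree once the columns are aligned; the practical difference is that `pd.get_dummies` works directly on the DataFrame, while `DictVectorizer` needs the rows converted to dicts first.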