Data Preprocessing (2): One-hot encoding, using pandas.get_dummies and sklearn.feature_extraction.DictVectorizer


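There are two cases when encoding a discrete feature:

1. The values of the feature have no meaningful order, for example color: [red, blue]. In this case, use one-hot encoding.

2. The values of the feature have a meaningful order, for example size: [X, XL, XXL]. In this case, map the values to numbers: {X: 1, XL: 2, XXL: 3}.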
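To make the distinction concrete, here is a small sketch (the three colors and their integer codes are made up for illustration) showing why integer codes are a poor fit for an unordered feature. Integer codes make some colors look "closer" than others, while one-hot vectors keep every pair equally far apart.

```python
import numpy as np

# Hypothetical nominal feature with three colors.
colors = ["red", "blue", "green"]

# Naive integer codes impose an order that the colors do not have.
int_codes = {"red": 0.0, "blue": 1.0, "green": 2.0}

# One-hot vectors: one axis per color.
one_hot = {c: np.eye(len(colors))[i] for i, c in enumerate(colors)}

def pairwise_distances(encoding):
    """Euclidean distance between every pair of encoded colors."""
    pairs = [("red", "blue"), ("red", "green"), ("blue", "green")]
    return {
        (a, b): float(np.linalg.norm(np.atleast_1d(encoding[a]) - np.atleast_1d(encoding[b])))
        for a, b in pairs
    }

int_dist = pairwise_distances(int_codes)
hot_dist = pairwise_distances(one_hot)
print(int_dist)  # red/green ends up twice as far apart as red/blue
print(hot_dist)  # all three pairs are the same distance apart
```

For an ordered feature such as size, the unequal distances are exactly what we want, which is why a numeric mapping is appropriate there.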

In [90]:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
np.set_printoptions(precision=4)





In [91]:


df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'prize', 'class label']
df




Out[91]:


   color size  prize class label
0  green    M   10.1      class1
1    red    L   13.5      class2
2   blue   XL   15.3      class1





In [92]:

size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].map(size_mapping)
df







Out[92]:



   color  size  prize class label
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class1
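One thing to watch with a dictionary mapping is that `Series.map` silently turns any value missing from the dictionary into NaN. The sketch below uses a hypothetical extra size 'S' (not in the original data) to show how to spot such values, and how to invert the mapping to get the original labels back.

```python
import pandas as pd

size_mapping = {"XL": 3, "L": 2, "M": 1}

# Hypothetical column containing a value ("S") that the mapping does not cover.
sizes = pd.Series(["M", "L", "XL", "S"])
encoded = sizes.map(size_mapping)

# Values missing from the mapping become NaN rather than raising an error.
unmapped = sizes[encoded.isna()].tolist()
print(unmapped)

# Inverting the dictionary recovers the original labels from the codes.
inv_size_mapping = {code: label for label, code in size_mapping.items()}
decoded = encoded.dropna().astype(int).map(inv_size_mapping)
print(decoded.tolist())
```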





In [93]:

# -----------------------------------------------
# Process with pd.get_dummies()
pd.get_dummies(df)










Out[93]:


   size  prize  color_blue  color_green  color_red  class label_class1  class label_class2
0     1   10.1           0            1          0                   1                   0
1     2   13.5           0            0          1                   0                   1
2     3   15.3           1            0          0                   1                   0
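Called on the whole DataFrame, `pd.get_dummies` encodes every string column, including the class label. The following sketch rebuilds the same small table and shows two optional parameters that may be useful: `columns` to restrict encoding to chosen columns, and `drop_first` to drop one redundant dummy per feature. Passing `dtype=int` is a precaution, since newer pandas releases return boolean dummies by default.

```python
import pandas as pd

df = pd.DataFrame(
    [["green", 1, 10.1, "class1"],
     ["red", 2, 13.5, "class2"],
     ["blue", 3, 15.3, "class1"]],
    columns=["color", "size", "prize", "class label"],
)

# Encode only the 'color' column; dtype=int keeps 0/1 output on pandas
# versions whose default dummy dtype is bool.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded.columns.tolist())

# drop_first=True drops the first dummy of each encoded feature,
# since it can be inferred from the remaining ones.
reduced = pd.get_dummies(df, columns=["color"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```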





In [94]:




df







Out[94]:



   color  size  prize class label
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class1





In [95]:



# -----------------------------------------------
# Process with sklearn.feature_extraction.DictVectorizer
# Note: df.ix was removed in pandas 1.0, so .loc / .iloc are used instead.
feature_list = []
label_list = []
for row in df.index:
    # The last column holds the class label
    label_list.append(df.loc[row].iloc[-1])
    rowDict = {}
    # Every other column is a feature
    for i in range(0, len(df.columns) - 1):
        rowDict[df.columns[i]] = df.loc[row].iloc[i]
    feature_list.append(rowDict)
feature_list


Out[95]:



[{'color': 'green', 'prize': 10.1, 'size': 1},
 {'color': 'red', 'prize': 13.5, 'size': 2},
 {'color': 'blue', 'prize': 15.300000000000001, 'size': 3}]




In [96]:



label_list




Out[96]:



['class1', 'class2', 'class1']




In [97]:


from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()

# DictVectorizer.fit_transform() takes a list of dicts

dummy_x = vec.fit_transform(feature_list).toarray()

dummy_x




Out[97]:


array([[  0. ,   1. ,   0. ,  10.1,   1. ],
       [  0. ,   0. ,   1. ,  13.5,   2. ],
       [  1. ,   0. ,   0. ,  15.3,   3. ]])
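The array above has no column names, so it is not obvious which column is which. As a sketch, the fitted vectorizer's `feature_names_` attribute lists the columns in output order. The example also builds the list of dicts with `DataFrame.to_dict(orient="records")`, which is an alternative to the explicit loop rather than the method used in this post.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame(
    [["green", 1, 10.1, "class1"],
     ["red", 2, 13.5, "class2"],
     ["blue", 3, 15.3, "class1"]],
    columns=["color", "size", "prize", "class label"],
)

# One dict per row, with the label column left out.
feature_list = df.drop(columns=["class label"]).to_dict(orient="records")
label_list = df["class label"].tolist()

vec = DictVectorizer(sparse=False)  # return a dense array directly
dummy_x = vec.fit_transform(feature_list)

# feature_names_ gives the meaning of each output column, in order.
print(vec.feature_names_)
print(dummy_x)
```

String-valued features are expanded into `name=value` columns, while numeric features such as prize and size pass through unchanged.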




In [98]:


from sklearn import preprocessing

label_bin = preprocessing.LabelBinarizer()

# preprocessing.LabelBinarizer.fit_transform() takes a list

dummy_y = label_bin.fit_transform(label_list)

dummy_y




Out[98]:


array([[0],
       [1],
       [0]])



In [99]:



# Test the result when there are more than 2 label classes.
# Assign through .loc: the chained form df['class label'][2] = ... triggers
# pandas' SettingWithCopyWarning.
df.loc[2, 'class label'] = 'class3'
df


Out[99]:



   color  size  prize class label
0  green     1   10.1      class1
1    red     2   13.5      class2
2   blue     3   15.3      class3





In [100]:

feature_list = []
label_list = []
for row in df.index[:]:
    label_list.append(df.loc[row].iloc[-1])
    rowDict = {}
    for i in range(0, len(df.columns) - 1):
        rowDict[df.columns[i]] = df.loc[row].iloc[i]
    feature_list.append(rowDict)
dummy_y = label_bin.fit_transform(label_list)
dummy_y



Out[100]:



array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
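The two `LabelBinarizer` outputs have different shapes: a single 0/1 column for two classes, and one column per class for three or more. The sketch below puts both cases side by side and uses `inverse_transform` to map the encoded array back to the original labels. The variable names are chosen only for this illustration.

```python
from sklearn.preprocessing import LabelBinarizer

two_class = ["class1", "class2", "class1"]
three_class = ["class1", "class2", "class3"]

lb2 = LabelBinarizer()
y2 = lb2.fit_transform(two_class)      # single 0/1 column
lb3 = LabelBinarizer()
y3 = lb3.fit_transform(three_class)    # one column per class

print(y2.shape, y3.shape)
print(lb3.classes_)

# inverse_transform maps the indicator array back to the labels
print(lb3.inverse_transform(y3))
print(lb2.inverse_transform(y2))
```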




In [ ]:




# Conclusion: the two approaches give nearly the same result, but pd.get_dummies is more convenient to use
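To check the claim that the two approaches give the same result, here is a minimal sketch that encodes the same feature columns both ways and compares the matrices after aligning the column names. The renaming of `=` to `_` is only a convenience for this comparison.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

features = pd.DataFrame({
    "color": ["green", "red", "blue"],
    "size": [1, 2, 3],
    "prize": [10.1, 13.5, 15.3],
})

# pandas route
pd_encoded = pd.get_dummies(features, dtype=float)

# scikit-learn route
vec = DictVectorizer(sparse=False)
sk_encoded = vec.fit_transform(features.to_dict(orient="records"))

# DictVectorizer names dummies 'color=blue'; get_dummies uses 'color_blue'.
sk_columns = [name.replace("=", "_") for name in vec.feature_names_]
sk_frame = pd.DataFrame(sk_encoded, columns=sk_columns)

# Reorder the pandas columns to match, then compare values.
same = np.allclose(pd_encoded[sk_columns].to_numpy(dtype=float), sk_frame.to_numpy())
print(same)
```

One practical difference is that a fitted `DictVectorizer` can be reused: calling `vec.transform` on new rows produces the same columns as the training data, whereas `pd.get_dummies` derives its columns from whatever data it is given.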

