
Managing categorical data

In many classification problems, the target dataset is made up of categorical labels which cannot be processed immediately by every algorithm. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 categorical samples with two features each:

import numpy as np

>>> X = np.random.uniform(0.0, 1.0, size=(10, 2))
>>> Y = np.random.choice(('Male','Female'), size=(10))
>>> X[0]
array([ 0.8236887 , 0.11975305])
>>> Y[0]
'Male'

The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach: each category label is associated with a progressive integer, which is the index of the corresponding entry in an instance array called classes_:

from sklearn.preprocessing import LabelEncoder

>>> le = LabelEncoder()
>>> yt = le.fit_transform(Y)
>>> print(yt)
[0 0 0 1 0 1 1 0 0 1]

>>> le.classes_
array(['Female', 'Male'], dtype='|S6')

The inverse transformation can be obtained in this simple way:

>>> output = [1, 0, 1, 1, 0, 0]
>>> decoded_output = [le.classes_[i] for i in output]
>>> print(decoded_output)
['Male', 'Female', 'Male', 'Male', 'Female', 'Female']
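
The same decoding can also be performed with the inverse_transform() method exposed by LabelEncoder (a minimal sketch on the same output list):

>>> le.inverse_transform(output)
array(['Male', 'Female', 'Male', 'Male', 'Female', 'Female'], dtype='|S6')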

This approach is simple and works well in many cases, but it has a drawback: all labels are turned into sequential numbers. A classifier that works with real values will then treat numerically close labels as similar, according to their distance, without any concern for the semantics. For this reason, it's often preferable to use so-called one-hot encoding, which binarizes the data. For labels, it can be achieved using the LabelBinarizer class:

from sklearn.preprocessing import LabelBinarizer

>>> lb = LabelBinarizer()
>>> Yb = lb.fit_transform(Y)
>>> Yb
array([[1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1]])

>>> lb.inverse_transform(Yb)
array(['Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male',
       'Male', 'Male'], dtype='|S6')

In this case, each categorical label is first turned into a positive integer and then transformed into a vector where only one feature is 1 while all the others are 0. This means, for example, that the output of a softmax distribution, with a peak corresponding to the main class, can easily be turned into a discrete vector where the only non-null element corresponds to the right class. For example:

import numpy as np

# Y is now assumed to contain five distinct categorical labels
>>> Y = lb.fit_transform(Y)
>>> Y
array([[0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0]])

# model stands for any previously trained classifier with a softmax output
>>> Yp = model.predict(X[0])
>>> Yp
array([[ 0.002, 0.991, 0.001, 0.005, 0.001]])

>>> Ypr = np.round(Yp)
>>> Ypr
array([[ 0., 1., 0., 0., 0.]])

>>> lb.inverse_transform(Ypr)
array(['Female'], dtype='|S6')
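
As model in the preceding snippet stands for any previously trained classifier, the following is a self-contained sketch of the same round trip, assuming a hypothetical five-class label set and a simulated softmax output (names and values are illustrative only):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical five-class label set (illustrative only)
labels = ['Blue', 'Green', 'Red', 'White', 'Yellow']

lb = LabelBinarizer()
lb.fit(labels)

# Simulated softmax output for a single sample (probabilities summing to 1)
softmax_output = np.array([[0.002, 0.991, 0.001, 0.005, 0.001]])

# Rounding yields a discrete one-hot vector that can be decoded directly
one_hot = np.round(softmax_output)
print(lb.inverse_transform(one_hot))

The decoded value is the label whose position matches the peak of the softmax distribution ('Green' in this sketch, as the classes are sorted alphabetically).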

Another approach can be adopted when the categorical features are structured as a list of dictionaries (not necessarily dense; they can have values only for a few features). For example:

data = [
    { 'feature_1': 10.0, 'feature_2': 15.0 },
    { 'feature_1': -5.0, 'feature_3': 22.0 },
    { 'feature_3': -2.0, 'feature_4': 10.0 }
]

In this case, scikit-learn offers the DictVectorizer and FeatureHasher classes; both produce sparse matrices of real numbers that can be fed into any machine learning model. The latter has a limited memory consumption and relies on MurmurHash3 (see https://en.wikipedia.org/wiki/MurmurHash for further information). The code for these two methods is shown as follows:

from sklearn.feature_extraction import DictVectorizer, FeatureHasher

>>> dv = DictVectorizer()
>>> Y_dict = dv.fit_transform(data)

>>> Y_dict.todense()
matrix([[ 10., 15., 0., 0.],
        [ -5., 0., 22., 0.],
        [ 0., 0., -2., 10.]])

>>> dv.vocabulary_
{'feature_1': 0, 'feature_2': 1, 'feature_3': 2, 'feature_4': 3}

>>> fh = FeatureHasher()
>>> Y_hashed = fh.fit_transform(data)

>>> Y_hashed.todense()
matrix([[ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.],
        [ 0., 0., 0., ..., 0., 0., 0.]])
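
The default FeatureHasher output has a very large number of columns, which is why the dense view above is almost entirely zeros. A smaller hash space can be requested through the n_features parameter; this is only a sketch, as a space that is too small increases the risk of collisions:

>>> fh_small = FeatureHasher(n_features=8)
>>> Y_hashed_small = fh_small.fit_transform(data)
>>> Y_hashed_small.shape
(3, 8)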

In both cases, I suggest you read the original scikit-learn documentation to find out about all the possible options and parameters.

When working with categorical features (normally converted into positive integers through LabelEncoder), it's also possible to one-hot encode only a subset of the columns using the OneHotEncoder class. In the following example, the first feature is a binary index which indicates 'Male' or 'Female':

from sklearn.preprocessing import OneHotEncoder

data = [
    [0, 10],
    [1, 11],
    [1, 8],
    [0, 12],
    [0, 15]
]

>>> oh = OneHotEncoder(categorical_features=[0])
>>> Y_oh = oh.fit_transform(data)

>>> Y_oh.todense()
matrix([[ 1., 0., 10.],
        [ 0., 1., 11.],
        [ 0., 1., 8.],
        [ 1., 0., 12.],
        [ 1., 0., 15.]])
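
The categorical_features parameter has since been deprecated and removed in more recent scikit-learn releases; assuming scikit-learn >= 0.20, where ColumnTransformer is available, a roughly equivalent sketch is the following:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = np.array([
    [0, 10],
    [1, 11],
    [1, 8],
    [0, 12],
    [0, 15]
], dtype=np.float64)

# One-hot encode only the first column and pass the second one through unchanged
ct = ColumnTransformer([('gender', OneHotEncoder(), [0])],
                       remainder='passthrough')
Y_oh = ct.fit_transform(data)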

Considering that these approaches considerably increase the number of values (in particular with the binarized versions), all the classes adopt sparse matrices based on the SciPy implementation. See https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html for further information.
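
As a quick sketch of what this means in practice, the objects returned above (here Y_dict from DictVectorizer) can be inspected with SciPy's sparse utilities before optionally converting them to dense form:

>>> from scipy import sparse
>>> sparse.issparse(Y_dict)
True
>>> Y_dict.nnz
6

The todense() calls used throughout this section are only practical for small examples; with real datasets, the sparse matrices should normally be kept as they are.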
