Title: Machine Learning with scikit-learn Quick Start Guide
Author: Kevin Jolly
Last updated: 2021-06-24 18:15:55

Encoding the categorical variables

One of the main constraints of scikit-learn is that most of its estimators expect numeric input, so you cannot fit the machine learning algorithms directly on columns that are categorical in nature. For example, the type column in our dataset has five categories:

CASH-IN
CASH-OUT
DEBIT
PAYMENT
TRANSFER

These categories have to be encoded into numbers that scikit-learn can make sense of. We do this in two steps: integer encoding, followed by one-hot encoding.

The first step is to convert each category into a number: CASH-IN = 0, CASH-OUT = 1, DEBIT = 2, PAYMENT = 3, TRANSFER = 4. We can do this by using the following code:

#Package Imports

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Converting the type column to categorical

df['type'] = df['type'].astype('category')

#Initializing the integer encoder

type_encode = LabelEncoder()

#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df['type'])

The code first converts the type column to a categorical feature. We then use LabelEncoder() to initialize an integer encoder object called type_encode. Finally, we apply the fit_transform method to the type column to convert each category into a number. LabelEncoder assigns the integers in sorted order of the category names, which gives the mapping listed above.
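To make the mapping concrete, here is a minimal sketch that runs the same integer encoding on a small made-up dataframe. The rows and the names toy, encoder, and mapping are assumptions for illustration and are not part of the book's dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'type' column (made-up rows, not the real dataset)
toy = pd.DataFrame(
    {"type": ["PAYMENT", "CASH-IN", "TRANSFER", "DEBIT", "CASH-OUT", "PAYMENT"]}
)

encoder = LabelEncoder()
toy["type_code"] = encoder.fit_transform(toy["type"])

# classes_ holds the categories in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(toy)
```

The classes_ attribute is also what inverse_transform uses, so the original labels can be recovered from the codes if they are needed later.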

Broadly speaking, there are two types of categorical variables:

Nominal
Ordinal

Nominal categorical variables have no inherent order to them. An example of the nominal type of categorical variable is the type column.

Ordinal categorical variables have an inherent order to them. An example of the ordinal type of categorical variable is Education Level, in which people with a Master's degree have a higher order/value than people with only an undergraduate degree.
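For an ordinal variable, the integers should follow the intended order. An encoder that sorts labels alphabetically would place "Master's" before "Undergraduate", so one option is to state the order explicitly. The sketch below uses scikit-learn's OrdinalEncoder with its categories parameter; the education column and its rows are hypothetical and do not come from the book's dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical education column (made-up rows)
people = pd.DataFrame(
    {"education": ["Master's", "Undergraduate", "Master's", "Undergraduate"]}
)

# State the order explicitly, from lowest to highest
levels = ["Undergraduate", "Master's"]
encoder = OrdinalEncoder(categories=[levels])

# OrdinalEncoder expects 2D input, hence the double brackets around the column name
encoded = encoder.fit_transform(people[["education"]])
people["education_code"] = encoded.ravel().astype(int)

print(people)
```

With the order given this way, Undergraduate is encoded as 0 and Master's as 1, so a larger code corresponds to a higher level.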

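In the case of ordinal categorical variables, integer encoding, as illustrated previously, is sufficient as long as the integers follow the order of the categories, and we do not need to one-hot encode them. For a nominal variable, the integer codes would suggest an order (for example, TRANSFER = 4 being greater than CASH-IN = 0) that does not exist. Since the type column is a nominal categorical variable, we have to one-hot encode it after integer encoding it. This is done by using the following code: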

#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()

#reshape(-1,1) gives the 2D input the encoder expects; toarray() converts the sparse result to a dense array

type_one_hot_encode = type_one_hot.fit_transform(df['type'].values.reshape(-1,1)).toarray()

#Adding the one hot encoded variables to the dataset (reusing df's index so that the rows line up)

ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(i) for i in range(type_one_hot_encode.shape[1])], index = df.index)

df = pd.concat([df, ohe_variable], axis=1)

#Dropping the original type variable

df = df.drop('type', axis = 1)

#Viewing the new dataframe after one-hot-encoding

df.head()

In the code, we first create a one-hot encoding object called type_one_hot. We then transform the type column into one-hot encoded columns by using the fit_transform method.

We have five categories that are represented by the integers 0 to 4. Each of these five categories now gets its own column. Therefore, we create five columns called type_0, type_1, type_2, type_3, and type_4. Each of these five columns takes one of two values: 1, which indicates the presence of that category, and 0, which indicates the absence of that category.
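The presence/absence layout can be checked on a handful of integer codes. The following is a minimal sketch with made-up codes; the names codes, encoder, and one_hot are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer codes 0..4 standing in for the five transaction types (made-up rows)
codes = np.array([3, 0, 4, 2, 1, 3]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
one_hot = encoder.fit_transform(codes).toarray()

print(one_hot)

# Every row contains exactly one 1 ...
row_sums = one_hot.sum(axis=1)
# ... and it sits in the column whose position matches the integer code
hot_positions = one_hot.argmax(axis=1)
print(row_sums, hot_positions)
```

Here the first row has code 3, so its only 1 appears in the fourth column, which would be named type_3 under the naming scheme used in this section.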

This information is stored in ohe_variable. Since this variable holds the five columns, we join it to the original dataframe by using the concat method from pandas.
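One detail worth keeping in mind is that concat with axis=1 aligns rows by index label, not by position. The sketch below illustrates this on a made-up dataframe whose index does not start at 0 (as can happen after rows are filtered); df_toy, the amount column, and the two-column one_hot array are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up dataframe whose index is not 0, 1, 2 (for example, after filtering rows)
df_toy = pd.DataFrame({"amount": [10.0, 20.0, 30.0]}, index=[5, 6, 7])
one_hot = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
names = ["type_0", "type_1"]

# Without an explicit index, the new frame is labelled 0, 1, 2 and does not line up
misaligned = pd.concat([df_toy, pd.DataFrame(one_hot, columns=names)], axis=1)

# Reusing df_toy's index keeps each encoded row next to its source row
aligned = pd.concat(
    [df_toy, pd.DataFrame(one_hot, columns=names, index=df_toy.index)], axis=1
)

print(misaligned.shape)  # (6, 3): rows are padded with NaN
print(aligned.shape)     # (3, 3): no NaN
```

If the dataframe still has its default 0 to n-1 index, both versions give the same result.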

The original type column is then dropped from the dataframe, as this column is redundant after one-hot encoding. The final dataframe, as displayed by df.head(), now contains the columns type_0 to type_4 in place of type.
