- Machine Learning with scikit-learn Quick Start Guide
- Kevin Jolly
- 2021-06-24 18:15:55
Encoding the categorical variables
One of the main constraints of scikit-learn is that its machine learning algorithms expect numeric input, so they cannot be applied directly to columns that are categorical in nature. For example, the type column in our dataset has five categories:
- CASH-IN
- CASH-OUT
- DEBIT
- PAYMENT
- TRANSFER
These categories will have to be encoded into numbers that scikit-learn can make sense of. In order to do this, we have to implement a two-step process.
The first step is to convert each category into a number: CASH-IN = 0, CASH-OUT = 1, DEBIT = 2, PAYMENT = 3, TRANSFER = 4. We can do this by using the following code:
#Package Imports
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
#Converting the type column to categorical
df['type'] = df['type'].astype('category')
#Creating the integer encoder object
type_encode = LabelEncoder()
#Integer encoding the 'type' column
df['type'] = type_encode.fit_transform(df['type'])
The code first converts the type column to a categorical feature. We then use LabelEncoder() to initialize an integer encoder object called type_encode. Finally, we apply the fit_transform method to the type column to convert each category into a number.
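To make the mapping concrete, here is a minimal sketch that runs the same integer encoding on a small, made-up stand-in for the type column. The sample dataframe and the mapping variable are illustrative assumptions, not part of the book's dataset. LabelEncoder assigns codes in the sorted order of the category names, which is why CASH-IN ends up as 0 and TRANSFER as 4.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'type' column (illustrative values only)
sample = pd.DataFrame(
    {"type": ["PAYMENT", "TRANSFER", "CASH-OUT", "DEBIT", "CASH-IN", "PAYMENT"]}
)

encoder = LabelEncoder()
codes = encoder.fit_transform(sample["type"])

# classes_ holds the categories in sorted order; the position of each
# category in classes_ is the integer code it was assigned
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'CASH-IN': 0, 'CASH-OUT': 1, 'DEBIT': 2, 'PAYMENT': 3, 'TRANSFER': 4}
print(codes.tolist())
# [3, 4, 1, 2, 0, 3]

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes).tolist()
```

Keeping the fitted encoder around is useful because inverse_transform lets you translate the integers back into readable category names later on.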
Broadly speaking, there are two types of categorical variables:
- Nominal
- Ordinal
Nominal categorical variables have no inherent order to them. An example of the nominal type of categorical variable is the type column.
Ordinal categorical variables have an inherent order to them. An example of the ordinal type of categorical variable is Education Level, in which people with a Master's degree will have a higher order/value compared to people with only an undergraduate degree.
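For an ordinal variable, the integers should follow the real order of the categories. The sketch below is one possible way to do that, using scikit-learn's OrdinalEncoder with an explicit category list. The people dataframe, the education column, and the three levels are hypothetical examples and do not come from our dataset. An alphabetical encoding would place Master's before Undergraduate, so the order is spelled out by hand.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: an Education Level column with three levels
people = pd.DataFrame(
    {"education": ["Master's", "Undergraduate", "High School", "Undergraduate"]}
)

# The order is stated explicitly, from lowest to highest level (assumed order)
levels = ["High School", "Undergraduate", "Master's"]
ordinal = OrdinalEncoder(categories=[levels])

# OrdinalEncoder expects a 2D input and returns a 2D float array
encoded = ordinal.fit_transform(people[["education"]])
people["education_code"] = encoded.ravel().astype(int)

print(people["education_code"].tolist())
# [2, 1, 0, 1]
```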
In the case of ordinal categorical variables, integer encoding, as illustrated previously, is sufficient as long as the integers follow the order of the categories, and we do not need to one-hot encode them. Since the type column is a nominal categorical variable, the integers 0 to 4 would suggest an order that does not exist, so we have to one-hot encode it after integer encoding it. This is done by using the following code:
#One hot encoding the 'type' column
type_one_hot = OneHotEncoder()
type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()
#Adding the one hot encoded variables to the dataset (reusing the index of df so that the rows line up)
ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(i) for i in range(type_one_hot_encode.shape[1])], index = df.index)
df = pd.concat([df, ohe_variable], axis=1)
#Dropping the original type variable
df = df.drop('type', axis = 1)
#Viewing the new dataframe after one-hot-encoding
df.head()
In the code, we first create a one-hot encoding object called type_one_hot. We then transform the type column into one-hot encoded columns by using the fit_transform method.
We have five categories that are represented by the integers 0 to 4. Each of these five categories will now get its own column. Therefore, we create five columns called type_0, type_1, type_2, type_3, and type_4. Each of these five columns holds one of two values: 1, which indicates the presence of that category, and 0, which indicates the absence of that category.
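The following minimal sketch shows what the one-hot step produces for a handful of integer codes. The codes array is made up for illustration. Each row of the result contains exactly one 1, in the column that matches its category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer codes for six illustrative rows, reshaped into a single column
codes = np.array([3, 4, 1, 2, 0, 3]).reshape(-1, 1)

# fit_transform returns a sparse matrix; toarray converts it to a dense array
one_hot = OneHotEncoder().fit_transform(codes).toarray()

print(one_hot.shape)
# (6, 5)
print(one_hot[0])
# [0. 0. 0. 1. 0.]   (code 3 sets the fourth column to 1)

# Every row has exactly one column set to 1
row_sums = one_hot.sum(axis=1)
```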
This information is stored in ohe_variable, which is itself a dataframe. Since it holds the five columns, we join it to the original dataframe by using the concat function from pandas.
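One detail worth keeping in mind is that concat with axis=1 matches rows by their index labels and not by their position. The sketch below uses two small, made-up dataframes (df_demo and ohe_demo are hypothetical names) to show the difference between matching and mismatched indexes. It assumes a situation in which the original dataframe no longer has the default 0, 1, 2, ... index, for example after rows have been filtered out.

```python
import pandas as pd

# Hypothetical dataframe whose index does not start at 0
df_demo = pd.DataFrame({"amount": [10.0, 20.0, 30.0]}, index=[5, 6, 7])

# One-hot columns built with the same index as df_demo
ohe_demo = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    columns=["type_0", "type_1"],
    index=df_demo.index,
)

# Matching indexes: the rows line up one to one
joined = pd.concat([df_demo, ohe_demo], axis=1)
print(joined.shape)
# (3, 3)

# Mismatched indexes: pandas takes the union of the labels and fills with NaN
misaligned = pd.concat([df_demo, ohe_demo.reset_index(drop=True)], axis=1)
print(misaligned.shape)
# (6, 3)
```

If the dataframe still has its default index, both approaches give the same result.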
The original, integer-encoded type column is then dropped from the dataframe, as this column is now redundant after one-hot encoding. The final dataframe now looks like this: