Encoding the categorical variables

One of the main constraints of scikit-learn is that most of its machine learning algorithms expect numeric input, so they cannot be applied directly to columns that hold categorical text values. For example, the type column in our dataset has five categories:

  • CASH-IN
  • CASH-OUT
  • DEBIT
  • PAYMENT
  • TRANSFER

These categories will have to be encoded into numbers that scikit-learn can make sense of. In order to do this, we have to implement a two-step process. 

The first step is to convert each category into a number: CASH-IN = 0, CASH-OUT = 1, DEBIT = 2, PAYMENT = 3, TRANSFER = 4. We can do this by using the following code:

#Package Imports

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Converting the type column to categorical

df['type'] = df['type'].astype('category')

#Creating the integer encoder object

type_encode = LabelEncoder()

#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df.type)

The code first converts the type column to a categorical feature. We then use LabelEncoder() to initialize an integer encoder object called type_encode. Finally, we apply the fit_transform method to the type column in order to convert each category into a number.
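To check which integer each category received, the fitted encoder can be inspected. The following is a minimal sketch on a toy dataframe (the name toy and its values are assumptions for illustration, not the dataset used in this chapter). LabelEncoder sorts the labels and uses each label's position in classes_ as its code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'type' column (illustrative values only)
toy = pd.DataFrame(
    {"type": ["PAYMENT", "TRANSFER", "CASH-OUT", "DEBIT", "CASH-IN", "PAYMENT"]}
)

encoder = LabelEncoder()
toy["type_code"] = encoder.fit_transform(toy["type"])

# classes_ holds the sorted labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

The same classes_ attribute can be read from type_encode after it has been fitted on the real type column.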

Broadly speaking, there are two types of categorical variables:

  • Nominal 
  • Ordinal

Nominal categorical variables have no inherent order to them. An example of the nominal type of categorical variable is the type column. 

Ordinal categorical variables have an inherent order to them. An example of the ordinal type of categorical variable is Education Level, in which people with a Master's degree have a higher order/value than people with only an undergraduate degree.
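For an ordinal variable, the integers should follow the ranking of the categories. LabelEncoder assigns codes in sorted label order, which need not match that ranking, so one option is scikit-learn's OrdinalEncoder with an explicit category list. The sketch below assumes a hypothetical education column and a three-level ranking; neither is part of the dataset used here:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column, used only to illustrate the idea
people = pd.DataFrame(
    {"education": ["Master's", "Undergraduate", "High School", "Master's"]}
)

# Categories listed from lowest to highest so the codes follow the ranking
levels = ["High School", "Undergraduate", "Master's"]
edu_encoder = OrdinalEncoder(categories=[levels])

# fit_transform returns a 2D float array; flatten it and cast to int
codes = edu_encoder.fit_transform(people[["education"]])
people["education_code"] = codes.ravel().astype(int)
print(people)
```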

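In the case of ordinal categorical variables, integer encoding is sufficient, provided that the integers follow the order of the categories, and we do not need to one-hot encode them. For a nominal variable, however, integer codes would suggest an order that does not exist (for example, that TRANSFER = 4 is somehow greater than DEBIT = 2). Since the type column is a nominal categorical variable, we have to one-hot encode it after integer encoding it. This is done by using the following code: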

#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()

type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

#Adding the one hot encoded variables to the dataset

ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(i) for i in range(type_one_hot_encode.shape[1])])

df = pd.concat([df, ohe_variable], axis=1)

#Dropping the original type variable

df = df.drop('type', axis = 1)

#Viewing the new dataframe after one-hot-encoding

df.head()

In the code, we first create a one-hot encoding object called type_one_hot. We then transform the type column into one-hot encoded columns by using the fit_transform method.

We have five categories that are represented by the integers 0 to 4. Each of these five categories now gets its own column. Therefore, we create five columns called type_0, type_1, type_2, type_3, and type_4. Each of these columns takes one of two values: 1, which indicates the presence of that category, and 0, which indicates its absence.

This information is stored in the ohe_variable dataframe. Since this variable holds the five columns, we join it to the original dataframe by using the concat method from pandas.
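The one-hot step and the join can be tried end to end on a small example. This is a sketch under stated assumptions: the toy dataframe, its amount column, and the choice to pass index=toy.index are illustrative and are not taken from the chapter's code. Reusing the index matters because concat with axis=1 aligns rows by index label:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with integer-encoded categories (0 to 4), standing in for df
toy = pd.DataFrame(
    {"amount": [10.0, 25.5, 7.2, 99.9, 3.0], "type": [3, 4, 1, 0, 2]}
)

one_hot = OneHotEncoder()
encoded = one_hot.fit_transform(toy[["type"]]).toarray()

# Reuse the frame's index so that concat lines the rows up correctly
ohe_cols = pd.DataFrame(
    encoded,
    columns=[f"type_{i}" for i in range(encoded.shape[1])],
    index=toy.index,
)

# Join the new columns and drop the integer-encoded column
toy = pd.concat([toy, ohe_cols], axis=1).drop("type", axis=1)
print(toy)
```

Each row has exactly one 1 across the five type_ columns, marking the category that the row belongs to.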

The integer-encoded type column is then dropped from the dataframe, as this column is redundant after one-hot encoding. The final dataframe now looks like this:
