Encoding the categorical variables

One of the main constraints of scikit-learn is that most of its machine learning algorithms expect numeric input and cannot be applied directly to columns that are categorical in nature. For example, the type column in our dataset has five categories:

  • CASH-IN
  • CASH-OUT
  • DEBIT
  • PAYMENT
  • TRANSFER

These categories have to be encoded as numbers that scikit-learn can work with. To do this, we follow a two-step process.

The first step is to convert each category into a number: CASH-IN = 0, CASH-OUT = 1, DEBIT = 2, PAYMENT = 3, TRANSFER = 4. We can do this by using the following code:

#Package Imports

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

#Converting the type column to categorical

df['type'] = df['type'].astype('category')

#Initializing the integer encoder

type_encode = LabelEncoder()

#Integer encoding the 'type' column

df['type'] = type_encode.fit_transform(df.type)

The code first converts the type column to a categorical feature. We then use LabelEncoder() to initialize an integer encoder object called type_encode. Finally, we apply the fit_transform method to the type column in order to convert each category into a number.
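To see which integer each category receives, the encoder can be run on a small stand-in column. The following is a minimal sketch, assuming a toy dataframe (toy) with made-up rows rather than the book's dataset; the names toy, encoder, and mapping are illustrative only. LabelEncoder stores the categories in sorted order in its classes_ attribute, and each category's position there is its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'type' column (rows are made up for illustration)
toy = pd.DataFrame(
    {"type": ["PAYMENT", "TRANSFER", "CASH-OUT", "DEBIT", "CASH-IN", "PAYMENT"]}
)

# Fit the encoder and replace each category with its integer code
encoder = LabelEncoder()
toy["type_code"] = encoder.fit_transform(toy["type"])

# classes_ holds the categories in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(toy)
```

On this toy column the mapping matches the one listed earlier: CASH-IN = 0 through TRANSFER = 4.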

Broadly speaking, there are two types of categorical variables:

  • Nominal 
  • Ordinal

Nominal categorical variables have no inherent order to them. An example of the nominal type of categorical variable is the type column. 

Ordinal categorical variables have an inherent order to them. An example of the ordinal type of categorical variable is Education Level, in which a Master's degree ranks higher than an undergraduate degree.
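For an ordinal variable, the integers are only meaningful if they follow the intended ranking. The following is a minimal sketch, assuming a made-up education column and an assumed ranking (levels); neither comes from the dataset used in this chapter. It uses an ordered pandas Categorical to set the ranking explicitly, and compares the result with LabelEncoder, which assigns integers in sorted (alphabetical) order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up ordinal column for illustration
education = pd.Series(["Undergraduate", "Master's", "High School", "Master's"])

# Assumed ranking, from lowest to highest
levels = ["High School", "Undergraduate", "Master's"]

# An ordered Categorical assigns codes that follow the ranking above
ranked = pd.Categorical(education, categories=levels, ordered=True)
ranked_codes = ranked.codes.tolist()

# LabelEncoder sorts the labels alphabetically instead
alphabetical_codes = LabelEncoder().fit_transform(education).tolist()

print(ranked_codes)        # [1, 2, 0, 2]
print(alphabetical_codes)  # [2, 1, 0, 1]
```

With the alphabetical codes, Master's (1) would rank below Undergraduate (2), so it is worth checking the assigned integers whenever the order matters.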

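In the case of ordinal categorical variables, integer encoding, as illustrated previously, is sufficient and we do not need to one-hot encode them. For a nominal variable, however, the integer codes would suggest an ordering (for example, TRANSFER = 4 being greater than CASH-IN = 0) that does not exist. Since the type column is a nominal categorical variable, we have to one-hot encode it after integer encoding it. This is done by using the following code: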

#One hot encoding the 'type' column

type_one_hot = OneHotEncoder()

type_one_hot_encode = type_one_hot.fit_transform(df.type.values.reshape(-1,1)).toarray()

#Adding the one hot encoded variables to the dataset

ohe_variable = pd.DataFrame(type_one_hot_encode, columns = ["type_"+str(i) for i in range(type_one_hot_encode.shape[1])], index = df.index)

df = pd.concat([df, ohe_variable], axis=1)

#Dropping the original type variable

df = df.drop('type', axis = 1)

#Viewing the new dataframe after one-hot-encoding

df.head()

In the code, we first create a one-hot encoding object called type_one_hot. We then transform the type column into one-hot encoded columns by using the fit_transform method.

We have five categories that are represented by the integers 0 to 4. Each of these five categories now gets its own column. Therefore, we create five columns called type_0, type_1, type_2, type_3, and type_4. Each of these five columns takes one of two values: 1, which indicates the presence of that category, and 0, which indicates the absence of that category.
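The shape of the one-hot output can be checked on a handful of integer codes. The following is a minimal sketch, assuming a made-up array of codes (codes) rather than the real type column. Each row of the result contains a single 1, in the column that corresponds to that row's category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up integer codes covering all five categories (0 to 4)
codes = np.array([3, 4, 1, 2, 0, 3]).reshape(-1, 1)

# fit_transform returns a sparse matrix; toarray() makes it a dense array
one_hot = OneHotEncoder().fit_transform(codes).toarray()

print(one_hot.shape)  # (6, 5): six rows, one column per category
print(one_hot[0])     # [0. 0. 0. 1. 0.] -> the first row has code 3
```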

This information is stored in ohe_variable. Since this variable holds the five columns, we join it to the original dataframe by using the concat method from pandas.
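One detail of this join is that concat with axis=1 pairs rows by index label, not by position. The following is a minimal sketch, assuming a toy dataframe (toy) with made-up values and an index that does not start at 0, as can happen after rows have been filtered; all names here are illustrative. If the dataframe still has its default index, both variants give the same result.

```python
import numpy as np
import pandas as pd

# Toy frame whose index starts at 10 instead of 0 (values are made up)
toy = pd.DataFrame({"amount": [10.0, 25.5, 7.2]}, index=[10, 11, 12])

# Hand-written one-hot rows for two hypothetical categories
one_hot = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
cols = ["type_0", "type_1"]

# A default index (0, 1, 2) does not match toy's index, so rows are not paired
misaligned = pd.concat([toy, pd.DataFrame(one_hot, columns=cols)], axis=1)

# Reusing toy's index keeps each one-hot row next to its source row
aligned = pd.concat(
    [toy, pd.DataFrame(one_hot, columns=cols, index=toy.index)], axis=1
)

print(misaligned.shape)  # (6, 3), with NaN where the labels did not match
print(aligned.shape)     # (3, 3)
```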

The integer-encoded type column is then dropped from the dataframe, as it is redundant after one-hot encoding. The final dataframe now looks like this:
