
Dummy features

Dummy features are also known as categorical or binary features. This approach is a good choice when the feature to be transformed has only a small number of distinct values. In the Titanic data samples, the Embarked feature has only three distinct values (S, C, and Q) that occur frequently. So, we can transform the Embarked feature into three dummy variables ('Embarked_S', 'Embarked_C', and 'Embarked_Q') so that it can be used with the random forest classifier.

The following code will show you how to do this kind of transformation:

import pandas as pd

# constructing binary features
# (df_titanic_data and keep_binary are expected to be defined at module level)
def process_embarked():
    global df_titanic_data

    # replacing the missing values with the most common value in the variable
    missing = df_titanic_data['Embarked'].isnull()
    df_titanic_data.loc[missing, 'Embarked'] = df_titanic_data['Embarked'].dropna().mode()[0]

    # converting the values into numbers
    df_titanic_data['Embarked'] = pd.factorize(df_titanic_data['Embarked'])[0]

    # binarizing the constructed features
    # (after factorize, the new columns are named Embarked_0, Embarked_1, Embarked_2)
    if keep_binary:
        df_titanic_data = pd.concat([df_titanic_data, pd.get_dummies(df_titanic_data['Embarked']).rename(
            columns=lambda x: 'Embarked_' + str(x))], axis=1)
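
To see the transformation in isolation, here is a minimal sketch on a tiny made-up sample rather than the real Titanic data. The `sample` frame and the `add_embarked_dummies` helper are hypothetical names introduced only for illustration. Unlike the function above, the sketch skips the `pd.factorize` step, so the dummy columns keep the original letters in their names.

```python
import pandas as pd

# Hypothetical five-row sample standing in for the Titanic data (assumption).
sample = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})


def add_embarked_dummies(df):
    """Return a copy of df with one Embarked_* indicator column per port."""
    out = df.copy()

    # Fill missing values with the most common port in the column.
    most_common = out["Embarked"].mode().iloc[0]
    out["Embarked"] = out["Embarked"].fillna(most_common)

    # One indicator column per distinct value, prefixed with the feature name.
    dummies = pd.get_dummies(out["Embarked"], prefix="Embarked")
    return pd.concat([out, dummies], axis=1)


encoded = add_embarked_dummies(sample)
print(encoded)
```

Each row ends up with exactly one of the `Embarked_*` columns set, which is the form a tree-based classifier can consume directly.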