
Creating dummy variables

Creating dummy variables means creating a separate variable for each category of a categorical variable. Although a categorical variable contains plenty of information and might show a causal relationship with the output variable, it can't be used in predictive models such as linear and logistic regression without some processing.

In our dataset, sex is a categorical variable with two categories, male and female. We can create two dummy variables out of it as follows:

dummy_sex = pd.get_dummies(data['sex'], prefix='sex')

The result of this statement is as follows:

Fig. 2.17: Dummy variable for the sex variable in the Titanic dataset

This process is called dummifying the variable. It creates two new variables, sex_female and sex_male, that take a value of either 1 or 0 depending on the sex of the passenger. If the sex was female, sex_female would be 1 and sex_male would be 0. If the sex was male, sex_male would be 1 and sex_female would be 0. In general, all but one dummy variable in a row will have a value of 0; the dummy variable that corresponds to the row's value in the original column will have a value of 1.
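The one-active-dummy-per-row behaviour can be checked on a small example. The following is a minimal sketch, not the Titanic data itself: the toy frame and the names toy, toy_dummies, and row_sums are made up for illustration. It also passes dtype=int, which is an addition to the original call, because recent pandas versions return True/False instead of 1/0 by default.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# dtype=int forces 1/0 output; pandas 2.0 and later return booleans by default.
toy_dummies = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)

# Sum the dummy columns across each row: exactly one of them should be 1.
row_sums = toy_dummies.sum(axis=1)

print(toy_dummies)
print(row_sums.tolist())
```

The dummy columns are named by joining the prefix and the category with an underscore, which is where sex_female and sex_male come from.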

These two new variables can be joined to the source data frame so that they can be used in the models. One way to do that is as follows:

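column_name = data.columns.values.tolist()  # all column names as a list
column_name.remove('sex')  # leave out the original categorical column
data[column_name].join(dummy_sex)  # returns a new data frame; data itself is unchanged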

The column names are converted to a list, and sex is removed from that list before the two dummy variables are joined to the dataset, because it makes no sense to keep the original sex variable alongside the two dummy variables derived from it.
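The remove-then-join step can be sketched end to end on a small frame. The data below is invented, and the names toy, keep, toy_model, and toy_alt are hypothetical. The last line shows an alternative that uses DataFrame.drop, which is not the approach in the text above but should give the same result.

```python
import pandas as pd

# Invented rows standing in for the Titanic data.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'age': [22, 38, 26],
    'survived': [0, 1, 1],
})
toy_dummies = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)

# Same steps as in the text: list the columns, remove 'sex', then join.
keep = toy.columns.values.tolist()
keep.remove('sex')
toy_model = toy[keep].join(toy_dummies)

# Alternative (an assumption, not the text's method): drop the column directly.
toy_alt = toy.drop(columns='sex').join(toy_dummies)

print(toy_model)
```

Because join returns a new data frame, the result has to be assigned to a name, as toy_model is here, if it is going to be passed to a model later; the original frame still contains the sex column.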
