Encoding categorical variables

The final step in preparing the data for the exploratory phase is to encode categorical variables. Some software packages do this behind the scenes, but it is good to understand when and how to do it.

Most statistical models accept only numerical data. Categorical data (which, depending on the context, is sometimes expressed as digits) cannot be used in a model directly. To use such variables, we encode them, that is, give each level a unique numerical code. That covers the when. As for the how, you can use the following recipe.

Getting ready

To execute this recipe, you will need the pandas module.

No other prerequisites are required.

How to do it…

Once again, pandas already has a method that does all of this for us (the data_dummy_code.py file):

import pandas as pd

# csv_read is the DataFrame that holds the data
# dummy code the column with the type of the property
csv_read = pd.get_dummies(
    csv_read,
    prefix='d',
    columns=['type']
)

How it works…

The .get_dummies(...) method converts categorical variables into dummy variables. For example, consider a variable with three different levels:

1  One
2  Two
3  Three

We will need three columns to code it:

1  One    1  0  0
2  Two    0  1  0
3  Three  0  0  1
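
As a minimal sketch of the table above, the following snippet dummy codes a toy column with the same three levels. The DataFrame df and its level column are hypothetical and only serve as an illustration; dtype=int is passed because recent pandas versions may otherwise return Boolean columns instead of 0/1.

```python
import pandas as pd

# hypothetical toy data: one categorical column with three levels
df = pd.DataFrame({'level': ['One', 'Two', 'Three']})

# dtype=int requests 0/1 integers rather than Booleans
dummies = pd.get_dummies(
    df,
    prefix='d',
    columns=['level'],
    dtype=int
)

# one new column per level: d_One, d_Two, and d_Three
print(dummies)
```

Each row has exactly one column set to 1, which is the pattern shown in the table.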

Sometimes, we can get away with using only two additional columns. However, we can use this trick only if one of the levels is, effectively, null:

1  One   1  0
2  Two   0  1
3  Zero  0  0
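
One way to arrive at the two-column coding, sketched here under the assumption that the Zero level is the one we treat as null, is to dummy code all the levels and then drop the column for that level. The df DataFrame and its level column are hypothetical.

```python
import pandas as pd

# hypothetical toy data: the 'Zero' level is treated as the null level
df = pd.DataFrame({'level': ['One', 'Two', 'Zero']})

# code all three levels first
full = pd.get_dummies(
    df,
    prefix='d',
    columns=['level'],
    dtype=int
)

# drop the column of the null level; its rows are now all zeros
reduced = full.drop(columns=['d_Zero'])

print(reduced)
```

The .get_dummies(...) method also accepts a drop_first parameter, but that removes the first level rather than a level of our choosing, so the explicit drop is used in this sketch.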

The first parameter to the .get_dummies(...) method is the DataFrame. The columns parameter specifies the column (or columns, as we can also pass a list) in the DataFrame to dummy code. By specifying the prefix, we instruct the method that the names of the newly generated columns should start with d_; in our example, the generated dummy-coded columns will have names such as d_Condo. The underscore _ character is the default separator, but it can be changed by specifying the prefix_sep parameter.
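
The effect of these parameters can be sketched on a small, made-up DataFrame. The price column, its values, and the Residential level are assumptions added for illustration; only the type column and the Condo level come from the recipe.

```python
import pandas as pd

# hypothetical data: 'type' is categorical, 'price' is left untouched
df = pd.DataFrame({
    'type': ['Condo', 'Residential', 'Condo'],
    'price': [100, 250, 120],
})

# use a hyphen instead of the default underscore as the separator
coded = pd.get_dummies(
    df,
    prefix='d',
    prefix_sep='-',
    columns=['type'],
    dtype=int
)

# the original 'type' column is replaced by d-Condo and d-Residential
print(coded.columns.tolist())
```

Columns that are not listed in the columns parameter, such as price here, are passed through unchanged.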

Tip

For a full list of parameters to the .get_dummies(...) method, see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html.
