
One-hot encoding

Most machine learning algorithms can't work with categorical variables directly, so we usually convert them to one-hot vectors (statisticians prefer to call them dummy variables). Let's convert first, and then I will explain what this is:

In []: 
features = pd.get_dummies(features, columns=['color'])
features.head()
Out[]: 

So now, instead of a single color column, we have four columns: color_light black, color_pink gold, color_purple polka dot, and color_space gray. The color of each sample is encoded as 1 in the corresponding column. Why do we need this if we could simply replace the colors with the numbers from 1 to 4? Well, this is the problem: why prefer 1 to 4 over 4 to 1, or powers of 2, or prime numbers? The colors on their own carry no quantitative information. They can't be sorted from the largest to the smallest. If we introduce such an ordering artificially, the machine learning algorithm may attempt to use that meaningless information, and we will end up with a classifier that sees regularities where there are none.
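To make the difference concrete, here is a minimal sketch on a made-up four-row table. The `toy` DataFrame, its `length` column, and the `codes` mapping are assumptions for illustration only, not the dataset used in this chapter. It one-hot encodes the same four colors and compares the result with an arbitrary 1-to-4 numbering. Note that recent pandas versions return True/False indicator columns by default, so `dtype=int` is passed here to get 1s and 0s.

```python
import pandas as pd

# Hypothetical toy data; the color names mirror the columns discussed above.
toy = pd.DataFrame({
    "length": [10, 12, 9, 11],
    "color": ["pink gold", "space gray", "light black", "purple polka dot"],
})

# One-hot encoding: one indicator column per distinct color.
one_hot = pd.get_dummies(toy, columns=["color"], dtype=int)
color_columns = [c for c in one_hot.columns if c.startswith("color_")]

# Arbitrary integer encoding, for comparison only.
codes = {"light black": 1, "pink gold": 2, "purple polka dot": 3, "space gray": 4}
as_int = toy["color"].map(codes)

# With integer codes, "space gray" looks three units away from "light black"
# but only one unit away from "purple polka dot": an ordering we invented.
gap_black_gray = abs(codes["space gray"] - codes["light black"])
gap_polka_gray = abs(codes["space gray"] - codes["purple polka dot"])

# With one-hot vectors, any two different colors differ in exactly two columns,
# so no color is artificially "closer" to another.
row_gray = one_hot.loc[1, color_columns]
row_black = one_hot.loc[2, color_columns]
row_polka = one_hot.loc[3, color_columns]
diff_black_gray = int((row_gray != row_black).sum())
diff_polka_gray = int((row_gray != row_polka).sum())

print(one_hot)
print(gap_black_gray, gap_polka_gray)
print(diff_black_gray, diff_polka_gray)
```

In the integer version the two gaps differ, although nothing about the colors justifies that. In the one-hot version both differences are the same, which is the property the paragraph above argues for.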
