官术网_书友最值得收藏!

One-hot encoding

Most of the machine learning algorithms can't work with the categorical variables, so usually we want to convert them to the one-hot vectors (statisticians prefer to call them dummy variables). Let's convert first, and then I will explain what this is:

In []: 
features = pd.get_dummies(features, columns = ['color']) 
features.head() 
Out[]: 

So now, instead of one column, color, we have four columns: color_light black, color_pink gold, color_purple polka dot, and color_space gray. The color of each sample is encoded as 1 in the corresponding column. Why do we need this if we could simply replace colors with the numbers from 1 to 4? Well, this is the problem: why to prefer 1 to 4 over the 4 to 1, or powers of 2, or prime numbers? These colors on their own don't carry any quantitative information associated to them. They can't be sorted from the largest to the smallest. If we introduce this information artificially, the machine learning algorithm may attempt to utilize that meaningless information, and we will end up with the classifier that sees regularities where there are none.

主站蜘蛛池模板: 行唐县| 峨眉山市| 娱乐| 托里县| 陵川县| 隆子县| 景德镇市| 贺兰县| 北安市| 铜川市| 寿光市| 延安市| 通化县| 德惠市| 竹山县| 家居| 淮北市| 东源县| 略阳县| 平乡县| 二连浩特市| 元朗区| 泗阳县| 墨玉县| 犍为县| 开远市| 民乐县| 温宿县| 茌平县| 驻马店市| 屏南县| 凤城市| 苏尼特右旗| 社会| 会宁县| 南江县| 吉安县| 屯留县| 佳木斯市| 微博| 焦作市|