
One-hot encoding

Most machine learning algorithms can't work with categorical variables directly, so we usually convert them into one-hot vectors (statisticians prefer to call them dummy variables). Let's do the conversion first, and then I'll explain what it means:

In []: 
features = pd.get_dummies(features, columns=['color'])
features.head()
Out[]: 

So now, instead of a single color column, we have four columns: color_light black, color_pink gold, color_purple polka dot, and color_space gray. The color of each sample is encoded as a 1 in the corresponding column. Why do we need this if we could simply replace the colors with numbers from 1 to 4? Well, that's exactly the problem: why prefer 1 to 4 over 4 to 1, or powers of 2, or prime numbers? These colors carry no quantitative information on their own; they can't be sorted from largest to smallest. If we introduce such an ordering artificially, the machine learning algorithm may try to exploit that meaningless information, and we will end up with a classifier that sees regularities where there are none.
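
To see the effect in isolation, here is a minimal, self-contained sketch of the same transformation on a tiny hypothetical DataFrame (the toy variable and its four rows are invented for illustration; they are not the book's features table):

In []: 
import pandas as pd

# Hypothetical toy data: four samples, one categorical column.
toy = pd.DataFrame({'color': ['light black', 'pink gold',
                              'purple polka dot', 'space gray']})

# get_dummies replaces the single 'color' column with one indicator
# column per distinct value (set in the matching column, unset elsewhere).
toy_encoded = pd.get_dummies(toy, columns=['color'])
toy_encoded.head()

Each row has exactly one of the four color_* columns switched on, which is what "one-hot" means, and there is no implied ordering between the columns. Note that recent pandas versions return boolean indicator columns by default; you can pass dtype=int to get_dummies if you prefer 0/1 integers.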
