
Feature selection

The number of explanatory features (input variables) of a sample can be enormous: a training sample (observation/example) is xi = (xi1, xi2, xi3, ..., xid), and d can be very large. An example of this is a document classification task, where the vocabulary may contain 10,000 different words and each input variable is the number of occurrences of one of those words.

This enormous number of input variables can be problematic, and sometimes a curse, because we have many input variables and few training samples to guide the learning procedure. To avoid this curse of having an enormous number of input variables (the curse of dimensionality), data scientists use dimensionality reduction techniques to select a subset of the input variables. For example, in the text classification task they can do the following (a combined sketch of all three appears after the list):

  • Extracting relevant inputs (for instance, using a mutual information measure)
  • Principal component analysis (PCA)
  • Grouping (clustering) similar words, using a similarity measure
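
The following is a minimal sketch of these three approaches, assuming scikit-learn is available; the corpus, labels, and all parameter values (k=5 words, 2 components, 3 clusters) are toy placeholders chosen for illustration, not taken from the text:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical toy documents and labels (1 = spam, 0 = not spam)
corpus = [
    "cheap deal buy now",
    "meeting schedule project deadline",
    "buy cheap pills now",
    "project meeting notes attached",
]
labels = np.array([1, 0, 1, 0])

# Bag-of-words matrix: one input variable per word, counting occurrences
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
words = vectorizer.get_feature_names_out()

# 1. Extract relevant inputs: score each word by its mutual information
#    with the class label and keep the top k words
mi = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
top_k = np.argsort(mi)[::-1][:5]
print("Most informative words:", words[top_k])

# 2. PCA: project the d-dimensional count vectors onto a few components
X_reduced = PCA(n_components=2).fit_transform(X)

# 3. Group similar words: cluster the columns of X, so words with similar
#    occurrence patterns across documents land in the same group
word_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X.T)

Each approach reduces d on its own; in practice you would pick one (or combine feature selection with PCA) depending on the task and the number of training samples available.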