
Feature selection

The number of explanatory features (input variables) of a sample can be enormous: a training sample (observation/example) is xi = (xi1, xi2, xi3, ..., xid), where d is very large. An example of this is a document classification task in which you have a vocabulary of 10,000 different words, and the input variables are the numbers of occurrences of those words.
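As a small illustration of such a representation, the sketch below builds one word-count input vector for a toy document. The vocabulary and document here are made-up examples; a real task would use a vocabulary of thousands of words, giving a very high-dimensional vector:

```python
from collections import Counter

# Hypothetical toy vocabulary; in a real document classification task
# this could contain ~10,000 words, so d = len(vocabulary) is very large.
vocabulary = ["machine", "learning", "data", "model"]
document = "machine learning uses data and a model and more data"

counts = Counter(document.split())
# x_i: one training sample, a vector of word occurrence counts
x_i = [counts[word] for word in vocabulary]
print(x_i)  # [1, 1, 2, 1]
```

Each position in `x_i` corresponds to one input variable, so the dimensionality d grows with the vocabulary, not with the document length.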

This enormous number of input variables can be problematic, and sometimes a curse, because we have many input variables and few training samples to help us in the learning procedure. To avoid this curse of having an enormous number of input variables (the curse of dimensionality), data scientists use dimensionality reduction techniques to select a subset of the input variables. For example, in the text classification task they can do the following:

  • Extracting relevant inputs (for instance, using a mutual information measure)
  • Applying principal component analysis (PCA)
  • Grouping (clustering) similar words using a similarity measure
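Of the techniques above, PCA is the easiest to sketch directly. The following is a minimal PCA reduction on made-up random data (the data, dimensions, and choice of k are assumptions for illustration): center the features, eigendecompose the covariance matrix, and project onto the top-k components.

```python
import numpy as np

# Toy data: 50 samples with d = 10 input variables (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

X_centered = X - X.mean(axis=0)          # center each input variable
cov = np.cov(X_centered, rowvar=False)   # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]        # sort components by explained variance

k = 3                                    # keep the top-3 principal components
components = eigvecs[:, order[:k]]

X_reduced = X_centered @ components      # shape (50, 3): reduced representation
print(X_reduced.shape)
```

The reduced matrix keeps the directions of greatest variance, so each sample now has k = 3 input variables instead of d = 10, which eases learning when training samples are scarce.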