
Feature selection and filtering

An unnormalized dataset with many features contains an amount of information proportional to the independence of all features and to their variance. Let's consider a small dataset with three features, generated with random Gaussian distributions:
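For instance, a minimal NumPy sketch that generates such a dataset (the seed, means, and standard deviations here are assumptions, chosen so that the third feature is almost constant):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1000)
nb_samples = 300

# Three Gaussian features: two with large variance, one almost constant
X = np.zeros(shape=(nb_samples, 3))
X[:, 0] = np.random.normal(0.0, 3.0, size=nb_samples)
X[:, 1] = np.random.normal(0.0, 3.5, size=nb_samples)
X[:, 2] = np.random.normal(1.0, 0.7, size=nb_samples)

# Plotting each feature against the sample index shows the third one
# as an almost flat central line
plt.plot(X)
plt.show()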

Even without further analysis, it's obvious that the central line (the one with the lowest variance) is almost constant and doesn't provide any useful information. If you remember the previous chapter, its entropy H(X) is quite small, while the other two variables carry more information. A variance threshold is, therefore, a useful approach for removing all the elements whose contribution (in terms of variability and, therefore, of information) is below a predefined level. scikit-learn provides the class VarianceThreshold, which can easily solve this problem. Applying it to the previous dataset, we get the following result:

from sklearn.feature_selection import VarianceThreshold

>>> X[0:3, :]
array([[-3.5077778 , -3.45267063,  0.9681903 ],
       [-3.82581314,  5.77984656,  1.78926338],
       [-2.62090281, -4.90597966,  0.27943565]])

>>> vt = VarianceThreshold(threshold=1.5)
>>> X_t = vt.fit_transform(X)

>>> X_t[0:3, :]
array([[-0.53478521, -2.69189452],
       [-5.33054034, -1.91730367],
       [-1.17004376,  6.32836981]])

The third feature has been completely removed because its variance is under the selected threshold (1.5 in this case).
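After fitting, the VarianceThreshold instance also exposes the computed per-feature variances and a boolean selection mask, so it's easy to check which columns survived (the actual values depend on the generated data, so the output is omitted here):

>>> vt.variances_    # variance of each original feature
>>> vt.get_support() # boolean mask, True for the features that were kept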

There are also many univariate methods that can be used to select the best features according to specific criteria based on F-tests and p-values, such as chi-square or ANOVA. However, a full discussion is beyond the scope of this book; the reader can find further information in Freedman D., Pisani R., Purves R., Statistics, Norton & Company.
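Even without going into the theory, the underlying statistics are easy to obtain: for example, scikit-learn's f_classif score function returns the ANOVA F-statistic and the associated p-value for each feature (the choice of the Iris dataset in this sketch is ours):

from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

>>> iris = load_iris()
>>> F, p = f_classif(iris.data, iris.target) # one F-score and one p-value per feature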

Two examples of feature selection that use the classes SelectKBest (which selects the K highest-scoring features) and SelectPercentile (which keeps only the features whose scores fall within a given top percentile) are shown next. They can be applied to both regression and classification datasets, taking care to select an appropriate score function:

from sklearn.datasets import load_boston, load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_regression

>>> regr_data = load_boston()
>>> regr_data.data.shape
(506, 13)

>>> kb_regr = SelectKBest(f_regression)
>>> X_b = kb_regr.fit_transform(regr_data.data, regr_data.target)

>>> X_b.shape
(506, 10)

>>> kb_regr.scores_
array([  88.15124178,   75.2576423 ,  153.95488314,   15.97151242,
        112.59148028,  471.84673988,   83.47745922,   33.57957033,
         85.91427767,  141.76135658,  175.10554288,   63.05422911,
        601.61787111])

>>> class_data = load_iris()
>>> class_data.data.shape
(150, 4)

>>> perc_class = SelectPercentile(chi2, percentile=15)
>>> X_p = perc_class.fit_transform(class_data.data, class_data.target)

>>> X_p.shape
(150, 1)

>>> perc_class.scores_
array([ 10.81782088, 3.59449902, 116.16984746, 67.24482759])
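Like VarianceThreshold, both selectors expose a get_support() method that maps the scores back to the original columns; when the score function also computes p-values (as chi2 does), they are stored in the pvalues_ attribute. For example:

>>> perc_class.get_support() # True only for the selected feature(s)
>>> perc_class.pvalues_      # p-values paired with the chi-square scores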

For further details about all scikit-learn score functions and their usage, visit http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection.