- Machine Learning Algorithms
- Giuseppe Bonaccorso
Feature selection and filtering
An unnormalized dataset with many features contains an amount of information proportional to the independence of all the features and to their variance. Let's consider a small dataset with three features, generated with random Gaussian distributions:
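A minimal sketch of how such a dataset could be generated is shown below; the specific means and standard deviations are illustrative assumptions (two features vary widely, while the third one is almost constant):

import numpy as np

# Illustrative generation of three Gaussian features with different variances
# (the exact parameters are assumptions, not the book's original values)
np.random.seed(1000)

nb_samples = 300
X = np.zeros(shape=(nb_samples, 3))

X[:, 0] = np.random.normal(loc=-3.0, scale=2.5, size=nb_samples)
X[:, 1] = np.random.normal(loc=0.0, scale=4.0, size=nb_samples)
X[:, 2] = np.random.normal(loc=1.0, scale=0.6, size=nb_samples)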

Even without further analysis, it's obvious that the central line (the one with the lowest variance) is almost constant and doesn't provide any useful information. If you remember the previous chapter, its entropy H(X) is quite small, while the other two variables carry more information. A variance threshold is, therefore, a useful approach to remove all the elements whose contribution (in terms of variability and, therefore, of information) is below a predefined level. scikit-learn provides the VarianceThreshold class, which can easily solve this problem. By applying it to the previous dataset, we get the following result:
from sklearn.feature_selection import VarianceThreshold

>>> X[0:3, :]
array([[-3.5077778 , -3.45267063,  0.9681903 ],
       [-3.82581314,  5.77984656,  1.78926338],
       [-2.62090281, -4.90597966,  0.27943565]])
>>> vt = VarianceThreshold(threshold=1.5)
>>> X_t = vt.fit_transform(X)
>>> X_t[0:3, :]
array([[-0.53478521, -2.69189452],
       [-5.33054034, -1.91730367],
       [-1.17004376,  6.32836981]])
The third feature has been completely removed because its variance is under the selected threshold (1.5 in this case).
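To verify which columns have been kept, the fitted selector can be queried through scikit-learn's standard attributes; the following lines are a small sketch (the actual values depend on the generated data, so no output is reported):
>>> vt.variances_                  # per-feature variances computed during fit
>>> vt.get_support()               # boolean mask of the retained features
>>> vt.get_support(indices=True)   # indices of the retained columns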
There are also many univariate methods that can be used to select the best features according to specific criteria based on F-tests and p-values, such as the chi-square or ANOVA tests. However, their discussion is beyond the scope of this book, and the reader can find further information in Freedman D., Pisani R., Purves R., Statistics, Norton & Company.
Two examples of feature selection that use the SelectKBest class (which selects the K highest-scoring features) and the SelectPercentile class (which keeps only the subset of features belonging to a certain top percentile) are shown next. Both can be applied to regression and classification datasets, taking care to select an appropriate score function:
from sklearn.datasets import load_boston, load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_regression

>>> regr_data = load_boston()
>>> regr_data.data.shape
(506, 13)
>>> kb_regr = SelectKBest(f_regression)
>>> X_b = kb_regr.fit_transform(regr_data.data, regr_data.target)
>>> X_b.shape
(506, 10)
>>> kb_regr.scores_
array([  88.15124178,   75.2576423 ,  153.95488314,   15.97151242,
        112.59148028,  471.84673988,   83.47745922,   33.57957033,
         85.91427767,  141.76135658,  175.10554288,   63.05422911,
        601.61787111])
>>> class_data = load_iris()
>>> class_data.data.shape
(150, 4)
>>> perc_class = SelectPercentile(chi2, percentile=15)
>>> X_p = perc_class.fit_transform(class_data.data, class_data.target)
>>> X_p.shape
(150, 1)
>>> perc_class.scores_
array([  10.81782088,    3.59449902,  116.16984746,   67.24482759])
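Both selectors also expose a support mask and the p-values associated with their scores; as a sketch, these can be used to map the retained columns back to the original features (for example, to the Iris feature names):
>>> import numpy as np
>>> kb_regr.get_support(indices=True)    # indices of the 10 Boston columns kept by SelectKBest
>>> perc_class.pvalues_                  # p-values associated with the chi-square scores
>>> np.array(class_data.feature_names)[perc_class.get_support()]   # names of the retained Iris features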