- Machine Learning Algorithms
- Giuseppe Bonaccorso
- 2021-07-02 18:53:30
Managing missing features
A dataset sometimes contains missing feature values. There are a few options for handling them:
- Removing the whole row (sample)
- Creating a sub-model to predict those features
- Using an automatic strategy to impute them according to the other known values
The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult, because it's necessary to determine a supervised strategy to train a model for each feature and then predict the missing values. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer, which fills the holes using a strategy based on the mean (default choice), median, or frequency (the most frequent entry in a column is used for all the missing ones in that column). Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; in recent releases the same role is played by SimpleImputer in sklearn.impute, which accepts the same three strategies.
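Before turning to imputation, the first two options can be sketched briefly. The snippet below is only an illustration under stated assumptions: the toy array, the choice of LinearRegression as the sub-model, and the variable names are all hypothetical and do not come from the text. It drops incomplete rows with a NumPy mask, then trains a regressor on the complete rows to predict the missing entries of one column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the third feature is the sum of the first two,
# and two of its entries are missing.
data = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 3.0],
    [3.0, 3.0, np.nan],
    [4.0, 1.0, 5.0],
    [0.0, 2.0, np.nan],
    [2.0, 2.0, 4.0],
])

# Option 1: remove every row that contains at least one NaN.
complete_rows = ~np.isnan(data).any(axis=1)
data_dropped = data[complete_rows]

# Option 2: train a sub-model on the complete rows to predict the
# missing feature (column 2) from the known ones (columns 0 and 1).
# LinearRegression is an arbitrary choice made for this sketch.
target_col = 2
known_cols = [0, 1]
missing = np.isnan(data[:, target_col])

model = LinearRegression()
model.fit(data[~missing][:, known_cols], data[~missing, target_col])

# Fill only the missing entries with the model's predictions.
data_predicted = data.copy()
data_predicted[missing, target_col] = model.predict(data[missing][:, known_cols])
```

With several incomplete features, the second approach needs one such model per feature, which is why it is usually the most laborious of the three.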
The following snippet shows an example using the three strategies. The default placeholder for a missing feature entry is NaN; however, it's possible to use a different placeholder through the parameter missing_values:
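>>> import numpy as np
>>> # Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
>>> from sklearn.impute import SimpleImputer
>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])
>>> # Replace each NaN with the mean of its column
>>> imp = SimpleImputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])
>>> # Replace each NaN with the median of its column
>>> imp = SimpleImputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])
>>> # Replace each NaN with the most frequent value of its column
>>> imp = SimpleImputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])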
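The missing_values parameter mentioned earlier can be illustrated with a short sketch. It assumes a recent scikit-learn release, where SimpleImputer (in sklearn.impute) takes the place of Imputer, and it uses -999 as a purely hypothetical sentinel for missing entries; any value that cannot occur as real data would do.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as in the previous example, but missing entries are
# marked with -999 (a hypothetical placeholder) instead of NaN.
data = np.array([[1, -999, 2], [2, 3, -999], [-1, 4, 2]], dtype=float)

# Tell the imputer which value denotes a missing entry.
imp = SimpleImputer(missing_values=-999.0, strategy='mean')
filled = imp.fit_transform(data)
```

The sentinel is excluded when the column means are computed, so the result should match the NaN-based mean example.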