
Managing missing features

A dataset can contain samples with missing feature values. There are a few options for handling them:

  • Removing the whole row
  • Creating a sub-model to predict those features
  • Using an automatic strategy to impute them according to the other known values
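The first option in the list can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming that missing entries are encoded as NaN; the names complete_rows and cleaned are chosen here for the example.

```python
import numpy as np

data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Boolean mask: True for rows that contain no NaN in any column
complete_rows = ~np.isnan(data).any(axis=1)

# Keep only the complete rows and discard the others
cleaned = data[complete_rows]
print(cleaned)
```

On this tiny array only the last of the three rows survives, which shows how much data this approach can throw away.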

The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult, because it requires choosing a supervised strategy, training a model for each affected feature and, finally, predicting the missing values. Weighing the pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer (replaced by SimpleImputer in the sklearn.impute module since version 0.20), which fills the holes using a strategy based on the mean (the default choice), the median, or the frequency (the most frequent entry in a column is used for all of that column's missing ones).
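To give an idea of what the second option involves, the sketch below trains a small regression sub-model for a single feature. It is an illustration under several assumptions, not a recommended recipe: the toy data is made up, only one column has missing entries, the other columns are complete, and LinearRegression is an arbitrary choice of model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the third column equals 2*x0 + x1,
# and two of its entries are missing
X = np.array([
    [1.0, 2.0, 4.0],
    [2.0, 1.0, 5.0],
    [3.0, 3.0, 9.0],
    [4.0, 0.0, 8.0],
    [0.0, 5.0, np.nan],
    [2.0, 2.0, np.nan],
])

target_col = 2
missing = np.isnan(X[:, target_col])
other_cols = [c for c in range(X.shape[1]) if c != target_col]

# Train the sub-model only on the rows where the target feature is known
model = LinearRegression()
model.fit(X[~missing][:, other_cols], X[~missing, target_col])

# Predict the missing entries from the other (complete) features
X_filled = X.copy()
X_filled[missing, target_col] = model.predict(X[missing][:, other_cols])
print(X_filled)
```

Even in this simple case, one model per incomplete feature has to be chosen, trained, and validated, which is why this option is the most laborious of the three.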

The following snippet shows an example using the three strategies. The default placeholder for a missing feature entry is NaN, but a different one can be set through the parameter missing_values:

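>>> import numpy as np
>>> # Imputer was removed in scikit-learn 0.22; SimpleImputer is aliased to it here
>>> from sklearn.impute import SimpleImputer as Imputer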

>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

>>> imp = Imputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])
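
The parameter missing_values mentioned above can be illustrated with a short sketch. It assumes a recent scikit-learn release, where SimpleImputer from sklearn.impute plays the role of Imputer, and a hypothetical dataset in which -1 (rather than NaN) marks a missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: -1 is used as the placeholder for a missing entry
data = np.array([[1, -1, 2], [2, 3, -1], [5, 4, 2]], dtype=float)

# Tell the imputer which value marks a hole, then fill it with the column mean
imp = SimpleImputer(missing_values=-1.0, strategy='mean')
filled = imp.fit_transform(data)
print(filled)
```

The placeholder must never occur as a legitimate value in the data, otherwise valid entries would be overwritten as well.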