
Managing missing features

A dataset sometimes contains missing features. There are a few options for handling them:

  • Removing the whole row (sample)
  • Creating a sub-model to predict those features
  • Using an automatic strategy to impute them according to the other known values

The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it's necessary to determine a supervised strategy to train a model for each feature and, finally, to predict the missing values. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer, which fills the holes column by column using a strategy based on the mean (default choice), the median, or the frequency (the most frequent entry in a column is used for all the missing ones in that column).
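The first two options can be sketched with NumPy and a small regression model. This is only an illustrative sketch: the toy array X, the choice of LinearRegression as the sub-model, and the variable names are assumptions made for the example, not something prescribed by the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy dataset: the third feature is missing in two rows
X = np.array([[1.0, 2.0, 5.0],
              [2.0, 1.0, 4.0],
              [3.0, 3.0, np.nan],
              [4.0, 2.0, 8.0],
              [5.0, 5.0, np.nan]])

# Option 1: drop every row that contains at least one NaN
complete_rows = ~np.isnan(X).any(axis=1)
X_dropped = X[complete_rows]

# Option 2: train a sub-model on the complete rows to predict the third
# feature from the first two, then fill the holes with its predictions
model = LinearRegression()
model.fit(X[complete_rows, :2], X[complete_rows, 2])

X_filled = X.copy()
X_filled[~complete_rows, 2] = model.predict(X[~complete_rows, :2])

print(X_dropped)
print(X_filled)
```

Dropping rows discards two of the five samples here, which shows why this option is reasonable only on large datasets. The sub-model keeps every sample, but it needs one trained model per feature with missing values.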

The following snippet shows an example using the three approaches. The default placeholder for a missing feature entry is NaN; however, it's possible to use a different placeholder through the parameter missing_values:

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer

>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

>>> imp = Imputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])
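In scikit-learn 0.22 and later, Imputer is no longer available in sklearn.preprocessing; the corresponding class is SimpleImputer in sklearn.impute. The following is a minimal sketch of the same three strategies, assuming a recent scikit-learn version. It also illustrates the missing_values parameter with a hypothetical placeholder of -999 (the array data_flagged and that placeholder value are assumptions made for the example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Apply the three strategies to the same array; NaN is the default placeholder
results = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imp = SimpleImputer(strategy=strategy)
    results[strategy] = imp.fit_transform(data)
    print(strategy)
    print(results[strategy])

# Same data, but with missing entries flagged by -999 instead of NaN
# (hypothetical placeholder chosen for illustration)
data_flagged = np.array([[1.0, -999.0, 2.0],
                         [2.0, 3.0, -999.0],
                         [-1.0, 4.0, 2.0]])
imp = SimpleImputer(missing_values=-999.0, strategy='mean')
filled = imp.fit_transform(data_flagged)
print(filled)
```

With the 'most_frequent' strategy, the second column contains 3 and 4 once each; in that case the smallest of the tied values is used, which is why the missing entry becomes 3.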