
Managing missing features

A dataset sometimes contains missing feature values. There are a few options for handling them:

  • Removing the whole row (sample)
  • Creating a sub-model to predict those features
  • Using an automatic strategy to impute them according to the other known values

The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it requires a supervised strategy to train a model for each feature and then predict the missing values. Considering all the pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer, which fills the holes (by default, column by column) using a strategy based on the mean (the default choice), the median, or the frequency (the most frequent entry is used for all the missing ones).
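To make the three strategies concrete, here is a minimal pure-Python sketch of column-wise imputation. The helper impute_columns is hypothetical and written only for illustration; it is not scikit-learn code, and it assumes every column has at least one known value. Ties in the most-frequent case are broken by taking the smallest value, which is an assumption of this sketch.

```python
import math
import statistics
from collections import Counter


def impute_columns(rows, strategy="mean"):
    """Fill NaN entries column by column (hypothetical helper for illustration)."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        # Only the known (non-NaN) values of a column contribute to its statistic
        known = [r[j] for r in rows if not math.isnan(r[j])]
        if strategy == "mean":
            fill.append(statistics.mean(known))
        elif strategy == "median":
            fill.append(statistics.median(known))
        elif strategy == "most_frequent":
            counts = Counter(known)
            top = max(counts.values())
            # Assumption of this sketch: on a tie, take the smallest value
            fill.append(min(v for v, c in counts.items() if c == top))
        else:
            raise ValueError("unknown strategy: %r" % strategy)
    # Replace each NaN with the statistic computed for its column
    return [[fill[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]


nan = float("nan")
rows = [[1, nan, 2], [2, 3, nan], [-1, 4, 2]]

filled = {s: impute_columns(rows, s)
          for s in ("mean", "median", "most_frequent")}
print(filled["mean"])  # [[1, 3.5, 2], [2, 3, 2], [-1, 4, 2]]
```

Each missing entry is replaced by a statistic of its own column, so the rows that contain holes are kept rather than discarded.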

The following snippet shows an example using the three approaches. The default value for a missing feature entry is NaN, but it's possible to use a different placeholder through the parameter missing_values:

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer

>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

>>> imp = Imputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])
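
In scikit-learn 0.22 and later, the Imputer class is no longer available in sklearn.preprocessing; the class SimpleImputer in sklearn.impute provides the same three strategies. The following sketch repeats the example with that class and also shows the missing_values parameter with a custom placeholder. The array data_marked and the choice of -1 as the placeholder are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Same three strategies as in the snippet above, using the newer class
results = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imp.fit_transform(data)
    print(strategy)
    print(results[strategy])

# Custom placeholder: in this illustrative array, -1 marks a missing entry
data_marked = np.array([[1.0, -1.0, 2.0],
                        [2.0, 3.0, -1.0],
                        [5.0, 4.0, 2.0]])
imp = SimpleImputer(missing_values=-1, strategy="mean")
filled_marked = imp.fit_transform(data_marked)
print(filled_marked)
```

A numeric placeholder is only safe when that value can never occur as a legitimate measurement. In the original data array, -1 is a real value, which is why a separate array is used here.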