Managing missing features

A dataset sometimes contains missing features. There are a few options for handling them:

  • Removing the whole row
  • Creating a sub-model to predict those features
  • Using an automatic strategy to impute them according to the other known values

The first option is the most drastic one and should be considered only when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it's necessary to determine a supervised strategy to train a model for each feature and, finally, to predict their values. Considering all the pros and cons, the third option is likely to be the best choice. scikit-learn offers the class Imputer, which is responsible for filling the holes using a strategy based on the mean (the default choice), the median, or the frequency (the most frequent entry in a column is used for all the missing ones). Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; in recent versions, the equivalent class is SimpleImputer in the sklearn.impute module.
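For comparison, the first option (removing every row that has a missing entry) can be sketched with plain NumPy. This is only an illustrative sketch on the small toy array used in this section's example; the variable names complete_rows and cleaned are assumptions introduced here, not part of any library.

```python
import numpy as np

# Toy dataset with two missing entries (NaN)
data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Boolean mask: True for the rows that contain no NaN at all
complete_rows = ~np.isnan(data).any(axis=1)

# Keep only the complete rows
cleaned = data[complete_rows]

print(cleaned)
print("Rows kept:", cleaned.shape[0], "of", data.shape[0])
```

On this array only one row out of three survives, which shows why dropping rows is reasonable only when the dataset is large enough to absorb the loss.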

The following snippet shows an example using the three strategies. The default placeholder for a missing feature entry is NaN; however, it's possible to use a different placeholder through the parameter missing_values:

>>> import numpy as np
>>> # Imputer is available only in scikit-learn versions before 0.22
>>> from sklearn.preprocessing import Imputer

>>> data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

>>> imp = Imputer(strategy='mean')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='median')
>>> imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

>>> imp = Imputer(strategy='most_frequent')
>>> imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])
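
On scikit-learn 0.22 and later, where Imputer is no longer available, the same three strategies can be reproduced with SimpleImputer from sklearn.impute. The following is a minimal sketch under that assumption; the second array, which uses -1 as a missing-value placeholder, is invented here purely to illustrate the missing_values parameter.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Apply each strategy column-wise; NaN is the default placeholder
results = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy)
    results[strategy] = imp.fit_transform(data)
    print(strategy)
    print(results[strategy])

# Hypothetical dataset where -1 (instead of NaN) marks a missing entry
data_placeholder = np.array([[1.0, -1.0, 2.0],
                             [2.0, 3.0, -1.0],
                             [5.0, 4.0, 2.0]])

imp = SimpleImputer(missing_values=-1, strategy="mean")
filled = imp.fit_transform(data_placeholder)
print(filled)
```

A custom placeholder is only safe when that value can never occur as a legitimate measurement; otherwise real entries would be overwritten too.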