Missing values

Data aggregation, extraction, and consolidation are rarely perfect and sometimes result in missing values. There are several common strategies for dealing with missing values in datasets:

  • Removing all the rows with missing values from the dataset. This is simple to apply, but you may end up throwing away a big chunk of information that would have been valuable to your model.
  • Using models that can often handle missing values natively, such as decision tree-based models (random forests, boosted trees). Unfortunately, the linear regression model, and by extension the SGD algorithm, cannot work with missing values (http://facweb.cs.depaul.edu/sjost/csc423/documents/missing_values.pdf).
  • Imputing the missing data with replacement values; for example, replacing missing values with the median, the average, or the harmonic mean of all the existing values, or using clustering or linear regression to predict the missing values. It can also be useful to record in the dataset that these values were missing in the first place.
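The first and third strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended pipeline: the records, the column names, and the helper functions `drop_incomplete` and `impute_median` are all hypothetical, and the median is only one of the possible replacement values listed above.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 22.0, "fare": 7.25},
    {"age": None, "fare": 71.28},
    {"age": 35.0, "fare": None},
    {"age": 58.0, "fare": 26.55},
    {"age": None, "fare": 8.05},
]

def drop_incomplete(records):
    """Strategy 1: keep only the records with no missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute_median(records, columns):
    """Strategy 3: replace each missing value with its column's median."""
    # The median is computed from the observed (non-missing) values only.
    medians = {
        c: median(r[c] for r in records if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else medians[c]) for c in columns}
        for r in records
    ]

complete_rows = drop_incomplete(rows)
imputed_rows = impute_median(rows, ["age", "fare"])

print(len(rows), len(complete_rows))  # rows before and after dropping
print(imputed_rows[1])                # a record whose age was imputed
```

On this toy data, dropping incomplete rows discards three of the five records, which shows how quickly the first strategy can throw information away, while imputation keeps every record.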

In the end, the right strategy depends on the type of missing data and, of course, on the context. Replacing missing blood pressure readings in a patient's medical record with some average may not be acceptable in a healthcare context, whereas replacing missing age values with the average age in the Titanic dataset is perfectly acceptable in a data science competition.

Amazon ML's documentation, however, is not entirely clear about the strategy it uses to deal with missing values:

If the target attribute is present in the record, but a value for another numeric attribute is missing, then Amazon ML overlooks the missing value. In this case, Amazon ML creates a substitute attribute and sets it to 1 to indicate that this attribute is missing.

In other words, when a value is missing, a new column is created with a Boolean flag indicating that the value was missing in the first place. It is not clear, however, whether the whole row (the sample) is discarded or whether only the missing cell is ignored. There is no mention of any type of imputation.
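To make the idea of a substitute attribute concrete, here is a minimal sketch of what such a flag column could look like. It only illustrates the concept and says nothing about how Amazon ML implements it internally: the records, the `add_missing_flag` helper, and the `_missing` suffix are assumptions made for this example.

```python
# Hypothetical records; None marks a missing numeric value.
rows = [
    {"age": 22.0, "survived": 0},
    {"age": None, "survived": 1},
    {"age": 35.0, "survived": 1},
]

def add_missing_flag(records, column):
    """Return copies of the records with an extra <column>_missing flag.

    The flag is 1 when the value is missing and 0 otherwise; the original
    value is left untouched, so no imputation takes place here.
    """
    flag = column + "_missing"
    return [{**r, flag: int(r[column] is None)} for r in records]

flagged = add_missing_flag(rows, "age")
for record in flagged:
    print(record)
```

Because the flag is added next to the original column rather than in place of it, the decision about what to do with the missing cell itself (drop it, ignore it, or impute it) remains open, which is exactly the point the documentation leaves unclear.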
