- Effective Amazon Machine Learning
- Alexis Perrier
- 2021-07-03 00:17:50
Missing values
Data aggregation, extraction, and consolidation are often imperfect and sometimes result in missing values. There are several common strategies for dealing with missing values in datasets:
- Removing all the rows with missing values from the dataset. This is simple to apply, but you may end up throwing away a big chunk of information that would have been valuable to your model.
- Using models that are, by nature, not impacted by missing values, such as decision tree-based models (random forests, boosted trees). Unfortunately, the linear regression model, and by extension the SGD algorithm, does not work with missing values (http://facweb.cs.depaul.edu/sjost/csc423/documents/missing_values.pdf).
- Imputing the missing data with replacement values; for example, replacing missing values with the median, the average, or the harmonic mean of all the existing values, or using clustering or linear regression to predict the missing values. It may be interesting to add the information that these values were missing in the first place to the dataset.
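The first and third strategies above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy records, the `age` and `fare` column names, and the helper functions `drop_incomplete` and `impute_median` are hypothetical and are not part of Amazon ML.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 22.0, "fare": 7.25},
    {"age": None, "fare": 71.28},
    {"age": 26.0, "fare": None},
    {"age": 35.0, "fare": 53.10},
]


def drop_incomplete(records):
    """Row removal: keep only the rows that have no missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute_median(records, column):
    """Imputation: fill missing values in `column` with the median of the
    observed values, and add a 0/1 flag column recording which values
    were missing in the first place."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    result = []
    for r in records:
        missing = r[column] is None
        new_row = dict(r)  # copy, so the input records stay untouched
        new_row[column] = fill if missing else r[column]
        new_row[column + "_missing"] = 1 if missing else 0
        result.append(new_row)
    return result


complete = drop_incomplete(rows)      # two of the four rows survive
imputed = impute_median(rows, "age")  # all four rows kept, age filled in
```

Note how row removal discards half of this small dataset, while imputation keeps every row at the cost of introducing an estimated value.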
In the end, the right strategy depends on the type of missing data and, of course, on the context. Replacing missing blood pressure readings in a patient's medical record with an average may not be acceptable in a healthcare context, whereas replacing missing age values with the average age in the Titanic dataset is perfectly acceptable in a data science competition.
However, Amazon ML's documentation is not 100% clear on the strategy used to deal with missing values:
If the target attribute is present in the record, but a value for another numeric attribute is missing, then Amazon ML overlooks the missing value. In this case, Amazon ML creates a substitute attribute and sets it to 1 to indicate that this attribute is missing.
In other words, when a value is missing, a new column is created with a Boolean flag indicating that the value was missing in the first place. However, it is not clear whether the whole row (sample) is discarded or whether only the missing cell is ignored. There is no mention of any type of imputation.
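To make the ambiguity concrete, the two readings of the documentation can be written out side by side. This sketch does not describe what Amazon ML actually does, since the documentation confirms neither reading; the records and the function names `keep_row_with_flag` and `drop_row` are hypothetical.

```python
# Hypothetical records: the target is present, one numeric attribute is missing.
records = [
    {"target": 1, "age": 22.0},
    {"target": 0, "age": None},
]


def keep_row_with_flag(rows, column):
    """Reading A: the row is kept, the empty cell is left as is, and a
    substitute attribute set to 1 marks the value as missing."""
    return [
        {**r, column + "_missing": 1 if r[column] is None else 0}
        for r in rows
    ]


def drop_row(rows, column):
    """Reading B: the whole row with the missing value is discarded."""
    return [r for r in rows if r[column] is not None]


reading_a = keep_row_with_flag(records, "age")  # both rows, plus a flag column
reading_b = drop_row(records, "age")            # only the complete row
```

The two readings lead to different training sets, which is why the lack of precision in the documentation matters.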