- Effective Amazon Machine Learning
- Alexis Perrier
- 2021-07-03 00:17:51
Detecting outliers
Given a variable, outliers are values that are very distant from other values of that variable. Outliers are quite common, and often caused by human or measurement errors. Outliers can strongly derail a model.
To demonstrate, let's look at two simple datasets and see how their mean is influenced by the presence of an outlier.
Consider two small datasets: A = [1, 2, 3, 4] and B = [1, 2, 3, 4, 100]. The fifth value in B, 100, is obviously an outlier, and it pulls the mean from mean(A) = 2.5 up to mean(B) = 22. A single outlier can therefore have a large impact on a metric. Since many machine learning algorithms are based on distance or variance measurements, outliers can also have a large impact on the performance of a model.
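The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the median comparison is an addition for illustration and is not part of the original example.

```python
from statistics import mean, median

A = [1, 2, 3, 4]
B = [1, 2, 3, 4, 100]  # same values plus one outlier

mean_a = mean(A)      # 2.5
mean_b = mean(B)      # 22: the single outlier drags the mean far from the bulk of the data

# Added for comparison (not in the original text): the median barely moves.
median_a = median(A)  # 2.5
median_b = median(B)  # 3

print(f"mean:   A={mean_a}, B={mean_b}")
print(f"median: A={median_a}, B={median_b}")
```

The mean moves by almost a factor of nine, while the median shifts by only 0.5, which is why the mean is said to be sensitive to outliers.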
Multiple linear regression is sensitive to outlier effects, as shown in the following graph where adding a single outlier point derails the solid regression line into the dashed one:
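The same effect can be reproduced numerically. The following is a minimal sketch, assuming a single predictor (simple rather than multiple linear regression) and made-up data points; `fit_line` is a hypothetical helper written for this illustration, not an Amazon ML function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

clean_slope, clean_intercept = fit_line(xs, ys)

# Same data with a single outlier point appended.
outlier_slope, outlier_intercept = fit_line(xs + [6], ys + [100])

print(f"without outlier: slope={clean_slope:.2f}, intercept={clean_intercept:.2f}")
print(f"with outlier:    slope={outlier_slope:.2f}, intercept={outlier_intercept:.2f}")
```

One extra point is enough to pull the fitted slope far from 2, because least squares penalizes the squared distance to every point, including the outlier.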

Removing the samples associated with the outliers is the simplest solution.
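The text does not prescribe a way to decide which samples count as outliers. As one possible sketch, the snippet below uses the common interquartile range (IQR) rule of thumb, dropping values more than 1.5 × IQR outside the quartiles; both the rule and the 1.5 factor are assumptions made for this example.

```python
from statistics import quantiles

def remove_outliers_iqr(values, factor=1.5):
    """Keep only values within [Q1 - factor*IQR, Q3 + factor*IQR].

    The IQR rule and the default factor of 1.5 are a common convention,
    assumed here for illustration; they are not taken from the source text.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]

B = [1, 2, 3, 4, 100]
filtered = remove_outliers_iqr(B)
print(filtered)
```

On very small samples the quartile estimates are unstable, so the threshold should be treated as a starting point rather than a fixed rule.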
Another solution is to apply quantile binning to the predictor: the values are split into N ordered intervals, or bins, each containing approximately the same number of samples, and every value is replaced by the index of its bin. This transforms a numeric (continuous) predictor into a categorical one. For example, [1,2,3,4,5,6,7,8,9,10,11,100] split into three equally sized bins becomes [1,1,1,1,2,2,2,2,3,3,3,3]; the outlier value 100 falls into the third bin along with 9, 10, and 11, so its extreme magnitude is hidden.
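The binning example above can be sketched in a few lines of Python. This is a rank-based illustration only; `quantile_bin` is a hypothetical helper, and Amazon ML's own implementation may differ (for instance in how it handles tied values, which this sketch can split across bins).

```python
def quantile_bin(values, n_bins):
    """Replace each value with the 1-based index of its equal-frequency bin.

    Values are ranked, and the ranks are divided into n_bins groups of
    approximately equal size. Illustrative sketch, not Amazon ML's code.
    """
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n + 1
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]
binned = quantile_bin(values, 3)
print(binned)  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

Because only the rank of each value matters, replacing 100 with 12 or with 1,000,000 would produce exactly the same output.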
The downside of quantile binning is that some granularity of information is lost in the process, which may degrade the performance of the model.
Quantile binning is available as a data transformation process in Amazon ML and is also used to quantify non-linearities in the original dataset.
In fact, Amazon ML applies quantile binning (QB) by default to all continuous variables that do not exhibit a straightforward linear relation to the outcome. In all our trials, and contrary to our prior assumptions, we found QB to be a very effective data transformation in the Amazon ML context.