- Effective Amazon Machine Learning
- Alexis Perrier
- 2021-07-03 00:17:51
Accepting non-linear patterns
A linear regression model implies that the outcome can be estimated by a linear combination of the predictors. This, of course, is not always the case, as features often exhibit nonlinear patterns.
Consider the following graph, where Y depends on X but the relationship displays an obvious quadratic pattern. Fitting a straight line (y = aX + b) to predict Y from X does not work:

Some models and algorithms handle non-linearities naturally, for example, tree-based models or support vector machines with non-linear kernels. A linear regression model does not, whether it is fitted by ordinary least squares or by stochastic gradient descent (SGD).
Transformations: One way to deal with these non-linear patterns in the context of linear regression is to transform the predictors. In the preceding simple example, adding the square of the predictor X to the model would give a much better result. The model is still linear in its coefficients, but it would now be of the following form:
y = aX + b + cX^2
And as shown in the following diagram, the new quadratic model fits the data much better:

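The comparison between the straight line and the quadratic model can be reproduced with a minimal sketch. The data here is synthetic, and the helper names `fit_least_squares` and `sum_squared_error` are hypothetical; the fit uses plain least squares on the normal equations rather than Amazon ML's own training procedure.

```python
def fit_least_squares(rows, targets):
    """Solve the normal equations (A^T A) w = A^T y by Gauss-Jordan elimination."""
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def sum_squared_error(rows, targets, weights):
    """Sum of squared residuals of a linear model on the given rows."""
    return sum(
        (sum(w * v for w, v in zip(weights, r)) - t) ** 2
        for r, t in zip(rows, targets)
    )


# Synthetic data with a purely quadratic relationship (an assumption for illustration).
xs = [x / 2 for x in range(-10, 11)]
ys = [2 * x * x - 3 * x + 1 for x in xs]

linear_rows = [[x, 1.0] for x in xs]             # y = aX + b
quadratic_rows = [[x, 1.0, x * x] for x in xs]   # y = aX + b + cX^2

linear_w = fit_least_squares(linear_rows, ys)
quadratic_w = fit_least_squares(quadratic_rows, ys)

linear_sse = sum_squared_error(linear_rows, ys, linear_w)
quadratic_sse = sum_squared_error(quadratic_rows, ys, quadratic_w)
print("line SSE:", linear_sse, "quadratic SSE:", quadratic_sse)
```

The only change between the two fits is the extra X^2 column: the estimator stays linear, and the added predictor is what absorbs the curvature.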
We are not restricted to the quadratic case: a power function of higher order can be used to transform existing attributes and create new predictors. Other useful transformations include the logarithm, the exponential, sine and cosine, and so on. The Box-Cox transformation (http://onlinestatbook.com/2/transformations/box-cox.html) is worth citing at this point. It is an efficient data transformation for strictly positive variables that reduces the skewness and kurtosis of a variable's distribution, reshaping it into one closer to a Gaussian distribution.
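As a rough illustration of how a Box-Cox transformation reduces skewness, the sketch below applies the standard formula for a fixed parameter. The function names `box_cox` and `skewness`, the synthetic data, and the hand-picked lambda values are assumptions for illustration; in practice lambda is usually estimated from the data.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The lambda = 0 case is defined as the natural logarithm.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic right-skewed variable: exponentials of evenly spaced values.
raw = [math.exp(k / 4) for k in range(-8, 9)]

half_power = box_cox(raw, 0.5)   # milder transform, some skew remains
log_scale = box_cox(raw, 0)      # log transform, symmetric for this data

print(skewness(raw), skewness(half_power), skewness(log_scale))
```

For this particular data the logarithm (lambda = 0) removes the skew entirely, because the raw values were built as exponentials of a symmetric sequence; real variables rarely behave that neatly.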
Splines are an excellent and more powerful alternative to polynomial interpolation. Splines are piecewise polynomials that join smoothly. At their simplest, splines consist of straight lines connected together at different points, called knots. Splines are not available in Amazon ML.
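The simplest case described above, straight lines joined at knots, can be sketched as follows. The helper `linear_spline` and the choice of knots are hypothetical, and the example interpolates through the knots rather than fitting a regression spline to noisy data.

```python
from bisect import bisect_right


def linear_spline(knots_x, knots_y):
    """Return a function that joins the knots with straight line segments."""
    def evaluate(x):
        # Locate the segment containing x, clamping to the outer segments.
        i = min(max(bisect_right(knots_x, x) - 1, 0), len(knots_x) - 2)
        x0, x1 = knots_x[i], knots_x[i + 1]
        y0, y1 = knots_y[i], knots_y[i + 1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return evaluate


# Knots placed on a quadratic curve (an assumption for illustration).
knots_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
knots_y = [x * x for x in knots_x]
spline = linear_spline(knots_x, knots_y)

for x in (-1.5, 0.5, 1.0):
    print(x, spline(x), x * x)   # spline value next to the true quadratic value
```

The segments meet exactly at the knots, so the curve is continuous, and adding more knots brings the piecewise-linear approximation closer to the underlying curve.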
Quantile binning is the Amazon ML solution to non-linearities. Splitting a variable into N bins replaces its values with bin membership, so the model no longer has to assume a linear relationship within each bin's interval. Binning has several drawbacks (http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous), the main one being that information is discarded in the process, but it has been shown to generate excellent prediction performance in the Amazon ML platform.
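To make the idea concrete, here is a minimal sketch of quantile binning followed by one-hot encoding. It is not Amazon ML's implementation: the helpers `quantile_edges`, `bin_index`, and `one_hot`, the synthetic data, and the choice of four bins are all assumptions for illustration.

```python
def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points that split the sorted values into equal-count bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def bin_index(value, edges):
    """Index of the bin a value falls into (0 .. len(edges))."""
    return sum(value >= edge for edge in edges)


def one_hot(index, n_bins):
    """Indicator vector with a single 1 at the bin's position."""
    return [1 if i == index else 0 for i in range(n_bins)]


xs = [x / 2 for x in range(-10, 10)]   # 20 synthetic predictor values
ys = [x * x for x in xs]               # quadratic outcome
n_bins = 4

edges = quantile_edges(xs, n_bins)
bins = [bin_index(x, edges) for x in xs]
features = [one_hot(b, n_bins) for b in bins]

# A least-squares linear model on the one-hot indicators (without an intercept)
# predicts the mean outcome of each bin, so the per-bin means stand in for its fit.
bin_means = [
    sum(y for y, b in zip(ys, bins) if b == k) / bins.count(k)
    for k in range(n_bins)
]
print(edges, bin_means)
```

The per-bin means trace the U shape of the quadratic, high in the outer bins and low in the inner ones, which a single slope on X cannot do. The cost is visible too: every value inside a bin receives the same prediction.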