
Accepting non-linear patterns

A linear regression model assumes that the outcome can be estimated by a linear combination of the predictors. This is not always the case, as the relationship between the features and the outcome is often non-linear.

Consider the following graph, where Y depends on X but the relationship is clearly quadratic. Fitting a straight line (y = aX + b) to predict Y from X gives a poor fit:

Some models and algorithms handle non-linearities naturally, for example, tree-based models or support vector machines with non-linear kernels. Linear regression does not, whether it is solved in closed form or trained with stochastic gradient descent (SGD).

Transformations: One way to deal with these non-linear patterns in the context of linear regression is to transform the predictors. In the preceding simple example, adding the square of the predictor X to the model would give a much better result. The model would now take a form such as y = aX² + bX + c, which is still linear in its coefficients a, b, and c.

As shown in the following diagram, the new quadratic model fits the data much better:
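To make the comparison concrete, here is a minimal sketch that fits both a straight line and a quadratic model by ordinary least squares and compares their mean squared errors. The synthetic data, the noise level, and the helper names (`solve`, `fit_least_squares`, `mse`) are assumptions made for illustration; they are not taken from Amazon ML.

```python
import random


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(rows, ys):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def mse(rows, ys, w):
    """Mean squared error of the linear model w on the given rows."""
    preds = [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Synthetic data (an assumption): y = 2x^2 - 3x + 1 plus Gaussian noise
rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [2 * x**2 - 3 * x + 1 + rng.gauss(0, 0.5) for x in xs]

linear_rows = [[1.0, x] for x in xs]           # predictors: intercept, X
quadratic_rows = [[1.0, x, x**2] for x in xs]  # predictors: intercept, X, X^2

w_lin = fit_least_squares(linear_rows, ys)
w_quad = fit_least_squares(quadratic_rows, ys)
mse_lin = mse(linear_rows, ys, w_lin)
mse_quad = mse(quadratic_rows, ys, w_quad)

print(f"linear MSE:    {mse_lin:.3f}")
print(f"quadratic MSE: {mse_quad:.3f}")
```

The fitting procedure is identical in both cases; only the list of predictors changes. That is why the transformation approach works with any linear learner.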

We are not restricted to the quadratic case: higher-order powers can also be used to transform existing attributes and create new predictors. Other useful transformations include the logarithm, the exponential, sine and cosine, and so on. The Box-Cox transformation (http://onlinestatbook.com/2/transformations/box-cox.html) is worth citing at this point. It is a family of power transformations, defined for strictly positive values, that can reduce the skewness and kurtosis of a variable's distribution and reshape it into one closer to a Gaussian distribution.
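As an illustration, the one-parameter Box-Cox transform maps a positive value x to (x^λ - 1) / λ, and to log(x) when λ = 0. The sketch below applies it to synthetic right-skewed data and compares the skewness before and after. The log-normal data, the choice λ = 0, and the helper names `box_cox` and `skewness` are assumptions for illustration; in practice λ is usually estimated from the data, for example by maximum likelihood.

```python
import math
import random


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x**lam - 1) / lam


def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2**1.5


# Synthetic right-skewed data (an assumption): log-normal samples
rng = random.Random(42)
data = [rng.lognormvariate(0, 1) for _ in range(2000)]

# lambda = 0 is the log transform, which maps log-normal data to normal data
transformed = [box_cox(v, 0) for v in data]

skew_before = skewness(data)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```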

Splines are an excellent and more powerful alternative to polynomial interpolation. Splines are piecewise polynomials that join smoothly. In their simplest form, splines consist of straight line segments connected at chosen points, called knots. Splines are not available in Amazon ML.
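The simplest case, lines connected at knots, can be expressed as extra predictors for a linear model. The following sketch builds one hinge feature max(0, x - k) per knot k. The function names, the single knot at 0, and the weights are hypothetical and chosen by hand rather than learned.

```python
def linear_spline_features(x, knots):
    """Map x to [1, x, max(0, x - k1), max(0, x - k2), ...].

    A linear model on these features is a continuous piecewise-linear
    function whose slope can change at each knot.
    """
    return [1.0, x] + [max(0.0, x - k) for k in knots]


def predict(weights, features):
    """Linear combination of the features."""
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical example: one knot at 0. With weights [0, -1, 2] the model is
# -x + 2 * max(0, x), which equals |x|: a V shape that no single line can fit.
knots = [0.0]
weights = [0.0, -1.0, 2.0]

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, predict(weights, linear_spline_features(x, knots)))
```

Because the model stays linear in its weights, the same least-squares or SGD training applies; only the feature construction changes.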

Quantile binning is the Amazon ML solution to non-linearities. It splits a numeric variable into N bins that each hold roughly the same number of samples and treats the bin as a categorical value. The model then learns a separate weight for each bin, so it no longer has to assume a linear relationship across the variable's full range. Binning has several drawbacks (http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous), the main one being that information within each bin is discarded in the process. Even so, it has been shown to generate excellent prediction performance on the Amazon ML platform.
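The idea behind quantile binning can be sketched in a few lines of plain Python. This is not Amazon ML's implementation: the helper names, the noise-free quadratic data, and the choice of five bins are assumptions for illustration. Per-bin means stand in for the learned weights, since they are the least-squares solution for one-hot bin features without an intercept or regularization.

```python
import bisect
import statistics


def quantile_bin_edges(values, n_bins):
    """Return the n_bins - 1 cut points that split values into near-equal-count bins."""
    return statistics.quantiles(values, n=n_bins)


def bin_index(x, edges):
    """Index of the bin that x falls into, from 0 to len(edges)."""
    return bisect.bisect_right(edges, x)


def one_hot(index, n_bins):
    """Encode a bin index as a 0/1 indicator vector."""
    return [1 if i == index else 0 for i in range(n_bins)]


# Synthetic quadratic relationship (an assumption)
xs = [i / 10 for i in range(-30, 31)]
ys = [x**2 for x in xs]

n_bins = 5
edges = quantile_bin_edges(xs, n_bins)
bins = [bin_index(x, edges) for x in xs]

# One weight per bin: here, the mean outcome of the samples in that bin
bin_means = [
    statistics.fmean([y for y, b in zip(ys, bins) if b == k])
    for k in range(n_bins)
]

print("edges:    ", [round(e, 2) for e in edges])
print("bin means:", [round(m, 2) for m in bin_means])
print("encoding of x = 0.0:", one_hot(bin_index(0.0, edges), n_bins))
```

The outer bins get a high weight and the middle bin a low one, so the U shape is captured as a step function. Any variation of Y inside a bin is lost, which is the information loss mentioned above.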
