- Effective Amazon Machine Learning
- Alexis Perrier
Diving further into linear modeling for prediction
Amazon ML is based on linear modeling. Recall the equation for a straight line in the plane:

$$y = ax + b$$
This linear equation with coefficients (a, b) can be interpreted as a predictive linear model with x as the predictor and y as the outcome. In this simple case, we have two parameters (a, b) and one predictor x. An example would be predicting the height of children from their weight, that is, finding some a and b such that the following equation holds:

$$\text{height} = a \cdot \text{weight} + b$$
Let's consider the classic Lewis and Taylor (1967) dataset, with 237 samples of children's age, weight, height, and gender (https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm), and focus on the relation between the height and weight of the children. On this dataset, the optimal regression line takes the form:

$$\text{height} = \hat{a} \cdot \text{weight} + \hat{b}$$

where $\hat{a}$ and $\hat{b}$ are the slope and intercept estimated from the data.
The following figure illustrates the height versus weight dataset and the associated linear regression:

[Figure: scatter plot of height versus weight with the fitted regression line]
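As an illustration, here is a minimal sketch of how such a line could be fitted with NumPy. The weights and heights arrays below are hypothetical stand-ins, not the actual Lewis and Taylor data, which you would load yourself:

```python
import numpy as np

# Hypothetical stand-in for the Lewis and Taylor (1967) data;
# replace with the real weight/height columns of the dataset.
weights = np.array([52.0, 61.5, 70.2, 83.0, 95.5])   # pounds
heights = np.array([55.1, 58.3, 61.0, 64.2, 66.8])   # inches

# np.polyfit with deg=1 returns the least-squares slope a and intercept b
a, b = np.polyfit(weights, heights, deg=1)
print(f"height = {a:.3f} * weight + {b:.3f}")
```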
Consider now that we have not one predictor but several, and let's generalize the preceding linear equation to n predictors denoted by $\{x_1, \ldots, x_n\}$ and n+1 coefficients, or weights, $\{w_0, w_1, \ldots, w_n\}$. The linear model can be written as follows:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
Here, $\hat{y}$ denotes the predicted value ($y$ would correspond to the true value to be predicted). To simplify notation, we will assume for the rest of the book that the coefficient $w_0 = 0$.
This equation can be rewritten in vector form as follows:

$$\hat{y} = W^{T} X$$
where $T$ is the transpose operator, and $X = \{x_1, \ldots, x_n\}$ and $W = \{w_1, \ldots, w_n\}$ are the respective vectors of predictors and model weights. Under certain conditions, the coefficients $w_i$ can be calculated exactly by solving the so-called normal equations. However, for a large number of samples N, this exact solution is computationally expensive, as it involves matrix products and inversions whose cost grows with the size of the dataset. As the number of samples grows, it becomes more efficient to estimate the model coefficients via an iterative process.
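For concreteness, here is a minimal NumPy sketch of the closed-form normal-equations solution, $W = (X^T X)^{-1} X^T y$. The design matrix X and outcome y below are synthetic placeholders:

```python
import numpy as np

# Hypothetical design matrix: N samples (rows) x n predictors (columns)
X = np.random.rand(1000, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * np.random.randn(1000)  # noisy linear outcome

# Closed-form least-squares solution via the normal equations:
# W = (X^T X)^-1 X^T y; np.linalg.solve avoids an explicit inverse.
W = np.linalg.solve(X.T @ X, X.T @ y)
print(W)  # should be close to true_w
```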
The Stochastic Gradient Descent (SGD) algorithm iteratively estimates the coefficients $\{w_0, w_1, \ldots, w_n\}$ of the model. At each iteration, it uses a random sample of the training dataset for which the real outcome value is known. The SGD algorithm works by minimizing a function of the prediction error:

$$W^{*} = \arg\min_{W} \sum_{i=1}^{N} L\big(y_i,\; W^{T} X_i\big)$$
Functions that take the prediction error as an argument are also called loss functions. Different loss functions result in different algorithms. A convex loss function has a unique minimum, which corresponds to the optimal set of weights for the regression problem. We will come back to the SGD algorithm in detail in later chapters. Suffice it to say for now that the SGD algorithm is especially well suited to dealing with large datasets.
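To make the iteration concrete, here is a minimal sketch of SGD for linear regression with a squared-error loss. This is an illustrative toy implementation, not Amazon ML's internal one, and the data, learning rate, and epoch count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N samples, n predictors, known true weights
X = rng.random((1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(1000)

W = np.zeros(3)          # initial weights (initialization matters)
learning_rate = 0.05     # step size; convergence depends on this value

for epoch in range(20):
    for i in rng.permutation(len(X)):      # visit samples in random order
        error = X[i] @ W - y[i]            # prediction error on one sample
        W -= learning_rate * error * X[i]  # gradient step for squared loss

print(W)  # should approach true_w
```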
There are many reasons that justify selecting the SGD algorithm for general-purpose predictive analytics problems:
- It is robust
- Its convergence properties have been extensively studied and are well known
- It is well adapted to optimization techniques
- It has many extensions and variants
- It has low computational cost
- It can be applied to regression, classification, and streaming data
Some weaknesses include the following:
- The need to properly initialize its parameters
- A convergence rate that depends on a parameter called the learning rate (the step size in the update rule sketched below)
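To see where the learning rate enters, here is the generic SGD weight update for a single sample $(X_i, y_i)$ and loss $L$; this is the standard textbook form, not a formula specific to Amazon ML:

$$W \leftarrow W - \eta \, \nabla_W L\big(y_i,\; W^{T} X_i\big)$$

Here, $\eta$ is the learning rate. Too small a value makes convergence slow; too large a value can make the iterates oscillate or diverge, which is why this parameter must be tuned.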