- Effective Amazon Machine Learning
- Alexis Perrier
Diving further into linear modeling for prediction
Amazon ML is based on linear modeling. Recall the equation of a straight line in the plane:

y = ax + b
This linear equation with coefficients (a, b) can be interpreted as a predictive linear model with x as the predictor and y as the outcome. In this simple case, we have two parameters (a, b) and one predictor x. An example would be predicting the height of children from their weight, finding some a and b such that the following equation holds:

height = a * weight + b
Let's consider the classic Lewis Taylor (1967) dataset with 237 samples of children's age, weight, height, and gender (https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm) and focus on the relation between the height and weight of the children. In this dataset, the optimal regression line is given by the following equation:

The following figure illustrates the height versus weight dataset and the associated linear regression:

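A straight-line fit of this kind can be sketched with NumPy's least-squares helper. The sample points below are made-up illustrative values, not the actual Lewis Taylor measurements:

```python
import numpy as np

# Hypothetical (weight, height) pairs for illustration only --
# not the actual Lewis Taylor (1967) data.
weight = np.array([35.0, 40.0, 45.0, 50.0, 55.0, 60.0])
height = np.array([51.0, 54.0, 56.5, 59.0, 61.5, 64.0])

# Fit height = a * weight + b by ordinary least squares.
a, b = np.polyfit(weight, height, deg=1)

print(f"height = {a:.3f} * weight + {b:.3f}")
```

On the real dataset, the fitted (a, b) would of course be the coefficients of the regression line shown in the figure.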
Consider now that we have not one predictor but several, and let's generalize the preceding linear equation to n predictors denoted by {x1, . . ., xn} and n + 1 coefficients, or weights, {w0, w1, . . ., wn}. The linear model can be written as follows:

ŷ = w0 + w1x1 + w2x2 + . . . + wnxn
Here, ŷ denotes the predicted value (y would correspond to the true value to be predicted). To simplify notations, we will assume for the rest of the book that the coefficient w0 = 0.
This equation can be rewritten in vector form as follows:

ŷ = WᵀX
Here, T is the transpose operator, and X = {x1, . . ., xn} and W = {w1, . . ., wn} are the respective vectors of predictors and model weights. Under certain conditions, the coefficients wi can be calculated exactly. However, these direct calculations involve matrix products over all N samples followed by a matrix inversion, which for large datasets is costly and slow. As the number of samples grows, it becomes more efficient to estimate the model coefficients via an iterative process.
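The exact calculation referred to here is the ordinary least-squares solution via the normal equation, W = (XᵀX)⁻¹Xᵀy. A minimal sketch on synthetic data (all values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 predictors, known true weights.
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Normal equation: W = (X^T X)^{-1} X^T y.
# Forming X^T X touches every sample, which is the step that
# becomes expensive as the dataset grows; np.linalg.solve avoids
# the explicit inverse but the overall cost is comparable.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to true_w
```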
The Stochastic Gradient Descent (SGD) algorithm iteratively estimates the coefficients {w0, w1, . . ., wn} of the model. At each iteration, it uses a random sample of the training dataset for which the real outcome value is known. The SGD algorithm works by minimizing a function of the prediction error:

min over W of Σᵢ L(ŷᵢ, yᵢ)
Functions that take the prediction error as an argument are also called loss functions. Different loss functions result in different algorithms. A convex loss function has a unique minimum, which corresponds to the optimal set of weights for the regression problem. We will come back to the SGD algorithm in detail in later chapters. Suffice it to say for now that the SGD algorithm is especially well suited to dealing with large datasets.
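A minimal SGD sketch for linear regression under a squared-error loss may make the iteration concrete. The data is synthetic, and the learning rate and step count are arbitrary choices for illustration, not recommended settings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression problem with known weights (w0 assumed 0).
true_w = np.array([1.5, -2.0])
X = rng.normal(size=(1000, 2))
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(2)          # initial weights
learning_rate = 0.01     # convergence depends on this choice

for step in range(5000):
    i = rng.integers(len(X))            # draw one random training sample
    error = X[i] @ w - y[i]             # prediction error on that sample
    w -= learning_rate * error * X[i]   # gradient step on the squared loss

print(w)  # approaches true_w
```

Note how each iteration touches a single sample rather than the whole dataset, which is exactly what makes SGD attractive at scale.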
There are many reasons to justify selecting the SGD algorithm for general-purpose predictive analysis problems:
- It is robust
- Its convergence properties have been extensively studied and are well known
- It is well adapted to optimization techniques
- It has many extensions and variants
- It has low computational cost
- It can be applied to regression, classification, and streaming data
Some weaknesses include the following:
- The need to properly initialize its parameters
- A convergence rate dependent on a parameter called the learning rate