
Training and evaluation in Amazon ML

In the context of Amazon ML, the model is linear regression and the algorithm is Stochastic Gradient Descent (SGD). This algorithm has one main meta parameter, the learning rate, which dictates how much each new sample contributes to each iterative update of the weights. A larger learning rate makes the algorithm converge faster but stabilize further from the optimal weights, while a smaller learning rate induces slower convergence but a more precise set of regression coefficients.
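As an illustration, here is a minimal NumPy sketch of the SGD update for a single sample on a toy linear-regression problem. The function name `sgd_update`, the learning-rate values, and the toy data are assumptions for the example, not Amazon ML's internal implementation:

```python
import numpy as np

def sgd_update(weights, x_i, y_i, learning_rate):
    """One SGD step for linear regression on a single sample (x_i, y_i).

    For the squared error (y_i - x_i . w)^2, the gradient with respect to w
    is -(y_i - x_i . w) * x_i; the learning rate scales how far we move.
    """
    error = y_i - np.dot(x_i, weights)
    return weights + learning_rate * error * x_i

# Toy data: a larger learning rate moves faster but settles less precisely.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

for lr in (0.001, 0.01, 0.1):
    w = np.zeros(3)
    for x_i, y_i in zip(X, y):
        w = sgd_update(w, x_i, y_i, lr)
    print(lr, w)
```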

Given a training and a validation dataset, this is how Amazon ML tunes and selects the best model (a minimal sketch of this selection loop follows the list):

  • Amazon trains several models, each with a different learning rate
  • For a given learning rate:
    • The training dataset allows the SGD to train the model by finding the best regression coefficients
    • The model is used on the validation dataset to make predictions
  • By comparing the quality of the predictions of the different models on that validation set, Amazon ML is able to select the best model and the associated best learning rate
  • The held-out set is used as final confirmation that the model is reliable
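A minimal sketch of that selection loop in plain NumPy, assuming a toy training/validation split and an RMSE metric; the learning-rate grid and the helper `train_sgd` are illustrative assumptions:

```python
import numpy as np

# Toy data split into training and validation subsets.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
X_train, y_train = X[:700], y[:700]
X_val, y_val = X[700:850], y[700:850]

def train_sgd(X, y, lr, epochs=5):
    """Fit linear-regression weights with plain SGD at a fixed learning rate."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w += lr * (y_i - x_i @ w) * x_i
    return w

best_lr, best_rmse = None, float("inf")
for lr in (0.001, 0.01, 0.1):                          # one model per learning rate
    w = train_sgd(X_train, y_train, lr)
    rmse = np.sqrt(np.mean((y_val - X_val @ w) ** 2))  # score on the validation set
    if rmse < best_rmse:                               # keep the best validation score
        best_lr, best_rmse = lr, rmse
print(best_lr, best_rmse)
```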

Usual splitting ratios for the training, validation, and held-out subsets are as follows:

  • Training: 70%, validation and held-out: 15% each
  • Training: 60%, validation and held-out: 20% each

Shuffling: It is important to make sure that the predictors and the outcome follow the same distribution in all three subsets. Shuffling the data before splitting it is an important part of creating reliable training, validation, and held-out subsets.
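A minimal sketch of such a shuffle-and-split, assuming NumPy arrays X and y and the 70/15/15 ratios mentioned above; the function name and random seed are illustrative:

```python
import numpy as np

def shuffle_and_split(X, y, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle the rows, then split into training / validation / held-out subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    held_out_idx = idx[n_train + n_val:]
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[held_out_idx], y[held_out_idx])
```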

It is important to define the data transformations on the training dataset and apply the resulting transformation parameters to the validation and held-out subsets, so that the validation and held-out subsets do not leak information back into the training set.

Take standardization as an example: the standard deviation and the mean of the predictors should be calculated on the training dataset. These values are then applied to standardize the validation and held-out sets. If you use the whole original dataset to calculate the mean and standard deviation, you leak information from the held-out set into the training set.
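A minimal sketch of that idea, assuming NumPy arrays named X_train, X_val, and X_held_out; the toy data is only there to make the snippet runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(700, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(150, 3))
X_held_out = rng.normal(loc=5.0, scale=2.0, size=(150, 3))

# Mean and standard deviation come from the training subset only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# The same parameters standardize all three subsets; nothing is recomputed
# on the validation or held-out data, so no information leaks back.
X_train_std = (X_train - mu) / sigma
X_val_std = (X_val - mu) / sigma
X_held_out_std = (X_held_out - mu) / sigma
```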

A common Supervised Predictive Analytics workflow follows these steps. Let's assume we have an already extracted dataset and that we have chosen a metric to assess the quality of our predictions (a compact end-to-end sketch follows the list):

  1. Building the dataset
    • Cleaning up and transforming the data to handle noisy data issues
    • Creating new predictors
    • Shuffling and splitting the data into a training, a validation and a held-out set
  2. Selecting the best model
    • Choosing a model (linear, tree-based, Bayesian, ...)
    • Repeat for several values of the meta parameters:
      • Train the model on the training set
      • Assess the model performance on the validation set
  3. Repeat steps 1 and 2 with new data, new predictors, and other model parameters until you are satisfied with the performance of your model. Keep the best model.
  4. Final test of the model on the held-out subset.
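To tie the four steps together, here is a compact end-to-end sketch using scikit-learn's SGDRegressor as a stand-in for Amazon ML's linear model; the toy dataset, learning-rate grid, and RMSE metric are assumptions for the example, not Amazon ML's actual configuration:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Step 1: build the dataset -- toy data, shuffled and split 70/15/15.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.2, size=2000)
idx = rng.permutation(len(X))
n_train, n_val = 1400, 300
train, val, held_out = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Standardize with training-set parameters only (see the leakage discussion above).
mu, sigma = X[train].mean(axis=0), X[train].std(axis=0)
X_std = (X - mu) / sigma

def rmse(model, rows):
    return mean_squared_error(y[rows], model.predict(X_std[rows])) ** 0.5

# Step 2: select the best meta parameter on the validation set.
best_model, best_score = None, float("inf")
for lr in (0.001, 0.01, 0.1):
    model = SGDRegressor(learning_rate="constant", eta0=lr, max_iter=1000, random_state=0)
    model.fit(X_std[train], y[train])
    score = rmse(model, val)
    if score < best_score:
        best_model, best_score = model, score

# Step 3 would iterate on data and predictors; step 4 is the final check.
print("validation RMSE:", best_score)
print("held-out RMSE:  ", rmse(best_model, held_out))
```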

In the context of Amazon ML, there is no possibility of choosing a model (step 2) other than linear regression (logistic regression for classification).
