L1 penalty

The basic concept of the L1 penalty, also known as the least absolute shrinkage and selection operator (Lasso; Hastie, Tibshirani, and Friedman, 2009), is that a penalty is used to shrink weights toward zero. The penalty term uses the sum of the absolute weights, so some weights may be shrunk exactly to zero, which means that Lasso can also be used as a form of variable selection. The strength of the penalty is controlled by a hyper-parameter, lambda (λ), which multiplies the sum of the absolute weights; it can be set to a fixed value or, as with other hyper-parameters, optimized using cross-validation or a similar approach.
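As one hedged sketch of tuning the penalty strength by cross-validation, the following uses scikit-learn's LassoCV on simulated data; the candidate grid of penalty values, the number of folds, and the data itself are assumptions for illustration only (note that scikit-learn names this hyper-parameter alpha):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Simulated regression data purely for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Search a grid of candidate penalty strengths with 5-fold cross-validation
model = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5).fit(X, y)
print("Selected penalty strength:", model.alpha_)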

It is easier to describe Lasso using an ordinary least squares (OLS) regression model. In regression, a set of coefficients or model weights is estimated using the least-squared error criterion: the weight/coefficient vector, Θ, is estimated so as to minimize ∑(yᵢ − ŷᵢ)², where ŷᵢ = b + Θxᵢ, yᵢ is the target value we want to predict, and ŷᵢ is the predicted value. Lasso regression adds a penalty term, so the objective becomes minimizing ∑(yᵢ − ŷᵢ)² + λ‖Θ‖₁, where ‖Θ‖₁ is the sum of the absolute values of the elements of Θ. Typically, the intercept or offset term, b, is excluded from this penalty.
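As a rough sketch, this penalized objective can be written out directly; the function below is illustrative only, with the data, weights, and penalty strength assumed to be supplied by the caller:

import numpy as np

def lasso_objective(theta, b, X, y, lam):
    """Sum of squared errors plus the L1 penalty on the weights.

    theta: weight vector, b: intercept (excluded from the penalty),
    lam: penalty strength (lambda).
    """
    y_hat = b + X @ theta                   # predicted values
    sse = np.sum((y - y_hat) ** 2)          # ordinary least-squares loss
    penalty = lam * np.sum(np.abs(theta))   # L1 penalty on the weights only
    return sse + penalty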

There are a number of practical implications of Lasso regression. First, the effect of the penalty depends on the size of the weights, and the size of the weights depends on the scale of the data. Therefore, data is typically standardized to have unit variance first (or at least scaled so that the variance of each variable is equal). The L1 penalty has a tendency to shrink small weights all the way to zero (for an explanation of why this happens, see Hastie, Tibshirani, and Friedman, 2009). If you retain only the variables for which the L1 penalty leaves non-zero weights, it can essentially function as feature selection. The tendency of the L1 penalty to shrink small coefficients to zero can also be convenient for simplifying the interpretation of model results.
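A minimal sketch of this standardize-then-shrink workflow, using scikit-learn on simulated data (the dataset and the penalty strength are assumptions chosen for illustration), is shown below; the coefficients shrunk exactly to zero identify the variables the model effectively drops:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Simulated data where only a few features are truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Standardize so the penalty acts on every variable on the same scale
X_std = StandardScaler().fit_transform(X)

model = Lasso(alpha=1.0).fit(X_std, y)

# Non-zero coefficients indicate the variables retained by the model
print("Retained variables:", np.flatnonzero(model.coef_))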

Applying the L1 penalty to neural networks works in exactly the same way as it does for regression. Suppose X represents the input, Y the outcome or dependent variable, B the parameters, and F the objective function that is optimized to obtain B; that is, we want to minimize F(B; X, Y). The L1 penalty modifies the objective function to F(B; X, Y) + λ‖Θ‖₁, where Θ represents the weights (offsets are typically excluded). The L1 penalty tends to result in a sparse solution (that is, more zero weights) because the penalty's gradient is the same for small and large weights, so each gradient update moves every non-zero weight toward zero by the same amount, driving small weights to exactly zero.
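As one possible illustration of how this looks in practice, the sketch below uses Keras, where an L1 kernel regularizer adds λ times the sum of the absolute weights of each layer to the loss while leaving the bias (offset) terms unpenalized; the layer sizes, input dimension, and penalty strength are assumptions for illustration:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

l1_strength = 0.001  # assumed value for lambda

# kernel_regularizer adds lambda * sum(|weights|) for each layer to the loss;
# bias (offset) terms are not penalized, matching the convention above
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(l1_strength)),
    layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l1(l1_strength)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")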

So far we have only considered the case where λ is a constant, controlling the degree of penalty or regularization across the whole model. With deep neural networks, however, it is possible to use different values for different layers, applying varying degrees of regularization at different depths. One reason for considering such differential regularization is that it is sometimes desirable to allow a greater number of parameters (say, by including more neurons in a particular layer) and then counteract this somewhat through stronger regularization. However, this approach can be quite computationally demanding if we allow the L1 penalty to vary for every layer of a deep neural network and use cross-validation to optimize all possible combinations of penalty values. Therefore, a single value is usually used across the entire model.
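If per-layer penalties are wanted despite the tuning cost, one possible sketch (again using Keras, with hypothetical layer widths and penalty values) is to pair each layer with its own L1 strength, for example penalizing wider layers more heavily:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical (units, l1_strength) pairs: wider layers get stronger shrinkage
layer_specs = [(256, 0.01), (64, 0.005), (32, 0.001)]

model = keras.Sequential([keras.Input(shape=(20,))])
for units, lam in layer_specs:
    model.add(layers.Dense(units, activation="relu",
                           kernel_regularizer=regularizers.l1(lam)))
model.add(layers.Dense(1))
model.compile(optimizer="adam", loss="mse")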
