
Training deep networks

As we mentioned in Chapter 2, Neural Networks, we can use different algorithms to train a neural network, but in practice we almost always use Stochastic Gradient Descent (SGD) combined with backpropagation, which we introduced in that same chapter. In a way, this combination has withstood the test of time, outliving other approaches, such as DBNs. With that said, gradient descent has some extensions that are worth discussing.

In the following section, we'll introduce momentum, which is an effective improvement over vanilla gradient descent. You may recall the weight update rule that we introduced in Chapter 2, Neural Networks:

  w → w − λ∇J(w), where λ is the learning rate.
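
To make this rule concrete, here is a minimal NumPy sketch of a single gradient descent step, applied to an assumed toy loss J(w) = w²; the loss, its gradient, and the hyperparameter values are illustrative choices rather than anything we derived earlier:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """One vanilla gradient descent step: w -> w - lr * grad J(w)."""
    return w - lr * grad(w)

grad_J = lambda w: 2 * w      # gradient of the toy loss J(w) = w**2
w = np.array([2.0])           # arbitrary starting point
for _ in range(50):
    w = sgd_step(w, grad_J)
print(w)                      # approaches the minimum at w = 0
```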

To include momentum, we'll add another parameter to this equation. 

  1. First, we'll calculate the weight update value: Δw_t = μΔw_{t-1} − λ∇J(w)
  2. Then, we'll update the weight: w → w + Δw_t

From the preceding equations, we can see that the first component, μΔw_{t-1}, is the momentum term. Δw_{t-1} represents the previous value of the weight update, and μ is a coefficient that determines how much the new value depends on the previous ones. To explain this, let's look at the following diagram, where you will see a comparison between vanilla SGD and SGD with momentum. The concentric ellipses represent the surface of the error function, where the innermost ellipse is the minimum and the outermost is the maximum. Think of the loss function surface as the surface of a hill. Now, imagine that we are holding a ball at the top of the hill (the maximum). If we drop the ball, gravity will pull it toward the bottom of the hill (the minimum). The more distance it travels, the more its speed will increase. In other words, it will gain momentum (hence the name of the optimization). As a result, it will reach the bottom of the hill faster. If, on the other hand, the ball couldn't gain momentum, it would roll at a constant speed and reach the bottom more slowly, which is how vanilla SGD behaves (a code sketch of the momentum update follows the diagram):

A comparison between vanilla SGD and SGD + momentum
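
To tie the two update steps together, here is a minimal NumPy sketch of SGD with momentum, reusing the same assumed toy loss J(w) = w²; the values chosen for λ (lr) and μ (mu) are illustrative:

```python
import numpy as np

def momentum_step(w, delta_w_prev, grad, lr=0.1, mu=0.9):
    """One SGD + momentum update:
       delta_w = mu * delta_w_prev - lr * grad J(w)
       w       = w + delta_w
    """
    delta_w = mu * delta_w_prev - lr * grad(w)
    return w + delta_w, delta_w

grad_J = lambda w: 2 * w                       # gradient of J(w) = w**2
w, delta_w = np.array([2.0]), np.array([0.0])  # start with zero momentum
for _ in range(100):
    w, delta_w = momentum_step(w, delta_w, grad_J)
print(w)                                       # approaches the minimum at w = 0
```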

In practice, you may encounter other gradient descent optimizations, such as Nesterov momentum, ADADELTA (https://arxiv.org/abs/1212.5701), RMSProp (https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf), and Adam (https://arxiv.org/abs/1412.6980). Some of these will be discussed in later chapters of the book.
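
As a brief illustration only (not part of this section's discussion), such optimizers are typically available out of the box in deep learning frameworks; for example, in PyTorch they could be instantiated as follows, assuming a placeholder linear model:

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(10, 1)  # placeholder model; any torch.nn.Module works

# SGD with momentum (pass nesterov=True for Nesterov momentum)
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adaptive learning-rate optimizers mentioned in the text
adadelta = optim.Adadelta(model.parameters())
rmsprop = optim.RMSprop(model.parameters(), lr=0.001)
adam = optim.Adam(model.parameters(), lr=0.001)
```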
