
Recurrent neural networks (RNNs)

Recurrent neural networks (RNNs) are useful for processing sequential or temporal data, where the data at a given instance or position is highly correlated with the data at the previous time steps or positions. RNNs have already been very successful at processing text data, since a word at a given position is highly correlated with the words preceding it. In an RNN, the network applies the same function at every time step, hence the term recurrent in its name. The architecture of an RNN is illustrated in the following diagram:

Figure 1.12: RNN architecture 

At each given time step, t, a memory state, h_t, is computed based on the previous state, h_{t-1}, at step (t-1) and the input, x_t, at time step t. The new state, h_t, is used to predict the output, o_t, at step t. The equations governing RNNs are as follows:

h_t = f_1(W_{hh} h_{t-1} + W_{xh} x_t)    ... (1)
o_t = f_2(W_{ho} h_t)    ... (2)

If we are predicting the next word in a sentence, then the function f2 is generally a softmax function over the words in the vocabulary. The function f1 can be any activation function based on the problem at hand. 
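To make the recurrence concrete, the following is a minimal NumPy sketch of equations (1) and (2), assuming f_1 = tanh for the memory state and f_2 = softmax over a small, made-up vocabulary; the sizes, random weights, and embedding inputs are illustrative only and not tied to any particular library API:

import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_size, hidden_size = 10, 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
W_ho = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(inputs):
    """inputs: a list of word-embedding vectors x_1, ..., x_T."""
    h = np.zeros(hidden_size)                 # initial memory state h_0
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # equation (1): h_t = f_1(...)
        outputs.append(softmax(W_ho @ h))     # equation (2): o_t = f_2(...)
    return outputs

# Toy usage: a "sentence" of three random word embeddings
sentence = [rng.normal(size=embed_size) for _ in range(3)]
predictions = rnn_forward(sentence)
print(predictions[-1])    # a probability distribution over the 10-word vocabulary

Note that the same weight matrices are reused at every time step, which is exactly what the recurrent connections in Figure 1.12 depict.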

In an RNN, an output error at step t tries to correct the predictions at the previous time steps, k ∈ {1, 2, . . ., t-1}, by propagating the error backward through those steps. This helps the RNN learn long dependencies between words that are far apart from each other. In practice, however, it isn't always possible to learn such long dependencies through an RNN, because of the vanishing and exploding gradient problems.

As you know, neural networks learn through gradient descent, and the relationship of a word at time step t with a word at a prior sequence step k can be learned through the gradient of the memory state h_t^{(i)} with respect to the memory state h_k^{(i)}. This is expressed in the following formula:

\partial h_t^{(i)} / \partial h_k^{(i)} = \prod_{j=k}^{t-1} \partial h_{j+1}^{(i)} / \partial h_j^{(i)}    ... (3)

If the weight connection from the memory state h_k^{(i)} at sequence step k to the memory state h_{k+1}^{(i)} at sequence step (k+1) is given by u_{ii} ∈ W_{hh}, then the following is true:

\partial h_{k+1}^{(i)} / \partial h_k^{(i)} = u_{ii} f_1'(a_{k+1}^{(i)})    ... (4)

In the preceding equation, a_{k+1}^{(i)} is the total input to the memory state i at time step (k+1), such that the following is the case:

a_{k+1}^{(i)} = \sum_{j} u_{ij} h_k^{(j)} + (W_{xh} x_{k+1})^{(i)}    ... (5)
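As a quick sanity check on equations (4) and (5), the short NumPy sketch below compares the analytic per-step derivative u_{ii} f_1'(a_{k+1}^{(i)}) with a finite-difference estimate, assuming f_1 = tanh; the weights, sizes, and the unit index i chosen here are arbitrary illustrations:

import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))

h_k = rng.normal(size=hidden_size)        # memory state at step k
x_next = rng.normal(size=input_size)      # input at step (k+1)

def step(h):
    a = W_hh @ h + W_xh @ x_next          # total input a_{k+1} to every unit
    return np.tanh(a), a                  # h_{k+1} = f_1(a_{k+1}), with f_1 = tanh

i = 1                                     # an arbitrary memory unit
h_next, a = step(h_k)

# Equation (4): u_ii * f_1'(a_{k+1}^{(i)}), where tanh'(a) = 1 - tanh(a)^2
analytic = W_hh[i, i] * (1.0 - np.tanh(a[i]) ** 2)

# Finite-difference estimate of d h_{k+1}^{(i)} / d h_k^{(i)}
eps = 1e-6
h_plus, h_minus = h_k.copy(), h_k.copy()
h_plus[i] += eps
h_minus[i] -= eps
numeric = (step(h_plus)[0][i] - step(h_minus)[0][i]) / (2 * eps)

print(analytic, numeric)                  # the two values agree closely

The agreement simply confirms that, for a fixed unit i, the path from h_k^{(i)} to h_{k+1}^{(i)} runs through the self-recurrent weight u_{ii} and the activation derivative.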

Now that we have everything in place, it's easy to see why the vanishing gradient problem may occur in an RNN. From the preceding equations, (3) and (4), we get the following:

\partial h_t^{(i)} / \partial h_k^{(i)} = \prod_{j=k}^{t-1} u_{ii} f_1'(a_{j+1}^{(i)}) = u_{ii}^{(t-k)} \prod_{j=k}^{t-1} f_1'(a_{j+1}^{(i)})    ... (6)

For RNNs, the memory-state activation function f_1 is generally sigmoid or tanh, both of which suffer from the saturation problem of having low gradients beyond a specified range of input values. Now, since the f_1 derivatives are multiplied with each other, the gradient \partial h_t^{(i)} / \partial h_k^{(i)} can become close to zero if the inputs to the activation functions are operating in the saturation zone, even for relatively moderate values of (t-k). Even if the f_1 functions are not operating in the saturation zone, the gradients of the sigmoid function are always less than 1 (never more than 0.25), and so it is very difficult to learn distant dependencies between words in a sequence. Similarly, there might be exploding gradient problems stemming from the factor u_{ii}^{(t-k)}. Suppose that the distance between steps t and k is around 10, while the weight, u_{ii}, is around two. In such cases, the gradient would be magnified by a factor of 2^{10} = 1024, leading to the exploding gradient problem.
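To put rough numbers on both effects, the following sketch evaluates the two factors in equation (6) for a gap of t - k = 10 steps, assuming a sigmoid memory-state activation; the pre-activation values and the weight u_{ii} = 2 are made up purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t_minus_k = 10

# Vanishing factor: the product of sigmoid derivatives over 10 steps.
# sigmoid'(a) never exceeds 0.25, so the product shrinks very quickly,
# even for moderate, non-saturated pre-activations.
a_values = np.linspace(-2.0, 2.0, t_minus_k)     # hypothetical pre-activations a_{j+1}
sigmoid_grads = sigmoid(a_values) * (1.0 - sigmoid(a_values))
print(np.prod(sigmoid_grads))                    # roughly 2e-8

# Exploding factor: a self-recurrent weight of about 2 over the same gap.
u_ii = 2.0
print(u_ii ** t_minus_k)                         # 1024.0, as in the text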
