The original skip-gram algorithm
The skip-gram algorithm discussed up to this point in the book is actually an improvement over the original skip-gram algorithm proposed in the 2013 paper by Mikolov and others. In that paper, the algorithm did not use an intermediate hidden layer to learn the representations. Instead, the original algorithm used two different embedding (or projection) layers (the input and output embeddings in Figure 4.1) and defined a cost function derived from the embeddings themselves:

Figure 4.1: The original skip-gram algorithm without hidden layers
The original negative sampled loss was defined as follows:

$$J(\theta) = \log \sigma\left({v'_{w_j}}^{\top} v_{w_i}\right) + \sum_{q=1}^{k} \mathbb{E}_{w_q \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_q}}^{\top} v_{w_i}\right)\right]$$

Here, $v$ is the input embeddings layer, $v'$ is the output word embeddings layer, $v_{w_i}$ corresponds to the embedding vector for the word $w_i$ in the input embeddings layer, and $v'_{w_i}$ corresponds to the word vector for the word $w_i$ in the output embeddings layer. $P_n(w)$ is the noise distribution, from which we sample noise samples (for example, it can be as simple as uniformly sampling from $\mathcal{V} \setminus \{w_i, w_j\}$, as we saw in Chapter 3, Word2vec – Learning Word Embeddings). Finally, $\mathbb{E}$ denotes the expectation (average) of the loss obtained over the $k$ negative samples $w_q$. As you can see, there are no weights and biases in this equation except for the word embeddings themselves.
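To make the equation concrete, here is a small NumPy sketch (an illustration only; the vectors, the embedding size of 3, and k=2 negative samples are all made-up values) that evaluates the objective for a single target-context pair:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_wi = np.array([0.2, -0.1, 0.4])         # input embedding of the target word w_i (made up)
v_prime_wj = np.array([0.3, 0.1, -0.2])   # output embedding of the context word w_j (made up)
v_prime_neg = np.array([[0.5, 0.2, 0.1],  # output embeddings of k=2 negative samples (made up)
                        [-0.3, 0.4, 0.2]])

positive_term = np.log(sigmoid(np.dot(v_prime_wj, v_wi)))
negative_term = np.mean(np.log(sigmoid(-v_prime_neg.dot(v_wi))))  # average over the k samples
objective = positive_term + negative_term  # both terms are <= 0; higher is better

Both terms grow toward zero as the target-context score increases and the target-noise scores decrease, which is exactly what we want the embeddings to achieve.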
Implementing the original skip-gram algorithm
Implementing the original skip-gram algorithm is not as straightforward as the version we have already implemented, because the loss function needs to be handcrafted using TensorFlow functions; there is no built-in function for calculating this loss as there was for the other algorithms.
First, let's define placeholders for the following:
- Input data: This is a placeholder containing a batch of target words, of shape [batch_size]
- Output data: This is a placeholder containing the corresponding context words for the batch of target words, of shape [batch_size, 1]

train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int64, shape=[batch_size, 1])
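At training time, these placeholders are fed a batch of target-word IDs and their context-word IDs, for example as follows (a minimal sketch; batch_inputs and batch_labels are hypothetical NumPy arrays produced by the data generator, not names from the chapter's code):

feed_dict = {
    train_dataset: batch_inputs,  # hypothetical [batch_size] array of target word IDs
    train_labels: batch_labels    # hypothetical [batch_size, 1] array of context word IDs
}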
With the input and output placeholders defined, we can use TensorFlow's built-in candidate sampler, tf.nn.log_uniform_candidate_sampler, to sample negative samples, as shown in the following code:

negative_samples, _, _ = tf.nn.log_uniform_candidate_sampler(
    train_labels, num_true=1, num_sampled=num_sampled,
    unique=True, range_max=vocabulary_size)
Here, negative word IDs are drawn from a log-uniform (Zipfian) distribution, which assumes that word IDs are ordered from the most frequent to the least frequent and therefore favors frequent words as negative samples. train_labels contains the true samples, so TensorFlow can avoid producing them as negative samples. num_true denotes the number of true classes for a given data point, which is 1 in our case. Next comes the number of negative samples we want for a batch of data (num_sampled). unique defines whether the negative samples should be unique. Finally, range_max defines the maximum ID a word can have, so that the sampler doesn't produce any invalid word IDs.
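As a quick, standalone illustration of what the sampler returns (a toy example with assumed values, not part of the chapter's code), the following prints five distinct word IDs drawn from a vocabulary of 1,000, skewed toward the smaller, more frequent IDs:

import tensorflow as tf

toy_labels = tf.constant([[10], [25]], dtype=tf.int64)  # two assumed "true" context-word IDs
toy_negatives, _, _ = tf.nn.log_uniform_candidate_sampler(
    true_classes=toy_labels, num_true=1, num_sampled=5,
    unique=True, range_max=1000)
with tf.Session() as sess:
    print(sess.run(toy_negatives))  # five distinct IDs in [0, 1000), biased toward small IDs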
We get rid of the softmax weights and biases. Then, we introduce two embedding layers, one for the input data and the other for the output data. Two embedding layers are needed because if we had only one embedding layer, the cost function would not work, as discussed in Chapter 3, Word2vec – Learning Word Embeddings.
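The two embedding layers can be defined as TensorFlow variables, for example as follows (a minimal sketch: the names in_embeddings and out_embeddings match the lookup code below, while embedding_size and the uniform initialization range are assumptions):

# Input (target-word) and output (context-word) embedding matrices.
# embedding_size and the [-1.0, 1.0] initialization range are assumed values.
in_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
out_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))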
Let's perform embedding lookups for the input data, output data, and negative samples:

in_embed = tf.nn.embedding_lookup(in_embeddings, train_dataset)
out_embed = tf.nn.embedding_lookup(out_embeddings, tf.reshape(train_labels, [-1]))
negative_embed = tf.nn.embedding_lookup(out_embeddings, negative_samples)
Next, we will define the loss function, which is the most important part of the code. This code implements the loss function we discussed earlier. However, unlike in the loss definition, we do not calculate the loss for all the words in a document at once, because a document can be too large to fit fully into memory. Therefore, we calculate the loss for a small batch of data at each step. The full code is available in the ch4_word2vec_improvements.ipynb exercise book located in the ch4 folder:
# Computing the loss for the positive samples: out_embed x in_embed^T is a
# [batch_size, batch_size] score matrix whose diagonal holds the dot product
# between each target word and its own context word; the identity mask keeps
# only those diagonal entries before summing over each column.
loss = tf.reduce_mean(
    tf.log(
        tf.nn.sigmoid(
            tf.reduce_sum(
                tf.diag([1.0 for _ in range(batch_size)]) *
                tf.matmul(out_embed, tf.transpose(in_embed)),
                axis=0)
        )
    )
)

# Computing the loss for the negative samples: negative_embed x in_embed^T is
# a [num_sampled, batch_size] score matrix; the log-sigmoid of the negated
# scores is summed over the negative samples and averaged over the batch.
loss += tf.reduce_mean(
    tf.reduce_sum(
        tf.log(tf.nn.sigmoid(
            -tf.matmul(negative_embed, tf.transpose(in_embed)))),
        axis=0
    )
)
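The two terms above together form the objective from the earlier equation, which we want to maximize (both terms are at most zero). A minimal, assumed optimizer setup would therefore minimize its negative; the optimizer choice and learning rate below are illustrative assumptions rather than the chapter's exact configuration:

# Assumed training step: maximize the objective by minimizing its negative.
# AdagradOptimizer and the 1.0 learning rate are illustrative choices.
optimizer = tf.train.AdagradOptimizer(learning_rate=1.0).minimize(-loss)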
Note
TensorFlow implements sampled_softmax_loss by extracting, from the full softmax weights and biases, the smaller subset of parameters that is required to process the current batch of data. Thereafter, TensorFlow computes the loss similarly to the standard softmax cross-entropy calculation. However, we cannot directly translate that approach to calculate the original skip-gram loss, as there are no softmax weights and biases.
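As a rough, conceptual sketch of what this note describes (not TensorFlow's actual implementation; softmax_weights and softmax_biases are hypothetical variables that do not exist in the original skip-gram model), sampled softmax only gathers the parameter rows needed for the sampled classes instead of using the full softmax layer:

# Conceptual sketch only: gather just the rows needed for the sampled classes.
# softmax_weights ([vocabulary_size, embedding_size]) and softmax_biases
# ([vocabulary_size]) are hypothetical here.
w_subset = tf.nn.embedding_lookup(softmax_weights, negative_samples)  # [num_sampled, embedding_size]
b_subset = tf.nn.embedding_lookup(softmax_biases, negative_samples)   # [num_sampled]
sampled_logits = tf.matmul(in_embed, tf.transpose(w_subset)) + b_subset  # [batch_size, num_sampled]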
Comparing the original skip-gram with the improved skip-gram
We should have a good reason to use a hidden layer, in contrast to the original skip-gram algorithm, which does not use one. Therefore, in Figure 4.2 we will observe the loss behavior of the original skip-gram algorithm and of the improved skip-gram algorithm that includes a hidden layer:

Figure 4.2: The original skip-gram algorithm versus the improved skip-gram algorithm
We can clearly see that having a hidden layer leads to better performance compared with not having one. This also suggests that deeper Word2vec models tend to perform better.