The original skip-gram algorithm
The skip-gram algorithm discussed up to this point in the book is actually an improvement over the original skip-gram algorithm proposed in the 2013 paper by Mikolov and others. In that paper, the algorithm did not use an intermediate hidden layer to learn the representations. Instead, the original algorithm used two different embedding (or projection) layers (the input and output embeddings in Figure 4.1) and defined a cost function derived from the embeddings themselves:

Figure 4.1: The original skip-gram algorithm without hidden layers
The original negative sampled loss was defined as follows:

$$J(\theta) = \log \sigma\left({v'_{w_j}}^{\top} v_{w_i}\right) + \sum_{q=1}^{k} \mathbb{E}_{w_q \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_q}}^{\top} v_{w_i}\right)\right]$$

Here, $v$ is the input embeddings layer, $v'$ is the output word embeddings layer, $v_{w_i}$ corresponds to the embedding vector for the word $w_i$ in the input embeddings layer, and $v'_{w_i}$ corresponds to the word vector for the word $w_i$ in the output embeddings layer. $P_n(w)$ is the noise distribution, from which we sample noise samples (for example, it can be as simple as uniformly sampling from $\mathcal{V} \setminus \{w_i, w_j\}$, as we saw in Chapter 3, Word2vec – Learning Word Embeddings). Finally, $\mathbb{E}$ denotes the expectation (average) of the loss obtained over the $k$ negative samples $w_q$. As you can see, there are no weights and biases in this equation except for the word embeddings themselves.
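To make the equation concrete, here is a small NumPy sketch (an illustration only; the vectors, the embedding size of 3, and k=2 negative samples are all made-up values) that evaluates the objective for a single target-context pair:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_wi = np.array([0.2, -0.1, 0.4])         # input embedding of the target word w_i (made up)
v_prime_wj = np.array([0.3, 0.1, -0.2])   # output embedding of the context word w_j (made up)
v_prime_neg = np.array([[0.5, 0.2, 0.1],  # output embeddings of k=2 negative samples (made up)
                        [-0.3, 0.4, 0.2]])

positive_term = np.log(sigmoid(np.dot(v_prime_wj, v_wi)))
negative_term = np.mean(np.log(sigmoid(-v_prime_neg.dot(v_wi))))  # average over the k samples
objective = positive_term + negative_term  # both terms are <= 0; higher is better

Both terms grow toward zero as the target-context score increases and the target-noise scores decrease, which is exactly what we want the embeddings to achieve.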
Implementing the original skip-gram algorithm
Implementing the original skip-gram algorithm is not as straightforward as the version we have already implemented, because the loss function needs to be handcrafted using TensorFlow functions; there is no built-in function for calculating this loss as there was for the other algorithms.
First, let's define placeholders for the following:
- Input data: This is a placeholder containing a batch of target words, of shape [batch_size]
- Output data: This is a placeholder containing the corresponding context words for the batch of target words, of shape [batch_size, 1]

train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int64, shape=[batch_size, 1])
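At training time, these placeholders are fed a batch of target-word IDs and their context-word IDs, for example as follows (a minimal sketch; batch_inputs and batch_labels are hypothetical NumPy arrays produced by the data generator, not names from the chapter's code):

feed_dict = {
    train_dataset: batch_inputs,  # hypothetical [batch_size] array of target word IDs
    train_labels: batch_labels    # hypothetical [batch_size, 1] array of context word IDs
}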
With the input and output placeholders defined, we can use TensorFlow's built-in candidate sampler, tf.nn.log_uniform_candidate_sampler, to sample negative samples, as shown in the following code:

negative_samples, _, _ = tf.nn.log_uniform_candidate_sampler(
    train_labels, num_true=1, num_sampled=num_sampled,
    unique=True, range_max=vocabulary_size)
Here, negative word IDs are drawn from a log-uniform (Zipfian) distribution, which assumes that word IDs are ordered from the most frequent to the least frequent and therefore favors frequent words as negative samples. train_labels contains the true samples, so TensorFlow can avoid producing them as negative samples. num_true denotes the number of true classes for a given data point, which is 1 in our case. Next comes the number of negative samples we want for a batch of data (num_sampled). unique defines whether the negative samples should be unique. Finally, range_max defines the maximum ID a word can have, so that the sampler doesn't produce any invalid word IDs.
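As a quick, standalone illustration of what the sampler returns (a toy example with assumed values, not part of the chapter's code), the following prints five distinct word IDs drawn from a vocabulary of 1,000, skewed toward the smaller, more frequent IDs:

import tensorflow as tf

toy_labels = tf.constant([[10], [25]], dtype=tf.int64)  # two assumed "true" context-word IDs
toy_negatives, _, _ = tf.nn.log_uniform_candidate_sampler(
    true_classes=toy_labels, num_true=1, num_sampled=5,
    unique=True, range_max=1000)
with tf.Session() as sess:
    print(sess.run(toy_negatives))  # five distinct IDs in [0, 1000), biased toward small IDs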
We get rid of the softmax weights and biases. Then, we introduce two embedding layers, one for the input data and the other for the output data. Two embedding layers are needed because if we had only one embedding layer, the cost function would not work, as discussed in Chapter 3, Word2vec – Learning Word Embeddings.
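The two embedding layers can be defined as TensorFlow variables, for example as follows (a minimal sketch: the names in_embeddings and out_embeddings match the lookup code below, while embedding_size and the uniform initialization range are assumptions):

# Input (target-word) and output (context-word) embedding matrices.
# embedding_size and the [-1.0, 1.0] initialization range are assumed values.
in_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
out_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))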
Let's perform embedding lookups for the input data, output data, and negative samples:

in_embed = tf.nn.embedding_lookup(in_embeddings, train_dataset)
out_embed = tf.nn.embedding_lookup(out_embeddings, tf.reshape(train_labels, [-1]))
negative_embed = tf.nn.embedding_lookup(out_embeddings, negative_samples)
Next, we will define the loss function, which is the most important part of the code. This code implements the loss function we discussed earlier. However, unlike in the loss definition, we do not calculate the loss for all the words in a document at once, because a document can be too large to fit fully into memory. Therefore, we calculate the loss for a small batch of data at each step. The full code is available in the ch4_word2vec_improvements.ipynb exercise book located in the ch4 folder:
# Computing the loss for the positive samples: out_embed x in_embed^T is a
# [batch_size, batch_size] score matrix whose diagonal holds the dot product
# between each target word and its own context word; the identity mask keeps
# only those diagonal entries before summing over each column.
loss = tf.reduce_mean(
    tf.log(
        tf.nn.sigmoid(
            tf.reduce_sum(
                tf.diag([1.0 for _ in range(batch_size)]) *
                tf.matmul(out_embed, tf.transpose(in_embed)),
                axis=0)
        )
    )
)

# Computing the loss for the negative samples: negative_embed x in_embed^T is
# a [num_sampled, batch_size] score matrix; the log-sigmoid of the negated
# scores is summed over the negative samples and averaged over the batch.
loss += tf.reduce_mean(
    tf.reduce_sum(
        tf.log(tf.nn.sigmoid(
            -tf.matmul(negative_embed, tf.transpose(in_embed)))),
        axis=0
    )
)
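The two terms above together form the objective from the earlier equation, which we want to maximize (both terms are at most zero). A minimal, assumed optimizer setup would therefore minimize its negative; the optimizer choice and learning rate below are illustrative assumptions rather than the chapter's exact configuration:

# Assumed training step: maximize the objective by minimizing its negative.
# AdagradOptimizer and the 1.0 learning rate are illustrative choices.
optimizer = tf.train.AdagradOptimizer(learning_rate=1.0).minimize(-loss)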
Note
TensorFlow implements sampled_softmax_loss by extracting, from the full softmax weights and biases, the smaller subset of parameters that is required to process the current batch of data. Thereafter, TensorFlow computes the loss similarly to the standard softmax cross-entropy calculation. However, we cannot directly translate that approach to calculate the original skip-gram loss, as there are no softmax weights and biases.
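As a rough, conceptual sketch of what this note describes (not TensorFlow's actual implementation; softmax_weights and softmax_biases are hypothetical variables that do not exist in the original skip-gram model), sampled softmax only gathers the parameter rows needed for the sampled classes instead of using the full softmax layer:

# Conceptual sketch only: gather just the rows needed for the sampled classes.
# softmax_weights ([vocabulary_size, embedding_size]) and softmax_biases
# ([vocabulary_size]) are hypothetical here.
w_subset = tf.nn.embedding_lookup(softmax_weights, negative_samples)  # [num_sampled, embedding_size]
b_subset = tf.nn.embedding_lookup(softmax_biases, negative_samples)   # [num_sampled]
sampled_logits = tf.matmul(in_embed, tf.transpose(w_subset)) + b_subset  # [batch_size, num_sampled]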
Comparing the original skip-gram with the improved skip-gram
We should have a good reason to use a hidden layer, in contrast to the original skip-gram algorithm, which does not use one. Therefore, in Figure 4.2 we will observe the loss behavior of the original skip-gram algorithm and of the improved skip-gram algorithm that includes a hidden layer:

Figure 4.2: The original skip-gram algorithm versus the improved skip-gram algorithm
We can clearly see that having a hidden layer leads to better performance compared with not having one. This also suggests that deeper Word2vec models tend to perform better.