
The skip-gram algorithm

The first algorithm we will talk about is known as the skip-gram algorithm. Introduced by Mikolov and others in 2013, the skip-gram algorithm exploits the context of the words in written text to learn good word embeddings. Let's go through the skip-gram algorithm step by step.

First, we will discuss the data preparation process, followed by an introduction to the notation required to understand the algorithm. Finally, we will discuss the algorithm itself.

As we discussed in numerous places, the meaning of a word can be elicited from the contextual words surrounding it. However, it is not entirely straightforward to develop a model that exploits this property to learn word meanings.

From raw text to structured data

First, we need to design a mechanism to extract a dataset that can be fed to our learning model. Such a dataset should be a set of tuples of the format (input, output). Moreover, this needs to be created in an unsupervised manner. That is, a human should not have to manually engineer the labels for the data. In summary, the data preparation process should do the following:

  • Capture the surrounding words of a given word
  • Operate in an unsupervised manner

The skip-gram model uses the following approach to design such a dataset:

  1. For a given word wi, a context window size m is assumed. By context window size, we mean the number of words considered as context on a single side. Therefore, for wi, the context window (including the target word wi) will be of size 2m+1 and will look like this: [wi-m, …, wi-1, wi, wi+1, …, wi+m].
  2. Next, input-output tuples are formed as […, (wi, wi-m), …, (wi, wi-1), (wi, wi+1), …, (wi, wi+m), …]; here, m+1 ≤ i ≤ N-m, and N is the number of words in the text. To get a practical insight, let's assume the following sentence and a context window size (m) of 1: The dog barked at the mailman. (A short code sketch of this procedure is given after the list.)

    For this example, the dataset would be as follows:

    [(dog, The), (dog, barked), (barked, dog), (barked, at), …, (the, at), (the, mailman)]
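
The following is a minimal Python sketch of this data preparation step (the function name and tokenization are illustrative assumptions rather than the book's exercise code):

def generate_skip_gram_pairs(words, m):
    """Create (input, output) tuples for words that have a full context window of size m."""
    pairs = []
    for i in range(m, len(words) - m):       # targets with m neighbors on each side
        for j in range(i - m, i + m + 1):    # sweep the context window around the target
            if j != i:
                pairs.append((words[i], words[j]))
    return pairs

sentence = "The dog barked at the mailman".split()
print(generate_skip_gram_pairs(sentence, m=1))
# [('dog', 'The'), ('dog', 'barked'), ('barked', 'dog'), ('barked', 'at'), ..., ('the', 'mailman')]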

Learning the word embeddings with a neural network

Once the data is in the (input, output) format, we can use a neural network to learn the word embeddings. First, let's identify the variables we need to learn the word embeddings. To store the word embeddings, we need a V × D matrix, where V is the vocabulary size and D is the dimensionality of the word embeddings (that is, the number of elements in the vector that represents a single word). D is a user-defined hyperparameter. The higher D is, the more expressive the learned word embeddings will be. This matrix will be referred to as the embedding space or the embedding layer. Next, we have a softmax layer with weights of size D × V and a bias of size V.

Each word will be represented as a one-hot encoded vector of size V, with one element being 1 and all the others being 0. Therefore, an input word and the corresponding output word will each be of size V. Let's refer to the ith input as xi, the corresponding embedding of xi as zi, and the corresponding output as yi.

At this point, we have defined the necessary variables. Next, for each input xi, we will look up the corresponding embedding vector from the embedding layer. This operation provides us with zi, which is a D-sized vector (that is, a D-long embedding vector). Afterwards, we calculate the prediction output for xi using the following transformation:

logit(xi) = ziW + b

ŷi = softmax(logit(xi))

Here, logit(xi) represents the unnormalized scores (that is, logits), ŷi is the V-sized predicted output (representing the probability of the output being a word from the V-sized vocabulary), W is the D × V weight matrix, b is the V × 1 bias vector, and softmax is the softmax activation. We will visualize both the conceptual (Figure 3.7) and implementation (Figure 3.8) views of the skip-gram model, and a short code sketch of this forward pass follows the figures. Here is a summary of the notation:

  • V: This is the size of the vocabulary
  • D: This is the dimensionality of the embedding layer
  • xi: This is the ith input word represented as a one-hot-encoded vector
  • zi: This is the embedding (that is, representation) vector corresponding to the ith input word
  • yi: This is the one-hot-encoded output word corresponding to xi
  • ŷi: This is the predicted output for xi
  • logit(xi): This is the unnormalized score for the input xi
  • ywj: This is the one-hot-encoded representation of the word wj
  • W: This is the softmax weight matrix
  • b: This is the bias of the softmax

    Figure 3.7: The conceptual skip-gram model

Figure 3.8: The implementation of the skip-gram model
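
To make this transformation concrete, here is a minimal NumPy sketch of the forward pass for a single input word (the toy sizes and variable names are illustrative assumptions, not the book's exercise code):

import numpy as np

V, D = 6, 4                                # toy vocabulary size and embedding dimensionality
embeddings = np.random.randn(V, D) * 0.1   # the V x D embedding layer
W = np.random.randn(D, V) * 0.1            # D x V softmax weights
b = np.zeros(V)                            # V-sized softmax bias

x_i = np.zeros(V)
x_i[2] = 1.0                               # one-hot encoded input word (word ID 2)

z_i = embeddings[np.argmax(x_i)]           # embedding lookup: zi is a D-sized vector
logits = z_i @ W + b                       # logit(xi) = ziW + b, a V-sized score vector
y_hat = np.exp(logits) / np.sum(np.exp(logits))   # softmax gives the predicted probabilities
print(y_hat.sum())                         # ~1.0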

Using both the existing and derived entities, we can now use the negative log-likelihood loss function to calculate the loss for a given data point (xi, yi). If you are wondering what P(wj|wi) is, it can be derived from the already defined entities. Next, let's discuss how to calculate P(wj|wi) from ŷi and also derive a formal definition for it.

Note

Why does the original word embeddings paper use two embedding layers?

The original paper (by Mikolov and others, 2013) uses two distinct V × D embedding spaces to denote words in the target space (words when used as the target) and words in the contextual space (words used as context words). One motivation for this is that the same word rarely occurs in its own context, so we want to keep the probability of such events to a minimum. For example, for the target word dog, it is highly unlikely that the word dog is also found in its context (P(dog|dog) ~ 0). Intuitively, if we feed the (xi=dog, yi=dog) data point to the neural network, we are asking the network to incur a higher loss if it predicts dog as a context word of dog. In other words, we are asking the word embedding of the word dog to be very distant from the word embedding of the word dog. This creates a strong contradiction, as the distance between the embedding of a word and itself will be 0. Therefore, we cannot achieve this if we have only a single embedding space. However, having two separate embedding spaces for target words and contextual words gives us this property, because this way we have two separate embedding vectors for the same word. In practice, though, as long as you avoid feeding input-output tuples that have the same word as both input and output, you can work with a single embedding space and eliminate the need for two distinct embedding layers.

Formulating a practical loss function

Let's inspect our loss function more closely. We have derived that, for a single input-output pair, the loss should be the negative log-likelihood of the context word given the target word:

J(θ) = -log P(wj|wi)

However, calculating this particular loss from the entities we have at hand at the moment is not entirely straightforward.

First, let's understand what the P(wj|wi) entity represents. To do this, we will move from an individual-words notation to an individual-data-points notation. That is, we will say that P(wj|wi) is given by the nth data point, which has the one-hot encoded vector of wi as the input (xn) and the one-hot encoded representation of wj as the true output (yn). This is given by the following equation:

P(wj|wi) = exp(logit(xn)wj) / Σwk exp(logit(xn)wk), where the sum runs over all words wk in the vocabulary

The logit(xn) term denotes the unnormalized prediction score (that is, logit) vector (V-sized) obtained for a given input xn, and logit(xn)wj is the score value corresponding to the nonzero index of the one-hot encoded representation of wj (we call this the index of wj from now onwards). Then, we normalize the logit value at the index of wj with respect to all the logit values corresponding to all the words in the entire vocabulary. This particular type of normalization is known as the softmax activation (or normalization). Now, by converting this to log space, we get the following equation:

log P(wj|wi) = logit(xn)wj - log(Σwk exp(logit(xn)wk))

To calculate the logit value at the index of wj efficiently, we can manipulate the variables and arrive at the following notation:

logit(xn)wj = ywj · logit(xn)

Here, ywj is the one-hot encoded vector of wj, and · denotes the elementwise product followed by a sum (that is, a dot product). Since ywj has only a single nonzero element, corresponding to the word wj, only that index of the logit vector is used in the computation; the logit operation has thus been reduced to a simple sum-and-product operation. This is computationally more convenient than locating the value in the logit vector that corresponds to the index of the nonzero element by sweeping through a vector of the size of the vocabulary.

Now, by substituting this logit calculation into the loss, we get the following:

J(θ) = -(ywj · logit(xn) - log(Σwk exp(ywk · logit(xn))))

Let's consider an example to understand this calculation:

I like NLP

We can create input-output tuples as follows:

(like, I)

(like, NLP)

Now let's assume the following one-hot-encoded representations for the preceding words:

like – [1, 0, 0]

I – [0, 1, 0]

NLP – [0, 0, 1]

Next, let's consider the input-output tuple (like, I). When we propagated the input like through the skip-gram learning model, let's assume that we obtained the following logits for the words like, I, and NLP in that order:

[2, 10, 5]

Now softmax outputs for each word in the vocabulary will be as follows:

P(like|like) = exp(2)/(exp(2)+exp(10)+exp(5)) ≈ 0.0003

P(I|like) = exp(10)/(exp(2)+exp(10)+exp(5)) ≈ 0.9930

P(NLP|like) = exp(5)/(exp(2)+exp(10)+exp(5)) ≈ 0.0067

The preceding loss function says that we need to maximize P(I|like) to minimize the loss. Now let's apply our example to this loss function:

J(θ) = -([0, 1, 0] · [2, 10, 5] - log(exp([1, 0, 0] · [2, 10, 5]) + exp([0, 1, 0] · [2, 10, 5]) + exp([0, 0, 1] · [2, 10, 5])))

= -(10 - log(exp(2) + exp(10) + exp(5))) ≈ 0.007

With this loss function, for the term before the minus sign, there is only a single nonzero element in the y vector corresponding to the word I. Therefore, we will only be considering the probability P(I|like), which is exactly what we wanted.
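
The following short NumPy sketch reproduces this worked example, using the logits and one-hot vectors from above:

import numpy as np

logits = np.array([2.0, 10.0, 5.0])    # scores for like, I, and NLP
y = np.array([0.0, 1.0, 0.0])          # one-hot encoded vector of the true context word I

probs = np.exp(logits) / np.sum(np.exp(logits))
print(np.round(probs, 4))              # ~ [0.0003 0.993  0.0067]

loss = -(np.dot(y, logits) - np.log(np.sum(np.exp(logits))))
print(round(loss, 3))                  # ~ 0.007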

However, this is not yet the ideal solution. From a practical perspective, the objective of this loss function is to maximize the probability of predicting a contextual word given a word, while minimizing the probability of "all" the noncontextual words given that word. We will soon see that a well-defined loss function alone does not solve our problem effectively in practice. We will need to devise a cleverer, approximate loss function to learn good word embeddings in a feasible amount of time.

Efficiently approximating the loss function

We are fortunate to have a loss function that is solid both mathematically and intuitively. However, the hard work does not end here. If we try to calculate the loss function in its closed form as we discussed earlier, our algorithm inevitably becomes tremendously slow.

This slowness is due to the large vocabulary causing a performance bottleneck. Let's have another look at our cost function:

J(θ) = -(ywj · logit(xn) - log(Σwk exp(ywk · logit(xn))))

You will see that computing the loss for a single example requires computing logits for all the words in the vocabulary. Unlike computer vision problems, where a few hundred output classes are more than adequate to solve most existing real-world problems, skip-gram does not enjoy such a property, because its output layer spans the entire vocabulary. Therefore, we need to turn to efficient approximations of the loss without losing the efficacy of our model.

We will discuss two popular choices of approximations:

  • Negative sampling
  • Hierarchical softmax
Negative sampling of the softmax layer

Here we will discuss our first approach: negative sampling of the softmax layer. Negative sampling is an approximation of the Noise-Contrastive Estimation (NCE) method. NCE states that a good model should be able to differentiate data from noise by means of logistic regression.

With this property in mind, let's reformulate our objective of learning word embeddings. We do not require a full probabilistic model that assigns exact probabilities to all words in the vocabulary for a given word. What we need are high-quality word vectors. Therefore, we can simplify our problem to differentiating actual data (that is, input-output pairs) from noise (that is, k-many imaginary noise input-output pairs). By noise, we refer to false input-output pairs created using words that do not fall within the context of a given word. We will also get rid of the softmax activation and replace it with a sigmoid activation (also known as the logistic function). This allows us to remove the cost function's dependency on the full vocabulary while keeping the output between 0 and 1. We can visualize the negative sampling process in Figure 3.9.

Precisely, our original loss function for a single input-output pair is given by the following equation:

J(θ) = -log(exp(logit(xn)wj) / Σwk exp(logit(xn)wk))

The preceding formula becomes this:

J(θ) = -log σ(logit(xn)wj) - Σ(q=1 to k) E(wq ~ noise)[log σ(-logit(xn)wq)]

Here, σ denotes the sigmoid activation, where σ(x) = 1/(1+exp(-x)), logit(xn)wj is the logit value at the index of the true context word wj, and logit(xn)wq is the logit value at the index of a sampled noise word wq. You can see that the new loss function depends only on the calculations related to the true context word and k noise words, rather than the entire vocabulary.

After some simplification, replacing the expectation with k explicitly sampled noise words, we arrive at the following equation:

J(θ) = -log σ(logit(xn)wj) - Σ(q=1 to k) log σ(-logit(xn)wq)

Let's take a moment to understand what this equation says. To simplify things, let's assume k=1. This gives us the following equation:

J(θ) = -log σ(logit(xn)wj) - log σ(-logit(xn)wq)

Here, wj represents a context word of wi and wq represents a noncontext word of wi. What this equation essentially says is that, to minimize J(θ), we should make σ(logit(xn)wj) close to 1, which means logit(xn)wj needs to be a large positive value, and we should make σ(-logit(xn)wq) close to 1, which means logit(xn)wq needs to be a large negative value. In other words, true data points, representing target words and their context words, should receive large positive logit values, and fake data points, representing target words paired with noise words, should receive large negative logit values. This is the same behavior you would get with a softmax function, but with greater computational efficiency.

Figure 3.9: The negative sampling process

Here, σ is the sigmoid activation. Intuitively, our loss function calculation performs the following two steps (see the code sketch after this list):

  • Calculating the loss for the logit at the index of wj (pushing it towards a positive value)
  • Calculating the loss for k-many noise samples (pulling their logits towards negative values)
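
Here is a minimal NumPy sketch of this negative sampling loss for a single data point (the noise word indices and variable names are illustrative assumptions, not the book's exercise code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.0, 10.0, 5.0])    # logit(xn) over a 3-word toy vocabulary
true_idx = 1                           # index of the true context word wj
noise_idx = [2]                        # indices of the k sampled noise (noncontext) words

# Push the true pair's logit towards a large positive value ...
loss = -np.log(sigmoid(logits[true_idx]))
# ... and pull each noise pair's logit towards a large negative value
for q in noise_idx:
    loss -= np.log(sigmoid(-logits[q]))
print(loss)
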
Hierarchical softmax

Hierarchical softmax is slightly more complex than negative sampling, but it serves the same objective: approximating the softmax without having to calculate activations for all the words in the vocabulary for every training sample. However, unlike negative sampling, hierarchical softmax uses only the actual data and does not need noise samples. We can visualize the hierarchical softmax model in Figure 3.10.

To understand hierarchical softmax, let's consider an example:

I like NLP. Deep learning is amazing.

The vocabulary for this is as follows:

I, like, NLP, Deep, learning, is, amazing

With this vocabulary, we will build a binary tree in which all the words in the vocabulary are present as leaf nodes. We will also add a special PAD token so that every internal node of the tree has exactly two children:

Figure 3.10: Hierarchical softmax

Then, our last hidden layer will be fully connected to all the nodes in the hierarchy (see Figure 3.11). Note that this model has a similar number of total weights to a classical softmax layer; however, it uses only a subset of them for any given calculation:

Figure 3.11: How the hierarchical softmax connects to the embedding layer

Let's say that we need to infer the probability of P(NLP|like), where like is the input word. Then we only need a subset of the weights to calculate the probability, as shown in Figure 3.12:

Figure 3.12: Calculating probabilities with the hierarchical softmax

Concretely, the probability is calculated as a product of binary (sigmoid-based) decisions made at the internal nodes lying on the path from the root of the tree to the leaf that contains NLP.

Since we now know how to calculate P(wj|wi), we can use the original loss function. Note that this method uses only the weights connected to the nodes on the relevant path for the calculation, resulting in high computational efficiency.
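
As a rough illustration (the tree layout and per-node weights below are assumptions made for the sake of the example, since the actual hierarchy is the one shown in Figure 3.10), the probability of a leaf word can be computed as a product of binary sigmoid decisions taken at the internal nodes on the path from the root to that leaf:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(z_i, path):
    """z_i: embedding vector of the input word.
    path: list of (node_weights, direction) pairs from the root to the target leaf,
          where direction is +1 for one branch and -1 for the other."""
    prob = 1.0
    for node_w, direction in path:
        # Each internal node contributes one binary (sigmoid) decision
        prob *= sigmoid(direction * np.dot(z_i, node_w))
    return prob

# Toy example: two internal nodes lie on the path from the root to the leaf for "NLP"
D = 4
z_like = np.random.randn(D)                          # embedding of the input word "like"
path_to_nlp = [(np.random.randn(D), +1), (np.random.randn(D), -1)]
print(path_probability(z_like, path_to_nlp))         # P(NLP|like) under this toy tree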

Learning the hierarchy

Though hierarchical softmax is efficient, an important question remains unanswered. How do we determine the decomposition of the tree? More precisely, which word will follow which branch? There are a few options to achieve this:

  • Initialize the hierarchy randomly: This method suffers some performance degradation, as random placement cannot guarantee the best possible branching among words.
  • Use WordNet to determine the hierarchy: WordNet can be utilized to determine a suitable ordering of the words in the tree. This method has been shown to perform significantly better than random initialization.
Optimizing the learning model

Since we now have a well-formulated loss function, optimization is a matter of calling the correct function from the TensorFlow library. The optimization process used is stochastic, meaning that we do not feed the full dataset at once, but rather a random batch of data at each of many steps.

Implementing skip-gram with TensorFlow

We will now walk through an implementation of the skip-gram algorithm that uses the TensorFlow library. Here we will only be discussing the essential parts of defining the required TensorFlow operations to learn the embeddings, not running the operations. The full exercise is available in ch3_word2vec.ipynb in the ch3 exercise directory.

First, let's define the hyperparameters of the model. You are free to change these hyperparameters to see how they affect the final performance (for example, batch_size = 16 or batch_size = 256). However, since this is a simple problem compared with more complex real-world problems, you might not see any significant differences (unless you change them to extremes, for example, batch_size = 1 or num_sampled = 1):

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
window_size = 4 # How many words to consider left and right.
valid_size = 16 # Random set of words to evaluate similarity on.
# Only pick dev samples in the head of the distribution.
valid_window = 100 
valid_examples = get_common_and_rare_word_ids(valid_size//2,valid_size//2)
num_sampled = 32 # Number of negative examples to sample.

Next, define TensorFlow placeholders for training inputs, labels, and valid inputs:

train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

Then, define the TensorFlow variables for the embedding layer and softmax weights and bias:

embeddings = tf.Variable(
  tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
  tf.truncated_normal([vocabulary_size, embedding_size],
                      stddev=0.5 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(
  tf.random_uniform([vocabulary_size], 0.0, 0.01))

Next, we will define an embedding lookup operation that gathers the corresponding embeddings of a given batch of training inputs:

embed = tf.nn.embedding_lookup(embeddings, train_dataset)

Afterwards, we will define the softmax loss, using negative sampling:

loss = tf.reduce_mean(
  tf.nn.sampled_softmax_loss(weights=softmax_weights,
  biases=softmax_biases, inputs=embed,
  labels=train_labels, num_sampled=num_sampled,
  num_classes=vocabulary_size))

Here we define an optimizer to optimize (minimize) the previously defined loss function. Feel free to experiment with other optimizers listed at https://www.tensorflow.org/api_guides/python/train:

optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

Next, we compute the similarity between the validation input examples and all embeddings. Cosine similarity is used:

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
  normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings,
  tf.transpose(normalized_embeddings))
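
Once evaluated inside a session, the similarity matrix can be used to print the nearest neighbors of each validation word. The following sketch assumes a reverse_dictionary that maps word IDs back to words (an assumption about the surrounding exercise code):

# Evaluate the similarity operation (inside a tf.Session)
sim = similarity.eval()

top_k = 8  # number of nearest neighbors to display
for i in range(valid_size):
    valid_word = reverse_dictionary[valid_examples[i]]       # assumed ID-to-word mapping
    # Sort by descending similarity; skip index 0, which is the word itself
    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
    print(valid_word, '->', [reverse_dictionary[idx] for idx in nearest])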

With all the TensorFlow variables and operations defined, we can now move on to executing the operations to get some results. Here we will outline the basic procedure for executing these operations, followed by a skeleton of the training loop. You can refer to the exercise file for a complete view of the execution.

  • First initialize the TensorFlow variables with tf.global_variables_initializer().run()
  • For each step (for a predefined number of total steps), do the following:
    • Generate a batch of data (batch_data – inputs, batch_labels – outputs) using the data generator
    • Create a dictionary called feed_dict that maps train input/output placeholders to data generated by the data generator:
      feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    • Execute an optimization step and obtain the loss value as follows:
      _, l = session.run([optimizer, loss], feed_dict=feed_dict)
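
Putting these steps together, a skeleton of the training loop might look as follows (generate_batch, num_steps, and the reporting interval are placeholders, not the exact exercise code; refer to ch3_word2vec.ipynb for the complete version):

num_steps = 100001
average_loss = 0.0

with tf.Session() as session:
    tf.global_variables_initializer().run()
    for step in range(num_steps):
        # generate_batch is assumed to yield (inputs, labels) from the data generator
        batch_data, batch_labels = generate_batch(batch_size, window_size)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        # Run one optimization step and fetch the current loss value
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if (step + 1) % 2000 == 0:
            print('Average loss at step %d: %f' % (step + 1, average_loss / 2000))
            average_loss = 0.0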

We will now discuss another popular Word2vec algorithm known as the Continuous Bag-of-Words (CBOW) model.
