Word2vec – a neural network-based approach to learning word representation
"You shall know a word by the company it keeps."
– J.R. Firth
This statement, uttered by J.R. Firth in 1957, lies at the very foundation of Word2vec, as Word2vec techniques use the context of a given word to learn its semantics. Word2vec is a groundbreaking approach that allows the meaning of words to be learned without any human intervention. Specifically, Word2vec learns numerical representations of words by looking at the words surrounding a given word.
We can test the correctness of the preceding quote by imagining a real-world scenario. Imagine you are sitting for an exam and you find this sentence in your first question: "Mary is a very stubborn child. Her pervicacious nature always gets her in trouble." Now, unless you are very clever, you might not know what pervicacious means. In such a situation, you will automatically be compelled to look at the phrases surrounding the word of interest. In our example, pervicacious is surrounded by stubborn, nature, and trouble. Looking at these three words is enough to determine that pervicacious in fact means a state of being stubborn. I think this is adequate evidence of the importance of context to a word's meaning.
Now let's discuss the basics of Word2vec. As already mentioned, Word2vec learns the meaning of a given word by looking at its context and representing it numerically. By context, we refer to a fixed number of words in front of and behind the word of interest. Let's take a hypothetical corpus with N words. Mathematically, this can be represented by a sequence of words denoted by $w_0, w_1, \dots, w_i, \dots, w_N$, where $w_i$ is the ith word in the corpus.
Next, if we want to find a good algorithm that is capable of learning word meanings, given a word, our algorithm should be able to predict the context words correctly. This means that the following probability should be high for any given word $w_i$ (with a context window of m words on either side):

$$P(w_{i-m}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+m} \mid w_i) \approx \prod_{\substack{j=i-m \\ j \neq i}}^{i+m} P(w_j \mid w_i)$$

To arrive at the right-hand side of the equation, we need to assume that, given the target word ($w_i$), the context words are independent of each other (for example, $w_{i-2}$ and $w_{i-1}$ are independent). Though not entirely true, this approximation makes the learning problem practical and works well in practice.
Exercise: is queen = king – he + she?
Before proceeding further, let's do a small exercise to understand how maximizing the previously-mentioned probability leads to finding good meaning (or representations) of words. Consider the following very small corpus:
There was a very rich king. He had a beautiful queen. She was very kind.
Now let's do some manual preprocessing and remove the punctuation and the uninformative words:
was rich king he had beautiful queen she was kind
Now let's form a set of tuples for each word with their context words in the format (target word → context word 1, context word 2). We will assume a context window size of 1 on either side:
was → rich
rich → was, king
king → rich, he
he → king, had
had → he, beautiful
beautiful → had, queen
queen → beautiful, she
she → queen, was
was → she, kind
kind → was
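To make this concrete, a minimal Python sketch along these lines can generate the same (target → context) pairs from the preprocessed sentence, assuming a context window of 1 on either side:

```python
# A minimal sketch of generating (target, context) pairs with a context
# window of 1 word on either side, matching the tuples listed above.
corpus = "was rich king he had beautiful queen she was kind".split()
window_size = 1

pairs = []
for i, target in enumerate(corpus):
    # Collect the words within `window_size` positions of the target word.
    context = [corpus[j]
               for j in range(max(0, i - window_size),
                              min(len(corpus), i + window_size + 1))
               if j != i]
    pairs.append((target, context))

for target, context in pairs:
    print(f"{target} -> {', '.join(context)}")
```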
Remember, our goal is to be able to predict the words on the right, provided the word on the left is given. To do this, for a given word, the words on the right side of the arrow (the context) should share a high numerical or geometrical similarity with the word on the left side (the target). In other words, the word of interest should be conveyed by the words surrounding it. Now let's assume actual numerical vectors to understand how this works. For simplicity, let's only consider a subset of these tuples (the ones for rich, king, she, and queen). Let's begin by assuming the following for the word rich:
rich → [0,0]
To be able to correctly predict was and king from rich, was and king should have a high similarity with the word rich. Let's use the Euclidean distance between vectors as the similarity measure.
Let's try the following values for the words king and was:
king → [0,1]
was → [-1,0]
This works out fine, as the following distances show:
Dist(rich,king) = 1.0
Dist(rich,was) = 1.0
Here, Dist is the Euclidean distance between two words. This is illustrated in Figure 3.3:

Figure 3.3: The positioning of word vectors for the words "rich," "was," and "king"
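As a quick check, a few lines of NumPy (using the toy 2-D vectors assumed above) confirm these distances:

```python
import numpy as np

# The toy 2-D word vectors assumed above.
rich = np.array([0.0, 0.0])
king = np.array([0.0, 1.0])
was  = np.array([-1.0, 0.0])

# Euclidean distance as the similarity measure.
print(np.linalg.norm(rich - king))  # 1.0
print(np.linalg.norm(rich - was))   # 1.0
```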
Now let's consider the following tuple:
king → rich, he
We have already established a relationship between king and rich. However, we are not done yet; the more often we observe a relationship between two words, the closer those words should be. So, let's first adjust the vector of king so that it is a bit closer to rich:
king → [0,0.8]
Next, we need to add the word he to the picture. The word he should be close to king. This is all the information that we have right now about the word he:
he → [0.5,0.8]
At this moment, the graph with the words looks like Figure 3.4:

Figure 3.4: The positioning of word vectors for the words "rich," "was," "king," and "he"
Now let's proceed with the next two tuples: queen → beautiful, she and she → queen, was. Note that I have swapped the order of the tuples as this makes it easier for us to understand the example:
she → queen, was
Now, we will have to use our prior knowledge of English to proceed further. It is reasonable to place the word she at the same distance from the word was as the word he is, because their usage in the context of the word was is equivalent. Therefore, let's use this:
she → [0.5,0.6]
Next, we will place the word queen close to the word she:
queen → [0.0,0.6]
This is illustrated in Figure 3.5:

Figure 3.5: The positioning of word vectors for the words "rich," "was," "king," "he," "she," and "queen"
Next, we have only the following tuple left:
queen → beautiful, she
Here, we encounter the word beautiful. It should be approximately the same distance from the words queen and she. Let's use the following:
beautiful → [0.25,0]
Now we have the following graph depicting the relationships between words. When we observe Figure 3.6, it seems to be a very intuitive representation of the meanings of words:

Figure 3.6: The positioning of word vectors for the words "rich," "was," "king," "he," "she," "queen," and "beautiful"
Now, let's look at the question that has been lurking in our minds since the beginning of this exercise. Are the quantities in this equation equivalent: queen = king – he + she? Well, we've got all the resources that we'll need to solve this mystery now. Let's try the right-hand side of the equation first:
= king – he + she
= [0,0.8] – [0.5,0.8] + [0.5,0.6]
= [0,0.6]
It all works out in the end. If you look at the word vector we assigned to the word queen ([0, 0.6]), you will see that it is exactly the same as the answer we have just computed.
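We can verify this arithmetic with a short NumPy sketch that uses the hand-assigned vectors from this exercise:

```python
import numpy as np

# Hand-assigned 2-D word vectors from the exercise above.
vectors = {
    "rich":      np.array([0.0, 0.0]),
    "was":       np.array([-1.0, 0.0]),
    "king":      np.array([0.0, 0.8]),
    "he":        np.array([0.5, 0.8]),
    "she":       np.array([0.5, 0.6]),
    "queen":     np.array([0.0, 0.6]),
    "beautiful": np.array([0.25, 0.0]),
}

# Compute the right-hand side of the equation: king - he + she.
result = vectors["king"] - vectors["he"] + vectors["she"]
print(result)  # [0.  0.6], exactly the vector assigned to "queen"

# The word whose vector is closest (by Euclidean distance) to the result is "queen".
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
print(closest)  # queen
```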
Note that this is a crude working to show how word embeddings are learned, and this might differ from the exact positions of word embeddings if learned using an algorithm.
However, keep in mind that this is an unrealistically scaled-down exercise compared with what a real-world corpus might look like. So, you will not be able to work out these values by hand just by crunching a dozen numbers. This is where sophisticated function approximators such as neural networks do the job for us. But to use neural networks, we need to formulate our problem in a mathematically precise way. Nevertheless, this is a good exercise that shows the power of word vectors.
Designing a loss function for learning word embeddings
The vocabulary for even a simple real-world task can easily exceed 10,000 words. Therefore, we cannot develop word vectors by hand for large text corpora; we need to devise a way to automatically find good word embeddings using a machine learning algorithm (for example, a neural network) to perform this laborious task efficiently. Also, to use any sort of machine learning algorithm for any sort of task, we need to define a loss, so that completing the task becomes a matter of minimizing that loss. Let's define the loss for finding good word embedding vectors.
First, let's recall the equation we discussed at the beginning of this section:

$$P(w_{i-m}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+m} \mid w_i) \approx \prod_{\substack{j=i-m \\ j \neq i}}^{i+m} P(w_j \mid w_i)$$

With this equation in mind, we can define a cost function for the neural network:

$$J(\theta) = -\frac{1}{N-2m} \sum_{i=m+1}^{N-m} \prod_{\substack{j=i-m \\ j \neq i}}^{i+m} P(w_j \mid w_i)$$

Remember, $J(\theta)$ is a loss (that is, a cost), not a reward. Also, we want to maximize $P(w_j \mid w_i)$. Thus, we need a minus sign in front of the expression to convert it into a cost function.
Now, instead of working with the product operator, let's convert this to log space. Converting the equation to log space will introduce consistency and numerical stability. This gives us the following equation:

$$J(\theta) = -\frac{1}{N-2m} \sum_{i=m+1}^{N-m} \sum_{\substack{j=i-m \\ j \neq i}}^{i+m} \log P(w_j \mid w_i)$$
This formulation of the cost function is known as the negative log-likelihood.
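As a concrete illustration, here is a minimal NumPy sketch of this negative log-likelihood. It assumes (as one simple choice, not the only one) that P(w_j | w_i) is modeled as a softmax over the dot products between an input embedding for w_i and output embeddings for every word in the vocabulary; the vocabulary size, embedding dimensionality, and word indices below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # placeholder sizes for illustration

# Two embedding matrices: one for target (input) words, one for context (output) words.
in_embeddings = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
out_embeddings = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))

def neg_log_likelihood(target_ids, context_ids):
    """Mean of -log P(w_j | w_i) over (target, context) index pairs,
    with P(w_j | w_i) modeled as a softmax over the full vocabulary."""
    total = 0.0
    for i, j in zip(target_ids, context_ids):
        scores = out_embeddings @ in_embeddings[i]            # one score per vocabulary word
        log_probs = scores - np.log(np.sum(np.exp(scores)))   # log softmax
        total += -log_probs[j]
    return total / len(target_ids)

# Example: three (target, context) pairs given as vocabulary indices.
print(neg_log_likelihood([1, 2, 3], [2, 1, 4]))
```

Minimizing this quantity with respect to the two embedding matrices (for example, with gradient descent) is what pushes the word embeddings to organize themselves according to meaning, as described above.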
Now, as we have a well-formulated cost function, a neural network can be used to optimize this cost function. Doing so will force the word vectors or word embeddings to organize themselves well according to their meaning. Now, it is time to learn about the existing algorithms that use this cost function to find good word embeddings.