- Natural Language Processing with TensorFlow
- Thushan Ganegedara
Classical approaches to learning word representation
In this section, we will discuss some of the classical approaches used for numerically representing words. These approaches can be broadly categorized into two classes: approaches that use external resources to represent words and approaches that do not. First, we will discuss WordNet, one of the most popular external resource-based approaches for representing words. Then we will move on to more localized methods (that is, those that do not rely on external resources), such as one-hot encoding and Term Frequency-Inverse Document Frequency (TF-IDF).
WordNet – using an external lexical knowledge base for learning word representations
WordNet is one of the most popular classical approaches for statistical NLP that deal with word representation. It relies on an external lexical knowledge base that encodes information about the definition, synonyms, ancestors, descendants, and so forth of a given word. WordNet allows a user to infer various kinds of information about a given word, such as the aspects discussed in the preceding sentence and the similarity between two words.
Tour of WordNet
As already mentioned, WordNet is a lexical database that encodes relationships between words (nouns, verbs, adjectives, and adverbs) grouped by their part of speech. WordNet was pioneered by the Department of Psychology of Princeton University, United States, and it is currently hosted at the Department of Computer Science of Princeton University. WordNet considers the synonymy between words to evaluate how words are related. The English WordNet currently hosts more than 150,000 words and more than 100,000 synonym groups (that is, synsets). Also, WordNet is not restricted to English: a multitude of wordnets for other languages have been created since its inception and can be viewed at http://globalwordnet.org/wordnets-in-the-world/.
In order to understand how to leverage WordNet, it is important to build a solid grounding in the terminology used in WordNet. First, WordNet uses the term synset to denote a group, or set, of synonyms. Next, each synset has a definition that explains what the synset represents. The synonyms contained within a synset are called lemmas.
In WordNet, word representations are modeled hierarchically, forming a complex graph of associations between synsets. These associations fall into two different categories: is-a relationships and is-made-of relationships. First, we will discuss the is-a association.
For a given synset, there exist two categories of relations: hypernyms and hyponyms. Hypernyms of a synset are the synsets that carry a general (high-level) meaning of the considered synset. For example, vehicle is a hypernym of the synset car. Next, hyponyms are synsets that are more specific than the corresponding synset. For example, Toyota car is a hyponym of the synset car.
Now let's discuss the is-made-of relationships of a synset. Holonyms of a synset are the synsets that represent the whole entity to which the considered synset belongs. For example, a holonym of the tires synset is the cars synset. Meronyms represent the opposite of holonyms: they are the synsets for the parts or substances that make up the corresponding synset. We can visualize this in Figure 3.2:

Figure 3.2: The various associations that exist for a synset
The NLTK library, a Python natural language processing library, can be used to understand WordNet and its mechanisms. The full example is available as an exercise in the ch3_wordnet.ipynb file located in the ch3 folder.
Note
Installing the NLTK Library
To install the NLTK library for Python, you can use the following pip command:
pip install nltk
Alternatively, you can use an IDE (such as PyCharm) to install the library through the Graphical User Interface (GUI). You can find more detailed instructions at http://www.nltk.org/install.html.
To import NLTK into Python and download the WordNet corpus, first import the nltk library:
import nltk
Then you can download the WordNet corpus by running the following command:
nltk.download('wordnet')
After the nltk library is installed and imported, we need to import the WordNet corpus with this command:
from nltk.corpus import wordnet as wn
Then we can query the WordNet corpus as follows:
# Retrieve all the available synsets for the word "car"
word = 'car'
car_syns = wn.synsets(word)

# The definition of each synset of the car synsets
syns_defs = [car_syns[i].definition() for i in range(len(car_syns))]

# Get the lemmas for the first synset
car_lemmas = car_syns[0].lemmas()[:3]

# Let's get the hypernyms for a synset (general superclass)
syn = car_syns[0]
print('\t', syn.hypernyms()[0].name(), '\n')

# Let's get the hyponyms for a synset (specific subclass)
syn = car_syns[0]
print('\t', [hypo.name() for hypo in syn.hyponyms()[:3]], '\n')

# Let's get the part-holonyms for the third "car" synset (the whole it belongs to)
syn = car_syns[2]
print('\t', [holo.name() for holo in syn.part_holonyms()], '\n')

# Let's get the part-meronyms for a synset (the parts it is made of)
syn = car_syns[0]
print('\t', [mero.name() for mero in syn.part_meronyms()[:3]], '\n')
After running the example, the results will look like this:
All the available Synsets for car
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]

Example definitions of available synsets:
car.n.01 : a motor vehicle with four wheels; usually propelled by an internal combustion engine
car.n.02 : a wheeled vehicle adapted to the rails of railroad
car.n.03 : the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant

Example lemmas for the Synset car.n.03
['car', 'auto', 'automobile']

Hypernyms of the Synset car.n.01
motor_vehicle.n.01

Hyponyms of the Synset car.n.01
['ambulance.n.01', 'beach_wagon.n.01', 'bus.n.04']

Holonyms (Part) of the Synset car.n.03
['airship.n.01']

Meronyms (Part) of the Synset car.n.01
['accelerator.n.01', 'air_bag.n.01', 'auto_accessory.n.01']
We can also obtain the similarity between two synsets in the following way. There are several different similarity metrics implemented in NLTK, and you can see them in action on the official website (www.nltk.org/howto/wordnet.html). Here, we use the Wu-Palmer similarity, which measures the similarity between two synsets based on their depth in the hierarchical organization of the synsets. Here, w1_syns and w2_syns are the synset lists retrieved for the two words being compared:
sim = wn.wup_similarity(w1_syns[0], w2_syns[0])
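For completeness, the following is a minimal, self-contained sketch of such a comparison. The choice of the words car and truck is purely illustrative; any two words present in WordNet can be compared this way:
from nltk.corpus import wordnet as wn

# Illustrative word pair; retrieve the synset lists for the two words to compare
w1_syns = wn.synsets('car')
w2_syns = wn.synsets('truck')

# Wu-Palmer similarity between the first synset of each word (a value in (0, 1])
sim = wn.wup_similarity(w1_syns[0], w2_syns[0])
print(sim)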
Problems with WordNet
Though WordNet is an amazing resource that anyone can use to learn the meanings of words for NLP tasks, there are quite a few drawbacks in using WordNet for this purpose. They are as follows:
- Missing nuances is a key problem in WordNet. There are both theoretical and practical reasons why capturing nuances is difficult. From a theoretical perspective, modeling the precise definition of a subtle difference between two entities is not a well-posed or direct task. Practically speaking, defining nuances is subjective. For example, the words want and need have similar meanings, but one of them (need) is more assertive. This is considered to be a nuance.
- Next, WordNet is itself subjective, as it was designed by a relatively small community. Therefore, depending on the problem you are trying to solve, WordNet might be suitable, or you might be able to do better with a looser definition of words.
- There also exists the issue of maintaining WordNet, which is labor-intensive. Maintaining and adding new synsets, definitions, lemmas, and so on, can be very expensive. This adversely affects the scalability of WordNet, as human labor is essential to keep WordNet up to date.
- Developing WordNet for other languages can be costly. There are efforts to build wordnets for other languages and link them with the English WordNet, such as MultiWordNet (MWN), but these efforts are still incomplete.
Next, we will discuss several word representation techniques that do not rely on external resources.
One-hot encoded representation
A simpler way of representing words is to use a one-hot encoded representation. This means that if we have a vocabulary of size V, for each ith word wi, we represent the word wi with a V-long vector [0, 0, 0, …, 0, 1, 0, …, 0, 0, 0] whose ith element is 1 and all other elements are zero. As an example, consider this sentence:
Bob and Mary are good friends.
The one-hot encoded representation for each word might look like this:
Bob: [1,0,0,0,0,0]
and: [0,1,0,0,0,0]
Mary: [0,0,1,0,0,0]
are: [0,0,0,1,0,0]
good: [0,0,0,0,1,0]
friends: [0,0,0,0,0,1]
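As a concrete illustration, here is a minimal sketch that builds these one-hot vectors for the example sentence. The vocabulary ordering shown above is an assumption; any fixed ordering works:
import numpy as np

# Fixed vocabulary ordering for the example sentence
vocab = ['Bob', 'and', 'Mary', 'are', 'good', 'friends']
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    # A V-long vector with a single 1 at the word's index
    vec = np.zeros(vocab_size)
    vec[word_to_id[word]] = 1.0
    return vec

print(one_hot('Mary'))   # [0. 0. 1. 0. 0. 0.]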
However, as you might have already figured out, this representation has many drawbacks.
This representation does not encode the similarity between words in any way and completely ignores the context in which the words are used. Let's consider the dot product between two word vectors as the similarity measure: the more similar two vectors are, the higher their dot product. For example, the representations of the words car and automobile will have a similarity of 0, and car and pencil will also have exactly the same value.
This method also becomes extremely ineffective for large vocabularies. For a typical NLP task, the vocabulary can easily exceed 50,000 words; therefore, the word representation matrix for 50,000 words will be a very sparse 50,000 × 50,000 matrix.
However, one-hot encoding plays an important role even in the state-of-the-art word embedding learning algorithms. We use one-hot encoding to represent words numerically and feed them into neural networks so that the neural networks can learn better and smaller numerical feature representations of the words.
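To see why one-hot vectors are a convenient input format, note that multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is exactly what an embedding lookup in a neural network does. Here is a minimal sketch of this idea; the dimensions and the word index are arbitrary illustrative choices:
import numpy as np

V, d = 6, 3                  # vocabulary size and embedding dimension (illustrative)
W = np.random.randn(V, d)    # weight matrix a neural network would learn

x = np.zeros(V)
x[2] = 1.0                   # one-hot vector for the word with index 2 (e.g. 'Mary')

embedding = x @ W            # multiplying by a one-hot vector selects row 2 of W
print(np.allclose(embedding, W[2]))   # True: this is what an embedding lookup does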
Note
One-hot encoding is also known as a localist representation (opposite of the distributed representation), as the feature representation is decided by the activation of a single element in the vector.
The TF-IDF method
TF-IDF is a frequency-based method that takes into account the frequency with which a word appears in a corpus. It is a word representation in the sense that it represents the importance of a specific word in a given document. Intuitively, the higher the frequency of a word, the more important that word is in the document. For example, in a document about cats, the word cats will appear more often. However, just calculating the frequency would not work, because words such as this and is are very frequent but do not carry much information. TF-IDF takes this into consideration and assigns a value close to zero to such common words.
Again, TF stands for term frequency and IDF stands for inverse document frequency:
TF(wi) = (number of times wi appears in the document) / (total number of words in the document)
IDF(wi) = log(total number of documents / number of documents containing wi)
TF-IDF(wi) = TF(wi) × IDF(wi)
Let's do a quick exercise. Consider two documents:
- Document 1: This is about cats. Cats are great companions.
- Document 2: This is about dogs. Dogs are very loyal.
Now let's crunch some numbers (using base-10 logarithms):
TF-IDF (cats, doc1) = (2/8) * log(2/1) = 0.075
TF-IDF (this, doc2) = (1/8) * log(2/2) = 0.0
Therefore, the word cats is informative while this is not. This is the desired behavior we needed in terms of measuring the importance of words.
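As a quick sanity check, here is a minimal sketch that computes these TF-IDF values directly from the two example documents. The tokenization (lowercasing and stripping the periods) is an illustrative assumption:
import math

doc1 = "This is about cats. Cats are great companions.".lower().replace('.', '').split()
doc2 = "This is about dogs. Dogs are very loyal.".lower().replace('.', '').split()
docs = [doc1, doc2]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)                        # term frequency
    n_docs_with_word = sum(1 for d in docs if word in d)   # document frequency
    idf = math.log10(len(docs) / n_docs_with_word)         # inverse document frequency
    return tf * idf

print(tf_idf('cats', doc1, docs))   # ~0.075
print(tf_idf('this', doc2, docs))   # 0.0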
Co-occurrence matrix
Co-occurrence matrices, unlike the one-hot encoded representation, encode the context information of words, but they require maintaining a V × V matrix. To understand the co-occurrence matrix, let's take two example sentences:
- Jerry and Mary are friends.
- Jerry buys flowers for Mary.
The co-occurrence matrix will look like the following matrix. We only show one triangle of the matrix, as the matrix is symmetric:
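The following is a minimal sketch that constructs this matrix for the two example sentences, assuming a symmetric context window of size 1; the whitespace tokenization and the alphabetical ordering of the vocabulary are illustrative choices:
import numpy as np

sentences = [
    ['Jerry', 'and', 'Mary', 'are', 'friends'],
    ['Jerry', 'buys', 'flowers', 'for', 'Mary'],
]

# Build the vocabulary and an index for each word
vocab = sorted({w for sent in sentences for w in sent})
word_to_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# V x V matrix counting how often two words appear next to each other
# (context window of size 1, so only immediately adjacent words are counted)
M = np.zeros((V, V))
for sent in sentences:
    for w1, w2 in zip(sent[:-1], sent[1:]):
        M[word_to_id[w1], word_to_id[w2]] += 1
        M[word_to_id[w2], word_to_id[w1]] += 1   # keep the matrix symmetric

print(vocab)
print(M)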

However, it is not hard to see that maintaining such a co-occurrence matrix comes at a cost, as the size of the matrix grows quadratically with the size of the vocabulary. Furthermore, it is not straightforward to incorporate a context window size larger than 1. One option is to use a weighted count, where the weight of a word in the context decays with its distance from the word of interest.
All these drawbacks motivate us to investigate more principled, robust, and scalable ways of learning and inferring meanings (that is, representations) of words.
Word2vec is a recently introduced distributed word representation learning technique that is currently used as a feature engineering technique for many NLP tasks (for example, machine translation, chatbots, and image caption generators). Essentially, Word2vec learns word representations by looking at the surrounding words (that is, the context) in which a word is used. More specifically, we attempt to predict the context given a word (or vice versa) through a neural network, which forces the neural network to learn good word embeddings. We will discuss this method in detail in the next section. The Word2vec approach has many advantages over the previously described methods. They are as follows:
- The Word2vec approach does not depend on subjective human knowledge of language, as the WordNet-based approach does.
- The size of a Word2vec representation vector is independent of the vocabulary size, unlike the one-hot encoded representation or the word co-occurrence matrix.
- Word2vec is a distributed representation. Unlike a localist representation, where the representation depends on the activation of a single element of the representation vector (for example, one-hot encoding), a distributed representation depends on the activation pattern of all the elements in the vector. This gives Word2vec more expressive power than the one-hot encoded representation.
In the following section, we will first develop some intuitive feeling about learning word embeddings by working through an example. Then we will define a loss function so that we can use machine learning to learn word embeddings. Also, we will discuss two Word2vec algorithms, namely, the skip-gram and Continuous Bag-of-Words (CBOW) algorithms.