
NLP for machine learning

Unlike humans, computers do not understand text – at least not in the same way that we do. In order to create machine learning models that are able to learn from data, we must first learn to represent natural language in a way that computers are able to process.

When we discussed machine learning fundamentals, you may have noticed that loss functions all operate on numerical data, since a loss must be a number in order to be minimized. Because of this, we wish to represent our text in a numerical format that can form the basis of our input into a neural network. Here, we will cover a couple of basic ways of numerically representing our data.

Bag-of-words

The first and simplest way of representing text is a bag-of-words representation. This method simply counts how many times each word occurs in a given sentence or document. These counts are then transformed into a vector in which each element is the number of times a particular word from the corpus appears within the sentence. The corpus here is all the sentences/documents being analyzed, and its set of distinct words (the vocabulary) determines the length of the vector. Take the following two sentences:

The cat sat on the mat

The dog sat on the cat

We can represent each of these sentences as a count of words:

Figure 1.15 – Table of word counts

Then, we can transform these into individual vectors:

The cat sat on the mat -> [2,1,0,1,1,1]

The dog sat on the cat -> [2,1,1,1,1,0]

This numeric representation could then be used as the input features to a machine learning model, with each sentence's count vector serving as its feature vector.
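As a minimal sketch of the counting step, the following Python builds the two vectors shown above. The vocabulary order (the, cat, dog, sat, on, mat) is inferred from those vectors rather than stated in the text, and the helper names `tokenize` and `bag_of_words` are hypothetical, chosen only for illustration. Lowercasing and splitting on whitespace is an assumed, deliberately simple tokenization.

```python
from collections import Counter

sentences = ["The cat sat on the mat", "The dog sat on the cat"]

# Assumed vocabulary order, inferred from the example vectors in the text.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return sentence.lower().split()


def bag_of_words(sentence, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(sentence))
    # Counter returns 0 for words that do not occur in the sentence.
    return [counts[word] for word in vocab]


vectors = [bag_of_words(s, vocab) for s in sentences]

for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)
```

Note that the two vectors have the same length regardless of sentence length, and that word order is discarded: any reordering of the words in a sentence produces the same vector.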

Sequential representation

We will see later in this book that more complex neural network models, including RNNs and LSTMs, do not just take a single vector as input, but can take a whole sequence of vectors in the form of a matrix. Because of this, to better capture the order of words, and thus the meaning of a sentence, we can represent the sentence as a sequence of one-hot encoded vectors, one per word:

Figure 1.16 – One-hot encoded vectors
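A sequence of one-hot vectors can be sketched in plain Python as follows. This is an illustration under assumptions, not the book's implementation: it reuses the vocabulary order from the bag-of-words example (the, cat, dog, sat, on, mat), and the function name `one_hot_sequence` is hypothetical. Each row of the resulting matrix corresponds to one word of the sentence, in order, with a single 1 at that word's vocabulary index.

```python
# Assumed vocabulary order, matching the bag-of-words example.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]


def one_hot_sequence(sentence, vocab):
    """Encode a sentence as a list of one-hot rows, one row per word."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for token in sentence.lower().split():
        row = [0] * len(vocab)
        row[index[token]] = 1  # mark the position of this word
        matrix.append(row)
    return matrix


matrix = one_hot_sequence("The cat sat on the mat", vocab)

for row in matrix:
    print(row)

# Summing each column discards word order and gives the word counts.
column_sums = [sum(column) for column in zip(*matrix)]
print(column_sums)
```

Unlike the bag-of-words vector, the matrix has one row per word, so its shape depends on the sentence length and the order of the rows preserves the order of the words. Summing the columns recovers the bag-of-words counts for the same sentence.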
