
NLP for machine learning

Unlike humans, computers do not understand text – at least not in the same way that we do. In order to create machine learning models that are able to learn from data, we must first learn to represent natural language in a way that computers are able to process.

When we discussed machine learning fundamentals, you may have noticed that loss functions all operate on numerical data, since a loss can only be minimized if it is computed from numbers. Because of this, we wish to represent our text in a numerical format that can form the basis of our input into a neural network. Here, we will cover a couple of basic ways of numerically representing our data.

Bag-of-words

The first and simplest way of representing text is a bag-of-words representation. This method counts how many times each word appears in a given sentence or document. These counts are then transformed into a vector in which each element is the number of times a particular word from the corpus appears within the sentence. The corpus here is simply all the words that appear across all the sentences/documents being analyzed (its set of distinct words is often called the vocabulary). Take the following two sentences:

The cat sat on the mat

The dog sat on the cat

We can represent each of these sentences as a count of words:

Figure 1.15 – Table of word counts

Then, we can transform these into individual vectors:

The cat sat on the mat -> [2,1,0,1,1,1]

The dog sat on the cat -> [2,1,1,1,1,0]

This numeric representation could then be used as the input features to a machine learning model, where the count vector for each sentence serves as its feature vector.
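The counting step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the function name `bag_of_words` is hypothetical, the text is lowercased and split on whitespace only, and the column order of `vocab` is an assumption inferred from the two vectors shown above (the, cat, dog, sat, on, mat).

```python
from collections import Counter

# Assumed column order, inferred from the vectors shown in the text.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]

def bag_of_words(sentence, vocab):
    """Return one count per vocabulary word for the given sentence."""
    # Lowercase and split on whitespace; real tokenizers do much more.
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for words that never appear in the sentence.
    return [counts[word] for word in vocab]

sentences = ["The cat sat on the mat", "The dog sat on the cat"]
vectors = [bag_of_words(s, vocab) for s in sentences]

for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)
```

Note that any two sentences containing the same words in a different order produce the same vector, which is the limitation the sequential representation addresses.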

Sequential representation

We will see later in this book that more complex neural network models, including RNNs and LSTMs, do not just take a single vector as input, but can take a whole sequence of vectors in the form of a matrix. Because of this, to better capture the order of words, and thus the meaning of a sentence, we can represent the sentence as a sequence of one-hot encoded vectors, one vector per word:

Figure 1.16 – One-hot encoded vectors
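As a rough sketch of this idea, the following plain-Python snippet turns a sentence into a matrix with one one-hot row per word. The function name `one_hot_sequence` and the vocabulary order are assumptions made for illustration (the same six words as in the bag-of-words example), and words outside the vocabulary are not handled.

```python
# Assumed vocabulary order, chosen for illustration.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]

def one_hot_sequence(sentence, vocab):
    """Return a list of one-hot rows, one per word, in sentence order."""
    index = {word: i for i, word in enumerate(vocab)}
    sequence = []
    for word in sentence.lower().split():
        row = [0] * len(vocab)
        # Raises KeyError for a word that is not in the vocabulary.
        row[index[word]] = 1
        sequence.append(row)
    return sequence

matrix = one_hot_sequence("The cat sat on the mat", vocab)
for row in matrix:
    print(row)

# Summing each column discards word order and recovers the word counts.
column_sums = [sum(column) for column in zip(*matrix)]
print(column_sums)
```

Each row contains a single 1 marking which word occupies that position, so the matrix preserves word order. Summing its columns collapses it back into the bag-of-words counts, which shows that the ordering is exactly the extra information this representation carries.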
