- Hands-On Natural Language Processing with PyTorch 1.x
- Thomas Dop
NLP for machine learning
Unlike humans, computers do not understand text – at least not in the same way that we do. In order to create machine learning models that are able to learn from data, we must first learn to represent natural language in a way that computers are able to process.
When we discussed machine learning fundamentals, you may have noticed that loss functions all operate on numerical data in order to minimize loss. Because of this, we wish to represent our text in a numerical format that can form the basis of our input into a neural network. Here, we will cover a couple of basic ways of numerically representing our data.
Bag-of-words
The first and simplest way of representing text is with a bag-of-words representation. This method simply counts the occurrences of each word in a given sentence or document. These counts are then transformed into a vector, where each element is the number of times the corresponding word from the corpus appears in the sentence. The corpus is simply the set of all words that appear across all the sentences/documents being analyzed. Take the following two sentences:
The cat sat on the mat
The dog sat on the cat
We can represent each of these sentences as a count of words:
|                        | the | cat | dog | sat | on | mat |
|------------------------|-----|-----|-----|-----|----|-----|
| The cat sat on the mat | 2   | 1   | 0   | 1   | 1  | 1   |
| The dog sat on the cat | 2   | 1   | 1   | 1   | 1  | 0   |

Figure 1.15 – Table of word counts
Then, we can transform these into individual vectors:
The cat sat on the mat -> [2,1,0,1,1,1]
The dog sat on the cat -> [2,1,1,1,1,0]
This numeric representation could then be used as the input feature vector to a machine learning model.
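The counting procedure above can be sketched in a few lines of Python. Note that this sketch sorts the vocabulary alphabetically for reproducibility, so the column order differs from the word order used in the vectors above, but the counts per word are the same:

```python
from collections import Counter

def bag_of_words(sentences):
    """Build bag-of-words count vectors for a list of sentences.

    The corpus (vocabulary) is the set of all lowercased words seen
    across the sentences; each sentence becomes a vector of counts
    over that shared vocabulary.
    """
    tokenized = [s.lower().split() for s in sentences]
    # Fix the vocabulary order so every vector lines up column by column
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["The cat sat on the mat",
                               "The dog sat on the cat"])
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [1, 1, 0, 1, 1, 2]]
```

Because every sentence is mapped onto the same fixed vocabulary, the resulting vectors all have the same length and can be stacked directly into a feature matrix for a model.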
Sequential representation
We will see later in this book that more complex neural network models, including RNNs and LSTMs, do not take just a single vector as input; they can take a whole sequence of vectors in the form of a matrix. To better capture the order of words, and thus the meaning of a sentence, we can represent it as a sequence of one-hot encoded vectors:

Figure 1.16 – One-hot encoded vectors
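A minimal sketch of this sequential representation, using plain Python (the same idea is available in PyTorch as `torch.nn.functional.one_hot`). Each word position becomes one row of the matrix, so word order is preserved, unlike in a bag-of-words vector:

```python
def one_hot_sequence(sentence, vocab):
    """Represent a sentence as a sequence (matrix) of one-hot vectors.

    Each row corresponds to one word position in the sentence, so the
    model sees the words in order rather than as an unordered count.
    """
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for word in sentence.lower().split():
        row = [0] * len(vocab)   # a zero vector the size of the vocabulary
        row[index[word]] = 1     # set a 1 at this word's position
        matrix.append(row)
    return matrix

vocab = ['cat', 'dog', 'mat', 'on', 'sat', 'the']
matrix = one_hot_sequence("The cat sat on the mat", vocab)
for row in matrix:
    print(row)
# [0, 0, 0, 0, 0, 1]   <- the
# [1, 0, 0, 0, 0, 0]   <- cat
# [0, 0, 0, 0, 1, 0]   <- sat
# [0, 0, 0, 1, 0, 0]   <- on
# [0, 0, 0, 0, 0, 1]   <- the
# [0, 0, 1, 0, 0, 0]   <- mat
```

The sentence becomes a 6×6 matrix here (six words by six vocabulary entries), which is exactly the kind of sequence-of-vectors input that RNNs and LSTMs consume.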