- Hands-On Natural Language Processing with PyTorch 1.x
- Thomas Dop
NLP for machine learning
Unlike humans, computers do not understand text – at least not in the same way that we do. In order to create machine learning models that are able to learn from data, we must first learn to represent natural language in a way that computers are able to process.
When we discussed machine learning fundamentals, you may have noticed that loss functions all operate on numerical data in order to minimize loss. Because of this, we wish to represent our text in a numerical format that can form the basis of our input into a neural network. Here, we will cover a couple of basic ways of numerically representing our data.
Bag-of-words
The first and simplest way of representing text is a bag-of-words representation. This method simply counts the occurrences of each word in a given sentence or document. These counts are then transformed into a vector, where each element is the number of times a given word from the corpus appears within the sentence. The corpus is simply the set of all words that appear across all the sentences/documents being analyzed. Take the following two sentences:
The cat sat on the mat
The dog sat on the cat
We can represent each of these sentences as a count of words:
|                        | the | cat | dog | sat | on | mat |
|------------------------|-----|-----|-----|-----|----|-----|
| The cat sat on the mat | 2   | 1   | 0   | 1   | 1  | 1   |
| The dog sat on the cat | 2   | 1   | 1   | 1   | 1  | 0   |

Figure 1.15 – Table of word counts
Then, we can transform these into individual vectors:
The cat sat on the mat -> [2,1,0,1,1,1]
The dog sat on the cat -> [2,1,1,1,1,0]
This numeric representation could then be used as the input feature vector to a machine learning model.
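The following is a minimal sketch (illustrative only, not the book's own code) of how such count vectors might be built in Python. The hard-coded vocabulary order is an assumption chosen to match the vectors shown above:

```python
from collections import Counter

# Vocabulary order is assumed here so that the output matches the example vectors.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]

def bag_of_words(sentence, vocab):
    """Count how many times each vocabulary word appears in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

for sentence in ["The cat sat on the mat", "The dog sat on the cat"]:
    print(sentence, "->", bag_of_words(sentence, vocab))
# The cat sat on the mat -> [2, 1, 0, 1, 1, 1]
# The dog sat on the cat -> [2, 1, 1, 1, 1, 0]
```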
Sequential representation
We will see later in this book that more complex neural network models, including RNNs and LSTMs, do not just take a single vector as input, but can take a whole sequence of vectors in the form of a matrix. Therefore, to better capture the order of words, and thus the meaning of a sentence, we can represent text as a sequence of one-hot encoded vectors:

Figure 1.16 – One-hot encoded vectors
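As a rough sketch (assuming the same six-word vocabulary as above; this is not the book's code), each word can be mapped to an index and then one-hot encoded, so a sentence becomes a matrix with one row per word:

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary and word-to-index mapping, matching the bag-of-words example.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = "The cat sat on the mat"
indices = torch.tensor([word_to_index[w] for w in sentence.lower().split()])

# Shape: (sequence_length, vocabulary_size); each row is one word's one-hot vector,
# and the row order preserves the order of the words in the sentence.
one_hot = F.one_hot(indices, num_classes=len(vocab))
print(one_hot)
```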