- Hands-On Natural Language Processing with Python
- Rajesh Arumugam, Rajalingappaa Shanmugamani
- 2021-08-13 16:01:37
Text corpus or corpora
The language data that all NLP tasks depend on is called the text corpus, or simply the corpus (plural: corpora). A corpus is a large collection of text in a language such as English or French. It can consist of a single document or many documents. Sources include social networks like Twitter, blogs, open discussion forums like Stack Overflow, books, and several others. Some tasks, such as machine translation, require a multilingual corpus. For example, we might need both the English and French versions of the same document to develop a machine translation model. For speech tasks, we would also need human voice recordings and the corresponding transcriptions.
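As a rough illustration of the multilingual case, the sketch below pairs English and French sentences from the same document into a small parallel corpus. The `SentencePair` type, the variable names, and the sample sentences are hypothetical and chosen only for this example. It also assumes the two versions are already aligned sentence by sentence, which real data often is not.

```python
from typing import NamedTuple


class SentencePair(NamedTuple):
    """One aligned entry of a hypothetical English-French parallel corpus."""
    english: str
    french: str


# Illustrative sample data: the same two sentences in both languages.
english_doc = ["The cat sleeps.", "I like books."]
french_doc = ["Le chat dort.", "J'aime les livres."]

# Assumption: sentence i in one list is the translation of sentence i in the other.
if len(english_doc) != len(french_doc):
    raise ValueError("The two documents must have the same number of sentences")

parallel_corpus = [SentencePair(en, fr) for en, fr in zip(english_doc, french_doc)]

for pair in parallel_corpus:
    print(f"{pair.english} <-> {pair.french}")
```

Keeping each pair together in one record makes it harder to shuffle or filter one language without the other.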
In most of the later chapters, we will use text corpora and speech recordings available from the internet or open source data repositories. For many NLP tasks, the corpus is split into chunks for further analysis. These chunks can be at the paragraph, sentence, or word level. We will touch upon these in the following sections.
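The three chunk levels can be sketched with nothing but the standard library. This is a minimal, deliberately naive example: the sample text and the regular expressions are assumptions made for illustration, and the sentence rule (split after `.`, `!`, or `?`) will mis-handle abbreviations such as "Dr." that a dedicated NLP tokenizer would treat correctly.

```python
import re

# Illustrative two-paragraph corpus; a blank line separates the paragraphs.
corpus = (
    "A corpus is a large set of text data. It can hold one document or many.\n\n"
    "For many NLP tasks, the corpus is split into chunks. "
    "These chunks can be paragraphs, sentences, or words."
)

# Paragraph level: split on blank lines.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", corpus) if p.strip()]

# Sentence level: naive split on whitespace that follows ., !, or ?
sentences = [
    s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s
]

# Word level: runs of letters, digits, or apostrophes, one list per sentence.
words = [re.findall(r"[A-Za-z0-9']+", s) for s in sentences]

print(len(paragraphs), "paragraphs")
print(len(sentences), "sentences")
print(words[0])
```

Each level is derived from the one above it, so the paragraph and sentence boundaries stay available after the text has been reduced to words.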