
Text corpus or corpora

The language data that all NLP tasks depend upon is called the text corpus, or simply the corpus (plural: corpora). A corpus is a large set of text data in a language such as English or French. It can consist of a single document or a collection of documents. Sources of text corpora include social network sites like Twitter, blog sites, open discussion forums like Stack Overflow, books, and several others. Some tasks, such as machine translation, require a multilingual corpus. For example, we might need both the English and French versions of the same document content to develop a machine translation model. For speech tasks, we would also need human voice recordings and the corresponding transcribed corpus.

In most of the later chapters, we will be using text corpora and speech recordings available from the internet or open source data repositories. For many NLP tasks, the corpus is split into chunks for further analysis. These chunks can be at the paragraph, sentence, or word level. We will touch upon these in the following sections.
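
To make the three chunk levels concrete, here is a minimal sketch in plain Python. It is an illustration only, not the method used in later chapters: the helper names (split_paragraphs, split_sentences, split_words) and the sample corpus are hypothetical, and the splitting rules are deliberately naive assumptions (blank lines separate paragraphs, sentences end at ., ! or ? followed by whitespace, and punctuation is dropped from words). Abbreviations such as "e.g." would be split incorrectly, which is one reason dedicated tokenizers are normally preferred in practice.

```python
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Naive rule: break at whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def split_words(sentence):
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)


# A tiny made-up corpus with two paragraphs (for illustration only).
corpus = (
    "NLP depends on text data. A corpus can be one document or many.\n"
    "\n"
    "Corpora are often split into chunks. Chunks can be words!"
)

paragraphs = split_paragraphs(corpus)                    # paragraph level
sentences = [split_sentences(p) for p in paragraphs]     # sentence level
words = split_words(sentences[0][0])                     # word level

print(len(paragraphs))
print(sentences[0])
print(words)
```

Each function works on the output of the previous level, so the same corpus can be viewed as a list of paragraphs, a list of sentences per paragraph, or a list of words per sentence, depending on what the task needs.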
