Text corpus or corpora

The language data that all NLP tasks depend on is called the text corpus, or simply the corpus (plural: corpora). A corpus is a large collection of text in a language such as English or French. It can consist of a single document or many documents. Sources include social networking sites like Twitter, blogs, open discussion forums like Stack Overflow, books, and many others. Some tasks, such as machine translation, require a multilingual corpus. For example, we might need both the English and French versions of the same document to develop a machine translation model. Speech tasks additionally require human voice recordings and the corresponding transcriptions.

In most of the later chapters, we will use text corpora and speech recordings available from the internet or open source data repositories. For many NLP tasks, the corpus is split into chunks for further analysis. These chunks can be at the paragraph, sentence, or word level. We will touch upon these in the following sections.
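
To make the three chunk levels concrete, here is a minimal sketch that splits a tiny in-memory corpus into paragraphs, sentences, and words using only Python's standard library. The helper names (`split_paragraphs`, `split_sentences`, `split_words`) and the splitting rules are assumptions made for illustration, not the approach used later in the book; real text usually needs a proper tokenizer that handles abbreviations, numbers, and other edge cases.

```python
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    # This will wrongly split on abbreviations such as "Dr. Smith".
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def split_words(sentence):
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)


# A hypothetical two-paragraph corpus.
corpus = (
    "A corpus is a collection of text. It can hold one document or many.\n"
    "\n"
    "Chunks can be paragraphs, sentences, or words."
)

paragraphs = split_paragraphs(corpus)
sentences = [split_sentences(p) for p in paragraphs]
words = split_words(sentences[0][0])

print(len(paragraphs))  # number of paragraph-level chunks
print(sentences[0])     # sentence-level chunks of the first paragraph
print(words)            # word-level chunks of the first sentence
```

Each level is built on the one above it, so the same corpus can be analyzed at whichever granularity a task calls for.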
