- Python Machine Learning By Example
- Yuxi (Hayden) Liu
- 2021-07-02 12:41:39
Corpus
As of 2018, NLTK comes with over 100 collections of large and well-structured text datasets, which are called corpora in NLP. Corpora can be used as dictionaries for checking word occurrences and as training pools for learning and validating models. Some useful and interesting corpora include the Web Text corpus, Twitter samples, the Shakespeare corpus sample, Sentiment Polarity, the Names corpus (it contains lists of popular names, which we will explore shortly), WordNet, and the Reuters benchmark corpus. The full list can be found at http://www.nltk.org/nltk_data. Before using any of these corpus resources, we first need to download them by running the following code in the Python interpreter:
>>> import nltk
>>> nltk.download()
A new window will pop up and ask us which collections (the Collections tab in the following screenshot) or corpora (the Corpora tab in the following screenshot) to download, and where to keep the data:

Installing the whole popular package is the quickest option, since it contains all the important corpora needed for your current study and future research. Installing a particular corpus, as shown in the following screenshot, is also fine:
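If the pop-up window is inconvenient (for example, on a machine without a display), the download can also be scripted. The following is a minimal sketch, assuming that passing a resource identifier such as 'names' to nltk.download() fetches that resource without opening the window. Note that ensure_corpus is a hypothetical helper name of ours, not part of NLTK:

```python
import nltk


def ensure_corpus(corpus_id):
    """Download an NLTK corpus only if it is not already installed.

    Hypothetical helper for illustration; it is not part of NLTK.
    Returns True when the corpus is available afterwards.
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(f"corpora/{corpus_id}")
        return True
    except LookupError:
        # Passing an identifier skips the interactive window;
        # quiet=True suppresses the progress messages.
        return nltk.download(corpus_id, quiet=True)
```

Calling ensure_corpus('names') before importing the Names corpus should then make a script safe to rerun, since the download is skipped when the data is already on disk.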

Once the package or corpus we want to explore is installed, we can take a look at the Names corpus (make sure it is installed).
First, import the names corpus:
>>> from nltk.corpus import names
We can check out the first 10 names in the list:
>>> print(names.words()[:10])
['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie',
'Abby', 'Abigael', 'Abigail', 'Abigale']
There are 7,944 names in total, as the following command shows:
>>> print(len(names.words()))
7944
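A corpus reader can also report how its words are split across the underlying files. The following is a small sketch, assuming the Names corpus is installed; count_by_file is a hypothetical helper name of ours, while fileids() and words() are methods of the NLTK corpus reader. In the NLTK distributions we are aware of, the Names corpus is stored as two files, female.txt and male.txt:

```python
from nltk.corpus import names


def count_by_file(corpus):
    """Return a {file id: word count} mapping for an NLTK corpus reader.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    return {fileid: len(corpus.words(fileid)) for fileid in corpus.fileids()}


if __name__ == "__main__":
    try:
        # Prints one count per file in the Names corpus.
        print(count_by_file(names))
    except LookupError:
        # Raised by NLTK when the corpus data has not been downloaded.
        print("Names corpus not found; download it with nltk.download('names').")
```

The per-file counts should add up to the total returned by len(names.words()).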
Other corpora are also fun to explore.
Besides offering abundant and easy-to-use corpora, and more importantly, NLTK is also good at many NLP and text analysis tasks, including tokenization, PoS tagging, named entity recognition, word stemming, and lemmatization.
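As a quick taste of two of these tasks, the following sketch tokenizes a sentence and stems each token. It is only an illustration under stated assumptions: the sample sentence is ours, and we pick wordpunct_tokenize (a simple regular-expression tokenizer) and PorterStemmer because, as far as we know, neither requires any downloaded data. Other NLTK tokenizers and stemmers may behave differently:

```python
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Sample sentence made up for this illustration.
text = "Machines are learning from labelled examples."

# Tokenization: split the text into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Stemming: reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

Here, for example, the token 'learning' is reduced to the stem 'learn'. Lemmatization works on the same idea but looks words up in WordNet, so it needs that corpus to be installed first.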