
What is POS tagging?

A part of speech (POS) is the syntactic or grammatical category that a word belongs to in a sentence. In English, the main parts of speech are nouns, pronouns, adjectives, verbs, adverbs, prepositions, determiners, and conjunctions. POS tagging is the task of assigning one of these categories to each word or token in a text. NLTK provides both a set of tagged text corpora and a set of trainable POS taggers for creating custom taggers. The most commonly used tagged datasets in NLTK are the Penn Treebank and the Brown Corpus. The Penn Treebank is a parsed collection of texts drawn from journal articles, telephone conversations, and so on. Similarly, the Brown Corpus consists of text from 15 different categories of articles (science, politics, religion, sports, and so on). Both corpora are tagged at a very fine granularity, whereas many applications need only the following universal tag set:

  • VERB: Verbs (all tenses and modes)
  • NOUN: Nouns (common and proper)
  • PRON: Pronouns
  • ADJ: Adjectives
  • ADV: Adverbs
  • ADP: Adpositions (prepositions and postpositions)
  • CONJ: Conjunctions
  • DET: Determiners
  • NUM: Cardinal numbers
  • PRT: Particles or other function words
  • X: Other (foreign words, typos, abbreviations)
  • .: Punctuation
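
To make the tagging task concrete, the following is a minimal sketch of a lookup tagger that pairs each token with a universal tag. The LEXICON dictionary, the lookup_tag function, and the fallback to X are assumptions made for illustration; they are not part of NLTK, and a practical tagger would be trained on a tagged corpus and would use context rather than a fixed word list.

```python
# Hypothetical mini-lexicon for illustration only; a real tagger learns
# its word-to-tag associations from a tagged corpus.
LEXICON = {
    "the": "DET",
    "city": "NOUN",
    "executive": "ADJ",
    "committee": "NOUN",
    "had": "VERB",
    ",": ".",
}


def lookup_tag(tokens, lexicon=LEXICON, default="X"):
    """Pair every token with a universal tag, falling back to `default`."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]


tagged = lookup_tag(["The", "City", "Executive", "Committee", "had"])
print(tagged)
```

Because the lookup ignores context, a word that can act as more than one part of speech always receives the same tag here, which is one reason trained taggers are preferred in practice.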

NLTK also provides mapping from a tagged corpus (such as the Brown Corpus) to the universal tags, as shown in the following code. The Brown Corpus has a finer granularity of POS tags than the universal tag set. For example, the tags VBD (for past tense verb) and VB (for base form verb) map to just VERB in the universal tag set:

>>> # Requires the corpus data: nltk.download('brown') and nltk.download('universal_tagset')
>>> from nltk.corpus import brown
>>> brown.tagged_words()[30:40]
[('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD')]
>>> brown.tagged_words(tagset='universal')[30:40]
[('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB')]

Here, you can see that the word City is tagged as NN-TL in the Brown Corpus, that is, a noun (NN) appearing in the context of a title (TL). This is mapped to NOUN in the universal tag set. Some NLP tasks may need these fine-grained categories rather than the general universal tags.
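
The collapse from fine-grained to universal tags can be pictured as a many-to-one dictionary lookup. The sketch below is a hand-written illustration, not NLTK's own mapping table: BROWN_TO_UNIVERSAL contains only the tag pairs that appear in the examples above, and to_universal and its X fallback are hypothetical names chosen for this example.

```python
# Hand-written illustration: these pairs are copied from the Brown examples
# in this section and are NOT NLTK's complete mapping table.
BROWN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NN-TL": "NOUN",
    "JJ-TL": "ADJ",
    "VB": "VERB", "VBD": "VERB", "HVD": "VERB",
    "AT": "DET", "WDT": "DET",
    "CS": "ADP",
    ",": ".",
}


def to_universal(tagged_words, mapping=BROWN_TO_UNIVERSAL, default="X"):
    """Replace each fine-grained tag with its coarse tag.

    Tags missing from `mapping` are replaced by `default`.
    """
    return [(word, mapping.get(tag, default)) for word, tag in tagged_words]


fine = [("the", "AT"), ("City", "NN-TL"), ("Executive", "JJ-TL"),
        ("Committee", "NN-TL"), ("had", "HVD")]
coarse = to_universal(fine)
print(coarse)
```

The mapping only goes one way: once NN-TL has been reduced to NOUN, the information that the word appeared in a title cannot be recovered, so tasks that need that detail should keep the original tags.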
