官术网_书友最值得收藏!

PoS tagging

We can apply an off-the-shelf tagger from NLTK or combine multiple taggers to customize the tagging process. It is easy to directly use the built-in tagging function, pos_tag, as in: pos_tag(input_tokens), for instance. But behind the scene, it is actually a prediction from a pre-built supervised learning model. The model is trained based on a large corpus composed of words that are correctly tagged.

Reusing an earlier example, we can perform PoS tagging as follows:

>>> import nltk
>>> tokens = word_tokenize(sent)
>>> print(nltk.pos_tag(tokens))
[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('Python', 'NNP'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('By', 'IN'), ('Example', 'NNP'), (',', ','), ('2nd', 'CD'), ('edition', 'NN'), ('.', '.')]

The PoS tag following each token is returned. We can check the meaning of a tag using the help function. Looking up PRP and VBP, for example, gives us the following output:

>>> nltk.help.upenn_tagset('PRP')
PRP: pronoun, personal
hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
>>> nltk.help.upenn_tagset('VBP')
VBP: verb, present tense, not 3rd person singular
predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...

In spaCy, getting a PoS tag is also easy. The token object parsed from an input sentence has an attribute called pos_, which is the tag we are looking for:

>>> print([(token.text, token.pos_) for token in tokens2])
[('I', 'PRON'), ('have', 'VERB'), ('been', 'VERB'), ('to', 'ADP'), ('U.K.', 'PROPN'), ('and', 'CCONJ'), ('U.S.A.', 'PROPN')]
主站蜘蛛池模板: 江永县| 咸丰县| 西青区| 澄江县| 邯郸县| 新丰县| 米林县| 台州市| 上蔡县| 自贡市| 峨眉山市| 庆阳市| 乌什县| 阜城县| 宝坻区| 玛曲县| 宜宾县| 罗定市| 商城县| 界首市| 正蓝旗| 郧西县| 铁力市| 磴口县| 泽库县| 贞丰县| 杨浦区| 巨鹿县| 营山县| 仪征市| 文化| 蒲城县| 博爱县| 克拉玛依市| 通渭县| 嫩江县| 左权县| 宝丰县| 柳林县| 麻栗坡县| 泰兴市|