Removing stop words

Commonly used words in English, such as the, is, and he, are generally called stop words. Other languages have similar commonly used words that fall into the same category. Stop word removal is another common preprocessing step in NLP applications. In this step, we remove words that carry little meaning on their own, such as articles and pronouns; examples are a, an, he, and her. Because these words occur frequently throughout almost any text, they usually contribute little to NLP tasks such as text categorization or search. Let us look at a sample of English stop words in the following code:

>>> from nltk.corpus import stopwords
>>> sw_l = stopwords.words('english')
>>> sw_l[20:40]
['himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this']

The preceding output shows only a sample of the English stop words, as we have printed just 20 items of the list (indices 20 through 39). We will look at how these words can be removed from a text in the following code:

>>> example_text = "This is an example sentence to test stopwords"
>>> example_text_without_stopwords = [word for word in example_text.split() if word not in sw_l]
>>> example_text_without_stopwords
['This', 'example', 'sentence', 'test', 'stopwords']

As you can see, some of the stop words, such as is, an, and to, are removed. Note that This is retained: the comparison is case-sensitive, and the NLTK list contains only lowercase words. NLTK provides stop word corpora for 21 languages in addition to English, which is the language used in the examples here. As another example, we can look at the percentage of stop words in a specific text corpus, using the following code:

>>> from nltk.corpus import gutenberg
>>> words_in_hamlet = gutenberg.words('shakespeare-hamlet.txt')
>>> words_in_hamlet_without_sw = [word for word in words_in_hamlet if word not in sw_l]
>>> len(words_in_hamlet_without_sw)*100.0/len(words_in_hamlet)
69.26124197002142

The preceding code computes the percentage of words that remain after stop word removal (about 69%). In other words, a significant share (approximately 30%) of the text in Shakespeare's Hamlet consists of stop words. In many NLP tasks, these stop words do not carry much significance and can therefore be removed during preprocessing.
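
The examples above compare words exactly as they appear, so a capitalized stop word such as This is kept. The following is a minimal sketch of a case-insensitive variant that also reports the stop word percentage directly. The helper names remove_stop_words and stop_word_percentage, as well as the small SAMPLE_STOP_WORDS set, are assumptions made for illustration and are not part of NLTK; in practice, the list returned by stopwords.words('english') could be passed in instead.

```python
# Illustrative subset only (an assumption for this sketch); this is NOT the
# NLTK stop word list. stopwords.words('english') could be passed in instead.
SAMPLE_STOP_WORDS = {"this", "is", "an", "to", "the", "a", "he", "her"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    # Converting to a set makes each membership test fast on average.
    stop_set = set(stop_words)
    return [tok for tok in tokens if tok.lower() not in stop_set]


def stop_word_percentage(tokens, stop_words):
    """Return the percentage of tokens that are stop words (0.0 if empty)."""
    if not tokens:
        return 0.0
    kept = remove_stop_words(tokens, stop_words)
    return (len(tokens) - len(kept)) * 100.0 / len(tokens)


example_tokens = "This is an example sentence to test stopwords".split()
filtered = remove_stop_words(example_tokens, SAMPLE_STOP_WORDS)
share = stop_word_percentage(example_tokens, SAMPLE_STOP_WORDS)

print(filtered)  # ['example', 'sentence', 'test', 'stopwords']
print(share)     # 50.0
```

Lowercasing each token only for the comparison keeps the original casing of the retained words, which may matter for later steps such as named entity recognition.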