
Stopwords

Since you are reading this, I assume you are familiar with English, and you may have noticed that some words are used far more often than others: words such as the, there, from, and so on. The task of classifying whether an email is spam or ham is inherently statistical. When certain words occur often in a document (such as an email), they carry more weight in indicating what that document is about. For example, I received an email today about cats (I am a patron of the Cat Protection Society). The word cat or cats occurred eleven times out of the 120 or so words. It would not be difficult to conclude that the email is about cats.

However, the word the showed up 19 times. If we were to classify the topic of the email purely by word count, the email would be classified under the topic the. Connective words such as these are useful for understanding the specific context of a sentence, but for a naïve statistical analysis they often add nothing more than noise. So, we have to remove them.
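To make the effect concrete, here is a minimal sketch in Python of counting words with and without a stopword filter. The names STOPWORDS, tokenize, and top_words are hypothetical, the stopword list is deliberately tiny, and the sample email is made up for illustration; it is not the email described above, nor the list used by the LingSpam corpus.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
# Real projects usually curate their own list.
STOPWORDS = frozenset({
    "the", "there", "from", "a", "an", "and", "of", "to", "is", "are", "on", "in",
})


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=3, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping any token in stopwords."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return counts.most_common(n)


# Made-up sample text, used only to illustrate the counting.
email = (
    "The cats at the shelter need homes. The society says the cats "
    "are healthy, and the cats are ready for adoption from the shelter."
)

print(top_words(email, 1))             # raw counts: a connective word wins
print(top_words(email, 1, STOPWORDS))  # filtered counts: the topic word wins
```

Without the filter, the most frequent token in the sample is the; with it, cats comes out on top, which is the behavior the paragraph above argues for.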

Stopwords are often specific to projects, and I'm not a particular fan of removing them outright. However, the LingSpam corpus has two variants, stop and lemm_stop, that have a stopword list applied and the stopwords removed.
