Stopwords

Since you are reading this, I assume you are familiar with English, and you may have noticed that some words, such as the, there, and from, are used far more often than others. The task of classifying whether an email is spam or ham is inherently statistical in nature. When certain words occur often in a document (such as an email), they carry more weight in indicating what that document is about. For example, I received an email today about cats (I am a patron of the Cat Protection Society). The word cat or cats occurred eleven times out of the 120 or so words. It would not be difficult to conclude that the email is about cats.

However, the word the showed up 19 times. If we were to classify the topic of the email purely by word counts, the email would be classified under the topic the. Connective words such as these are useful for understanding the specific context of a sentence, but for a naïve statistical analysis, they often add nothing more than noise. So, we have to remove them.
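To make the effect concrete, here is a minimal sketch in Python of counting words with and without a stopword filter. The sample text, the tiny stopword list, and the helper names `tokenize` and `top_words` are all invented for illustration; they are not the email described above, nor a list taken from any particular library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list. Real lists are longer
# and are often tuned to the project at hand.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "from", "there", "is", "are", "in", "on", "at"}
)

# Invented sample text standing in for a short email about cats.
SAMPLE = (
    "The cats at the shelter need the food. "
    "The volunteers feed the cats and the kittens."
)


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=3, stopwords=frozenset()):
    """Return the n most frequent words, skipping any word in stopwords."""
    counts = Counter(w for w in tokenize(text) if w not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_words(SAMPLE, 1))             # most frequent word, no filtering
    print(top_words(SAMPLE, 1, STOPWORDS))  # most frequent word after filtering
```

Without the filter, the most frequent word in the sample is the; with the filter applied, it is cats, which is a far better hint at the topic.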

Stopwords are often specific to projects, and I'm not a particular fan of removing them outright. However, the LingSpam corpus includes two variants, stop and lemm_stop, in which a stopword list has already been applied and the stopwords removed.
