- Go Machine Learning Projects
- Xuanyi Chew
- 2021-06-10 18:46:39
Stopwords
Since you are reading this, I assume you are familiar with English, and you may have noticed that some words are used far more often than others: words such as "the", "there", "from", and so on. The task of classifying whether an email is spam or ham is inherently statistical. When certain words appear often in a document (such as an email), they carry more weight in indicating what that document is about. For example, I received an email today about cats (I am a patron of the Cat Protection Society). The word "cat" or "cats" occurred eleven times out of the 120 or so words. It would not be difficult to conclude that the email is about cats.
However, the word "the" showed up 19 times. If we were to classify the topic of the email purely by word count, the email would be classified under the topic "the". Connective words such as these are useful for understanding the specific context of a sentence, but for a naïve statistical analysis they often add nothing more than noise. So, we have to remove them.
Stopwords are often specific to projects, and I'm not a particular fan of removing them outright. However, the LingSpam corpus includes two variants, stop and lemm_stop, in which a stopword list has already been applied and the stopwords removed.