Classification - Spam Email Detection

What makes you you? I have dark hair, pale skin, and Asiatic features. I wear glasses. My facial structure is vaguely round, with extra subcutaneous fat in my cheeks compared to my peers. What I have done is describe the features of my face. Each of these features can be thought of as a point within a probability continuum. What is the probability of having dark hair? Among my friends, dark hair is a very common feature, and so are glasses (remarkably, of the 300 or so people I polled on my Facebook page, 281 require prescription glasses). The epicanthic folds of my eyes are probably less common, as is the extra subcutaneous fat in my cheeks.

Why am I bringing up my facial features in a chapter about spam classification? It's because the principles are the same. If I show you a photo of a human face, what is the probability that the photo is of me? We can say that this probability is a combination of the probability of having dark hair, the probability of having pale skin, the probability of having an epicanthic fold, and so on. From a naive point of view, we can think of each feature as contributing independently to the probability that the photo is of me: the fact that I have an epicanthic fold in my eyes is independent of the fact that my skin is of a yellow pallor. But, of course, recent advancements in genetics have shown this to be patently untrue. These features are, in real life, correlated with one another. We will explore this in a future chapter.

Even though these features depend on one another in real life, we can still take the naive position and treat their probabilities as independent contributions to the probability that the photo is of my face.
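The naive position can be illustrated with a minimal sketch: if the features are treated as independent, their probabilities simply multiply. The feature names and every value below are hypothetical placeholders, except the glasses figure, which uses the 281 out of roughly 300 mentioned earlier. The function name is also an assumption made for illustration.

```python
import math


def naive_joint_probability(feature_probs):
    """Combine per-feature probabilities as if they were independent.

    Assumes every probability is strictly greater than zero, because the
    logarithm of zero is undefined.
    """
    # Summing logarithms and exponentiating is equivalent to multiplying,
    # but is less prone to underflow when there are many small factors.
    log_total = sum(math.log(p) for p in feature_probs.values())
    return math.exp(log_total)


# Hypothetical per-feature probabilities (illustrative only).
face_features = {
    "dark_hair": 0.9,
    "glasses": 281 / 300,  # from the informal poll described in the text
    "epicanthic_fold": 0.3,
    "round_cheeks": 0.2,
}

joint = naive_joint_probability(face_features)
print(f"Naive joint probability: {joint:.4f}")
```

Because the real features are correlated, this product is only an approximation. It is the simplification that gives the naive approach its name.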

In this chapter, we will build an email spam classification system using a Naive Bayes algorithm, which can be used well beyond email spam classification. Along the way, we will explore the very basics of natural language processing, and how probability is inherently tied to the very language we use. A probabilistic understanding of language will be built from the ground up, starting with the term frequency-inverse document frequency (TF-IDF), which will then be translated into Bayesian probabilities, which are used to classify the emails.