
Classification - Spam Email Detection

What makes you you? I have dark hair, pale skin, and Asiatic features. I wear glasses. My facial structure is vaguely round, with extra subcutaneous fat in my cheeks compared to my peers. What I have just done is describe the features of my face. Each of these features can be thought of as a point within a probability continuum. What is the probability of having dark hair? Among my friends, dark hair is a very common feature, and so are glasses (a remarkable statistic: of the 300 or so people I polled on my Facebook page, 281 require prescription glasses). The epicanthic folds of my eyes are probably less common, as is the extra subcutaneous fat in my cheeks.

Why am I bringing up my facial features in a chapter about spam classification? Because the principles are the same. If I show you a photo of a human face, what is the probability that the photo is of me? We can say that this probability is a combination of the probability of having dark hair, the probability of having pale skin, the probability of having an epicanthic fold, and so on. From a Naive point of view, we can think of each feature as contributing independently to the probability that the photo is of me: the fact that I have an epicanthic fold in my eyes is independent of the fact that my skin is of a yellow pallor. But, of course, with recent advancements in genetics, this has been shown to be patently untrue. These features are, in real life, correlated with one another. We will explore this in a future chapter.

Despite this real-life dependence between features, we can still assume the Naive position and treat these probabilities as independent contributions to the probability that the photo is one of my face.
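
To make the Naive position concrete, here is a minimal sketch in Python of how independent feature probabilities might be combined with Bayes' rule. Every number below is made up for illustration, and the names `likelihood_me`, `likelihood_other`, and `naive_posterior` are hypothetical; none of them come from the chapter.

```python
import math

# Hypothetical per-feature likelihoods, invented purely for illustration:
# P(feature | photo is me) and P(feature | photo is someone else).
likelihood_me = {"dark_hair": 0.99, "glasses": 0.95, "epicanthic_fold": 0.99}
likelihood_other = {"dark_hair": 0.60, "glasses": 0.50, "epicanthic_fold": 0.20}


def naive_posterior(observed, prior_me):
    """Return P(me | observed features), assuming features are independent."""
    # Start from the prior for each hypothesis, in log space.
    log_me = math.log(prior_me)
    log_other = math.log(1.0 - prior_me)
    # Independence lets us multiply the likelihoods, i.e. add their logs.
    for feature in observed:
        log_me += math.log(likelihood_me[feature])
        log_other += math.log(likelihood_other[feature])
    # Normalise the two joint scores so that they sum to 1.
    score_me = math.exp(log_me)
    score_other = math.exp(log_other)
    return score_me / (score_me + score_other)


# Assumed prior: 1 photo in 100 is of me.
posterior = naive_posterior(["dark_hair", "glasses", "epicanthic_fold"], 0.01)
print(posterior)
```

Each observed feature that is more likely for me than for a stranger pushes the posterior above the prior. Working with logarithms is a common precaution against numeric underflow when many small probabilities are multiplied.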

In this chapter, we will build an email spam classification system using a Naive Bayes algorithm, which can be used well beyond email spam classification. Along the way, we will explore the very basics of natural language processing, and how probability is inherently tied to the very language we use. A probabilistic understanding of language will be built from the ground up, starting with term frequency-inverse document frequency (TF-IDF), which will then be translated into Bayesian probabilities, which are used to classify the emails.
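
As a rough preview of TF-IDF, the sketch below uses one common textbook variant: term frequency is a word's count divided by the document length, and inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the word. The chapter's own formulation may differ; the function name `tf_idf` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document; docs are lists of tokens."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Weight = (term frequency) * log(N / document frequency).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up corpus of three tokenised "emails".
corpus = [
    "win money now".split(),
    "meeting at noon".split(),
    "win a free prize now".split(),
]
scores = tf_idf(corpus)
print(scores[0])
```

In this variant, a word that appears in every document gets a weight of zero, while a word confined to a single document is weighted most heavily, which is the intuition behind using the score to separate distinctive words from common ones.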
