- Go Machine Learning Projects
- Xuanyi Chew
Classification - Spam Email Detection
What makes you you? I have dark hair, pale skin, and Asiatic features. I wear glasses. My facial structure is vaguely round, with extra subcutaneous fat in my cheeks compared to my peers. What I have done is describe the features of my face. Each of these features can be thought of as a point within a probability continuum. What is the probability of having dark hair? Among my friends, dark hair is a very common feature, and so are glasses (a remarkable statistic: of the 300 or so people I polled on my Facebook page, 281 require prescription glasses). The epicanthic folds of my eyes are probably less common, as is the extra subcutaneous fat in my cheeks.
Why am I bringing up my facial features in a chapter about spam classification? It's because the principles are the same. If I show you a photo of a human face, what is the probability that the photo is of me? We can say that the probability that the photo is of my face is a combination of the probability of having dark hair, the probability of having pale skin, the probability of having an epicanthic fold, and so on and so forth. From a Naive point of view, we can think of each feature as contributing independently to the probability that the photo is of me: the fact that I have an epicanthic fold in my eyes is independent of the fact that my skin is of a yellow pallor. But, of course, recent advances in genetics have shown this to be patently untrue. These features are, in real life, correlated with one another. We will explore this in a future chapter.
Despite this real-life dependence between features, we can still take the Naive position and treat these probabilities as independent contributions to the probability that the photo is of my face.
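To make the naive assumption concrete, here is a minimal sketch in Go of how independent per-feature probabilities combine into a single score. The feature names and probability values are invented for illustration only; they are not the chapter's actual code or data:

```go
package main

import (
	"fmt"
	"math"
)

// naiveScore combines per-feature probabilities under the naive
// (independence) assumption. The combined probability is proportional to
// the product of the individual probabilities; multiplying many small
// numbers underflows quickly, so we sum their logarithms instead.
func naiveScore(featureProbs []float64) float64 {
	logScore := 0.0
	for _, p := range featureProbs {
		logScore += math.Log(p)
	}
	return logScore
}

func main() {
	// Hypothetical per-feature probabilities: dark hair, glasses,
	// epicanthic fold, extra subcutaneous fat in the cheeks.
	features := []float64{0.85, 281.0 / 300.0, 0.3, 0.1}
	fmt.Printf("log score: %.4f\n", naiveScore(features))
}
```

A higher (less negative) log score simply means the combination of features is more probable under the independence assumption; comparing such scores across classes is the core of what Naive Bayes does later in this chapter.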
In this chapter, we will build an email spam classification system using the Naive Bayes algorithm, which can be used beyond email spam classification. Along the way, we will explore the very basics of natural language processing, and how probability is inherently tied to the very language we use. A probabilistic understanding of language will be built from the ground up with the introduction of term frequency-inverse document frequency (TF-IDF), which will then be translated into Bayesian probabilities, which are used to classify the emails.
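As a preview of the TF-IDF idea the chapter builds on, the following is a rough sketch only: the toy corpus, tokenization by whitespace, and the tfidf function are placeholders of my own, not the book's implementation. TF counts how often a term appears in a document, and IDF discounts terms that appear in many documents:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf computes term frequency-inverse document frequency for a term in
// one document, given the whole corpus of tokenized documents.
func tfidf(term string, doc []string, corpus [][]string) float64 {
	// Term frequency: how often the term occurs in this document.
	count := 0
	for _, w := range doc {
		if w == term {
			count++
		}
	}
	tf := float64(count) / float64(len(doc))

	// Inverse document frequency: penalize terms found in many documents.
	docsWithTerm := 0
	for _, d := range corpus {
		for _, w := range d {
			if w == term {
				docsWithTerm++
				break
			}
		}
	}
	idf := math.Log(float64(len(corpus)) / float64(1+docsWithTerm))
	return tf * idf
}

func main() {
	corpus := [][]string{
		strings.Fields("win money now"),
		strings.Fields("meeting at noon"),
		strings.Fields("win a free prize now"),
	}
	fmt.Printf("tf-idf of %q: %.4f\n", "win", tfidf("win", corpus[0], corpus))
}
```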