Title: Go Machine Learning Projects
Author: Xuanyi Chew
Last updated: 2021-06-10 18:46:38

The project

What we want to do is simple: given an email, decide whether it is kosher (which we call ham) or spam. We will be using the LingSpam database. The emails in that database are a little dated, since spammers update their techniques and vocabulary all the time. However, I chose the LingSpam corpus for a good reason: it is already nicely preprocessed. The original scope of this chapter included an introduction to preprocessing emails; however, preprocessing natural language is a subject that could fill an entire book, so we will use a dataset that has already been preprocessed. This allows us to focus more on the mechanics of a very elegant algorithm.

Fear not, though, as I will briefly walk through the basics of preprocessing. Be warned, however, that the complexity rises very steeply, so be prepared to be sucked into a black hole of many hours spent preprocessing natural language. At the end of this chapter, I will also recommend some libraries that are useful for preprocessing.
