The project 

What we want to do is simple: given an email, decide whether it is legitimate (which we call ham) or spam. We will be using the LingSpam database. The emails in that database are a little dated, because spammers update their techniques and vocabulary all the time. However, I chose the LingSpam corpus for a good reason: it is already nicely preprocessed. The original scope of this chapter included the preprocessing of emails; however, preprocessing natural language is a subject large enough to fill an entire book, so we will use a dataset that has already been preprocessed. This allows us to focus more on the mechanics of a very elegant algorithm.

Fear not, though, as I will still walk through the basics of preprocessing. Be warned, however, that the complexity climbs very steeply, so be prepared to lose many hours to the black hole of preprocessing natural language. At the end of this chapter, I will also recommend some libraries that are useful for preprocessing.
