The project 

What we want to do is simple: given an email, decide whether it is kosher (which we call ham) or spam. We will be using the LingSpam corpus. The emails in that corpus are a little dated, since spammers update their techniques and vocabulary all the time. However, I chose the LingSpam corpus for a good reason: it is already nicely preprocessed. The original scope of this chapter included the preprocessing of emails; however, preprocessing natural language is a subject for an entire book in itself, so we will use a dataset that has already been preprocessed. This allows us to focus more on the mechanics of a very elegant algorithm.

Fear not, though, as I will briefly walk through the basics of preprocessing. Be warned, however, that the complexity climbs very steeply, so be prepared to be sucked into a black hole of many hours spent preprocessing natural language. At the end of this chapter, I will also recommend some libraries that are useful for preprocessing.