Exploratory data analysis 

Let's jump into the data. The LingSpam corpus comes in four variants: bare, lemm, lemm_stop, and stop. Each variant has ten parts, and each part contains multiple files. Each file represents an email. Files with a spmsg prefix in their names are spam, while the rest are ham. An example email looks as follows (from the bare variant):
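The naming rule above is enough to label every file without opening it. The following is a minimal sketch in Python, not the book's own code: the function names `is_spam` and `count_emails` are hypothetical, and the only facts taken from the text are the `spmsg` prefix and the variant/part/file nesting.

```python
from pathlib import Path

# From the text: files whose names start with this prefix are spam.
SPAM_PREFIX = "spmsg"


def is_spam(filename: str) -> bool:
    """Return True when the file name carries the spam prefix."""
    return Path(filename).name.startswith(SPAM_PREFIX)


def count_emails(variant_dir: str) -> dict:
    """Count ham and spam files under one variant directory (for example, bare).

    The walk is recursive, so it does not depend on what the ten part
    directories are actually called.
    """
    counts = {"ham": 0, "spam": 0}
    for path in Path(variant_dir).rglob("*"):
        if path.is_file():
            counts["spam" if is_spam(path.name) else "ham"] += 1
    return counts
```

Calling `count_emails` on each of the four variant directories is a quick way to confirm that they contain the same number of emails and to see how imbalanced the two classes are.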

Subject: re : 2 . 882 s - > np np
> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of " john mcnamara the name " is tautologous and thus , at > that level , indistinguishable from " well , well now , what have we here ? " . to say that ' john mcnamara the name ' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were ' chaim shmendrik ' , ' john mcnamara the name ' would be false . no tautology , this . ( and no reduplication , either . )

Here are some things to note about this particular email:

  • This is an email about linguistics, specifically about parsing a natural-language sentence into multiple noun phrases (np). This is largely irrelevant to the project at hand. I do, however, think it's a good idea to skim the topics, if only as a sanity check when inspecting the data manually.
  • There is an email address and a person's name attached to this email; the dataset is not particularly anonymized. This has some implications for the future of machine learning, which I will explore in the final chapter of this book.
  • The email is very nicely split into tokens (that is, every word and punctuation mark is separated by a space).
  • The email has a Subject line.
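The last two observations suggest a very simple way to load an email: peel off the Subject line, then split everything on whitespace. Here is a minimal Python sketch under that assumption; `parse_email` and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
def parse_email(raw: str) -> dict:
    """Split a raw email into subject tokens and body tokens.

    Assumes, as in the example email, that the subject (if present) is the
    first line and starts with "Subject:", and that tokens are already
    separated by spaces.
    """
    lines = raw.splitlines()
    subject = ""
    body_start = 0
    if lines and lines[0].lower().startswith("subject:"):
        subject = lines[0][len("subject:"):].strip()
        body_start = 1
    body = " ".join(lines[body_start:])
    # str.split() with no arguments splits on any run of whitespace.
    return {"subject_tokens": subject.split(), "body_tokens": body.split()}
```

Because the corpus is already tokenized, no tokenizer is needed at this stage; a plain whitespace split recovers the words and punctuation marks as separate tokens.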

The first two points are particularly noteworthy. Sometimes, the subject matter actually matters in machine learning. In our case, we can build our algorithms to be blind to it, so that they can be used generically across all emails. But there are times when being context-sensitive will take your machine-learning algorithms to new heights. The second thing to note is anonymity. We live in an age where software flaws are often the downfall of companies. Doing machine learning on non-anonymous datasets is often fraught with biases. We should try to anonymize data as much as possible.
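As one small illustration of what anonymizing could look like on this corpus, the sketch below masks email addresses in an already tokenized message. It is an assumption-laden example rather than a complete anonymizer: the function name `mask_addresses` and the `<EMAIL>` placeholder are hypothetical, and the logic relies on addresses appearing in the spaced-out form seen in the example email (`mmorse @ vm1 . yorku . ca`).

```python
def mask_addresses(tokens, placeholder="<EMAIL>"):
    """Replace tokenized email addresses with a placeholder token.

    An address is taken to be: one token, then "@", then one domain label,
    then any number of ("." label) pairs.
    """
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "@" and out and i + 1 < len(tokens):
            out.pop()   # drop the local part that was already copied
            i += 2      # skip "@" and the first domain label
            # Consume the remaining ". label" pairs of the domain.
            while i + 1 < len(tokens) and tokens[i] == ".":
                i += 2
            out.append(placeholder)
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Note the limits of this sketch: it leaves personal names such as the sender's untouched, and when an address is immediately followed by a sentence-ending period and another word, it will swallow that word too. It errs on the side of masking too much rather than too little.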