- Go Machine Learning Projects
- Xuanyi Chew
- 2021-06-10 18:46:38
Exploratory data analysis
Let's jump into the data. The LingSpam corpus comes in four variants: bare, lemm, lemm_stop, and stop. Each variant has ten parts, and each part contains multiple files. Each file represents an email. Files with a spmsg prefix in their names are spam, while the rest are ham. An example email looks as follows (from the bare variant):
Subject: re : 2 . 882 s - > np np
> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of " john mcnamara the name " is tautologous and thus , at > that level , indistinguishable from " well , well now , what have we here ? " . to say that ' john mcnamara the name ' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were ' chaim shmendrik ' , ' john mcnamara the name ' would be false . no tautology , this . ( and no reduplication , either . )
Here are some things to note about this particular email:
- This is an email about linguistics, specifically about the parsing of a natural sentence into multiple noun phrases (np). This is largely irrelevant to the project at hand. I do, however, think it's a good idea to go through the topics, if only as an occasional manual sanity check.
- There is an email address and a person's name attached to this email, so the dataset is not particularly anonymized. This has some implications for the future of machine learning, which I will explore in the final chapter of this book.
- The email is very nicely split into fields (that is, every word and punctuation mark is separated by spaces).
- The email has a Subject line.
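The last two points can be sketched in Go. This is only a minimal illustration, not the book's implementation: `splitEmail` is a hypothetical helper, and it assumes the Subject line is the first line of the file, as in the example above. Because every word and punctuation mark is already space separated, `strings.Fields` is enough to tokenize the text.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEmail is a hypothetical helper. It separates the Subject line from
// the body and splits both on whitespace. It assumes that the Subject line,
// when present, is the first line of the email.
func splitEmail(raw string) (subject, body []string) {
	parts := strings.SplitN(raw, "\n", 2)
	if strings.HasPrefix(parts[0], "Subject:") {
		subject = strings.Fields(strings.TrimPrefix(parts[0], "Subject:"))
		if len(parts) == 2 {
			body = strings.Fields(parts[1])
		}
		return subject, body
	}
	// No Subject line: treat the whole email as body.
	return nil, strings.Fields(raw)
}

func main() {
	// A shortened version of the example email above.
	raw := "Subject: re : 2 . 882 s - > np np\n> date : sun , 15 dec 91"
	subject, body := splitEmail(raw)
	fmt.Println(len(subject), subject)
	fmt.Println(len(body), body)
}
```

Note that punctuation marks such as `:` and `>` come out as tokens of their own, which is a direct consequence of how the corpus was prepared.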
The first two points are particularly noteworthy. Sometimes, the subject matter actually matters in machine learning. In our case, we can build our algorithms to be blind to it, so they can be used generically across all emails. But there are times when being context-sensitive will take your machine-learning algorithms to new heights. The second thing to note is anonymity. We live in an age where software flaws are often the downfall of companies. Doing machine learning on non-anonymous datasets is often fraught with biases. We should try to anonymize data as much as possible.
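Since the algorithms are blind to subject matter, the only label they need comes from the filename convention described at the start of this section. The sketch below shows one way that labeling could look. The `isSpam` helper and the file paths are hypothetical and exist only for illustration; the actual names inside the corpus may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isSpam is a hypothetical helper. It labels a corpus file purely by its
// name: files whose base name starts with "spmsg" are spam, the rest are ham.
func isSpam(path string) bool {
	return strings.HasPrefix(filepath.Base(path), "spmsg")
}

func main() {
	// Hypothetical paths for illustration only.
	paths := []string{
		"bare/part1/spmsg-example.txt",
		"bare/part1/other-example.txt",
	}
	counts := map[bool]int{}
	for _, p := range paths {
		counts[isSpam(p)]++
	}
	fmt.Printf("spam: %d, ham: %d\n", counts[true], counts[false])
}
```

Counting the two classes this way is a cheap first check on class balance before any training is done.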