- Go Machine Learning Projects
- Xuanyi Chew
- 2021-06-10 18:46:38
Exploratory data analysis
Let's jump into the data. The LingSpam corpus comes in four variants: bare, lemm, lemm_stop, and stop. In each variant, there are ten parts and each part contains multiple files. Each file represents an email. Files with a spmsg prefix in their names are spam, while the rest are ham. An example email looks as follows (from the bare variant):
Subject: re : 2 . 882 s - > np np
> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of " john mcnamara the name " is tautologous and thus , at > that level , indistinguishable from " well , well now , what have we here ? " . to say that ' john mcnamara the name ' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were ' chaim shmendrik ' , ' john mcnamara the name ' would be false . no tautology , this . ( and no reduplication , either . )
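The file-naming convention described before the example (a spmsg prefix marks spam, everything else is ham) can be sketched in Go as follows. This is only a minimal illustration: the function names `isSpam` and `countLabels` and the file paths in `main` are hypothetical, chosen for this sketch rather than taken from the corpus or from the book's code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isSpam reports whether a file is spam, based only on the naming
// convention described in the text: spam files have an "spmsg" prefix.
func isSpam(path string) bool {
	return strings.HasPrefix(filepath.Base(path), "spmsg")
}

// countLabels tallies how many of the given paths are spam and how many are ham.
func countLabels(paths []string) (spam, ham int) {
	for _, p := range paths {
		if isSpam(p) {
			spam++
		} else {
			ham++
		}
	}
	return spam, ham
}

func main() {
	// Hypothetical paths, for illustration only.
	paths := []string{
		"bare/part1/spmsga1.txt",
		"bare/part1/3-1msg1.txt",
		"bare/part2/spmsgb7.txt",
	}
	spam, ham := countLabels(paths)
	fmt.Printf("spam=%d ham=%d\n", spam, ham)
}
```

In practice, the list of paths would more likely come from walking one variant's directory (for example with `filepath.WalkDir`) than from a hard-coded slice.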
Here are some things to note about this particular email:
- This is an email about linguistics, specifically about parsing a natural-language sentence into multiple noun phrases (np). That fact is largely irrelevant to the project at hand. I do, however, think it's a good idea to skim the topics of the emails, if only to provide a sanity check when inspecting results manually.
- There is an email address and a person's name attached to this email: the dataset is not particularly anonymized. This has some implications for the future of machine learning, which I will explore in the final chapter of this book.
- The email is very nicely split into fields (that is, each word and punctuation mark is separated by spaces).
- The email has a Subject line.
The first two points are particularly noteworthy. Sometimes the subject matter actually matters in machine learning. In our case, we can build our algorithms to be blind to it, so that they can be used generically across all emails. But there are times when being context-sensitive will take your machine-learning algorithms to new heights. The second thing to note is anonymity. We live in an age where software flaws are often the downfall of companies. Doing machine learning on non-anonymous datasets is often fraught with biases. We should try to anonymize data as much as possible.
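The last two observations in the list (space-separated fields and a Subject line) make the emails easy to pull apart. Here is a rough Go sketch of that idea. It assumes that the Subject line is the first line of the file and is followed by a newline. The `email` type and `parseEmail` function are hypothetical names introduced only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// email is a hypothetical container for a parsed message.
type email struct {
	Subject []string // fields of the Subject line
	Body    []string // fields of everything after the first line
}

// parseEmail assumes the first line is the Subject line and that fields
// are separated by whitespace. If the "Subject:" prefix is absent, the
// whole first line is treated as the subject.
func parseEmail(raw string) email {
	parts := strings.SplitN(raw, "\n", 2)
	subject := strings.TrimPrefix(parts[0], "Subject:")
	var body string
	if len(parts) == 2 {
		body = parts[1]
	}
	return email{
		Subject: strings.Fields(subject),
		Body:    strings.Fields(body),
	}
}

func main() {
	// A shortened version of the example email shown earlier.
	raw := "Subject: re : 2 . 882 s - > np np\n> date : sun , 15 dec 91"
	e := parseEmail(raw)
	fmt.Println(len(e.Subject), e.Subject)
	fmt.Println(len(e.Body), e.Body)
}
```

Because the corpus already separates words and punctuation with spaces, `strings.Fields` is enough here and no dedicated tokenizer is needed at this stage.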