
Introducing data mining

Data mining provides a way for a computer to learn how to make decisions with data. Such a decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.

Data mining is part algorithm design, statistics, engineering, optimization, and computer science. However, in addition to these base skills, we also need to apply domain knowledge (expert knowledge) of the area in which we are applying data mining. Domain knowledge is critical for going from good results to great results. Applying data mining effectively usually requires this domain-specific knowledge to be integrated with the algorithms.

Most data mining applications work with the same high-level view, where a model learns from some data and is applied to other data, although the details often change quite considerably.

Data mining applications involve creating a dataset and tuning the algorithm, as explained in the following steps:

  1. We start our data mining process by creating a dataset that describes an aspect of the real world. A dataset comprises the following two aspects:

  • Samples: These are objects in the real world, such as a book, photograph, animal, person, or any other object. Samples are also referred to as observations, records or rows, among other naming conventions.
  • Features: These are descriptions or measurements of the samples in our dataset. A feature could be the length of an object, the frequency of a specific word, the number of legs on an animal, the date it was created, and so on. Features are also referred to as variables, columns, attributes, or covariates, again among other naming conventions.
  2. The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.
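The samples-and-features view described in the first step can be sketched in plain Python as a small table, with one row per sample and one column per feature. The animals, feature names, and values here are hypothetical and chosen only for illustration; they do not come from any dataset used in this book.

```python
# Hypothetical dataset: each row is one sample, each column is one feature.
# The names and values are assumptions made up for this sketch.
feature_names = ["height_cm", "number_of_legs"]

samples = [
    [30.0, 4],    # sample 0: a small four-legged animal
    [170.0, 2],   # sample 1: a two-legged animal
    [110.0, 4],   # sample 2: a larger four-legged animal
]

n_samples = len(samples)        # number of rows (observations)
n_features = len(samples[0])    # number of columns (variables)

# Pull out a single feature (column) across all samples.
leg_column = feature_names.index("number_of_legs")
legs = [row[leg_column] for row in samples]

print(n_samples, n_features)
print(legs)
```

Reading down a column gives every sample's value for one feature, while reading across a row gives the full description of one sample.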

As a simple example, we may wish the computer to be able to categorize people as short or tall. We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:

As explained above, the next step involves tuning the parameters of our algorithm. As a simple algorithm, consider this rule: if the height is more than x, the person is tall; otherwise, they are short. The training algorithm then looks at the data and decides on a good value for x. For the preceding data, a reasonable value for this threshold would be 170 cm. A person taller than 170 cm is considered tall by the algorithm, and anyone else is considered short. This then lets our algorithm classify new data, such as a person with a height of 167 cm, even if we have never seen a person with that measurement before.
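The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration and not a definitive implementation: the heights and labels are hypothetical values invented for this sketch (they are not the table from this chapter), and the function names `train_threshold` and `classify` are assumptions. The training step simply tries each observed height as a candidate for x and keeps the one that makes the fewest mistakes.

```python
def classify(height, x):
    """Apply the rule: taller than x means tall, otherwise short."""
    return "tall" if height > x else "short"


def train_threshold(heights, labels):
    """Pick the candidate x with the fewest mistakes on the training data.

    Candidates are the observed heights themselves (an assumption made
    to keep this sketch short). Ties go to the smallest candidate.
    """
    best_x = None
    best_errors = len(heights) + 1
    for x in sorted(set(heights)):
        errors = sum(
            classify(h, x) != label for h, label in zip(heights, labels)
        )
        if errors < best_errors:
            best_x, best_errors = x, errors
    return best_x


# Hypothetical training data, invented for illustration only.
heights = [155, 162, 165, 170, 174, 180, 185]
labels = ["short", "short", "short", "short", "tall", "tall", "tall"]

x = train_threshold(heights, labels)
print(x)                  # 170
print(classify(167, x))   # short
```

With this made-up data, the learned threshold is 170, so a new person with a height of 167 cm is classified as short even though no one of that height appears in the training data.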

In the preceding data, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This choice of features, known as feature engineering, is a critical problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.

In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to perform every task. This clarity sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.
