Understanding the machine learning workflow

As mentioned earlier, machine learning is all about building mathematical models in order to understand data. The learning aspect enters this process when we give a machine learning model the capability to adjust its internal parameters; we can tweak these parameters so that the model explains the data better. In a sense, this can be understood as the model learning from the data. Once the model has learned enough--whatever that means--we can ask it to explain newly observed data.
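
To make this idea concrete, here is a minimal sketch (not part of the workflow figure) that "learns" the two internal parameters of a straight line from noisy data using NumPy; the data and parameter values are invented for illustration:

```python
import numpy as np

# Toy data: ten noisy observations that roughly follow a line
x = np.linspace(0, 9, 10)
y = 2.0 * x + 1.0 + np.random.randn(10)

# Learning here means adjusting the model's two internal parameters
# (slope and intercept) so that the line explains the data better
slope, intercept = np.polyfit(x, y, deg=1)

# Once the parameters are learned, the model can explain a newly
# observed data point
y_new = slope * 10.0 + intercept
```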

This process is illustrated in the following figure:

A typical workflow to tackle machine learning problems

Let's break it down step by step.

The first thing to notice is that machine learning problems are always split into (at least) two distinct phases:

  • A training phase, during which we aim to train a machine learning model on a set of data that we call the training dataset
  • A test phase, during which we evaluate the learned (or finalized) machine learning model on a new set of never-before-seen data that we call the test dataset

The importance of splitting our data into a training set and a test set cannot be overstated. We always evaluate our models on an independent test set because we are interested in how well our models generalize to new data. In the end, isn't this what learning is all about--be it machine learning or human learning? Think back to school, when you were a learner yourself: the problems you had to solve as part of your homework would never show up in exactly the same form in the final exam. The same scrutiny should be applied to a machine learning model; we are not so much interested in how well our models can memorize a set of data points (such as a homework problem) as in how our models will use what they have learned to solve new problems (such as the ones that show up in a final exam) and explain new data points.
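
In code, such a split is often a one-liner. The following sketch uses scikit-learn's train_test_split function (assuming scikit-learn is installed); the dataset itself is made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 100 data points with 2 features each, one label per point
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the points as the never-before-seen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```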

The workflow of an advanced machine learning problem will typically include a third set of data, termed the validation dataset. A validation set is typically formed by further partitioning the training dataset, and it is used in advanced concepts such as model selection, which we will talk about in Chapter 11, Selecting the Right Model with Hyper-Parameter Tuning, once we have become proficient in building machine learning systems. For now, this distinction is not important.
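
For the curious, carving a validation set out of the training data can reuse the same split function; continuing the hypothetical example above, this yields a 60/20/20 split overall:

```python
# Partition the training data further; the test set stays untouched.
# 0.25 of the 80% training portion corresponds to 20% of all data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```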

The next thing to notice is that machine learning is really all about the data. Data enters the previously described workflow diagram in its raw form--whatever that means--and is used in both training and test phases. Data can be anything from images and movies to text documents and audio files. Therefore, in its raw form, data might be made of pixels, letters, words, or even worse: pure bits. It is easy to see that data in such a raw form might not be very convenient to work with. Instead, we have to find ways to preprocess the data in order to bring it into a form that is easy to parse.
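
To see what raw really means here, consider loading an image with OpenCV; the file name below is a placeholder:

```python
import cv2

img = cv2.imread('example.jpg')  # hypothetical file name
# In its raw form, the image is nothing but an array of pixel values
print(img.shape)  # e.g., (480, 640, 3): rows x columns x color channels
print(img.dtype)  # uint8: every pixel is a raw 8-bit number
```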

Data preprocessing comes in two stages:

  • Feature selection: This is the process of identifying important attributes (or features) in the data. Possible features of an image might be the location of edges, corners, or ridges. You might already be familiar with some more advanced feature descriptors that OpenCV provides, such as speeded up robust features (SURF) or the histogram of oriented gradients (HOG). Although these features can be applied to any image, they might not be that important (or work that well) for our specific task. For example, if our task was to distinguish between clean and dirty water, the most important feature might turn out to be the color of the water, and the use of SURF or HOG features might not help us much.
  • Feature extraction: This is the actual process of transforming the raw data into the desired feature space. An example would be the Harris operator, which allows us to extract corners (that is, a selected feature) in an image. A minimal sketch follows this list.
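
The following sketch shows both stages in action, extracting Harris corner responses and HOG descriptors from an image; the file name is a placeholder, and the parameter values are common defaults rather than recommendations:

```python
import cv2
import numpy as np

img = cv2.imread('example.jpg')  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Feature extraction with the Harris operator: transform raw pixels
# into a map of corner responses (our selected feature)
corners = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)

# HOG features, in case gradients suit the task better than corners;
# the default HOG detection window is 64x128 pixels, so resize first
patch = cv2.resize(gray, (64, 128))
hog = cv2.HOGDescriptor()
descriptor = hog.compute(patch)  # a 3780-dimensional feature vector
```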

A more advanced topic is the process of inventing informative features, which is known as feature engineering. After all, before it was possible for people to select from popular features, someone had to invent them first. Coming up with good features is often more important for the success of a machine learning system than the choice of the algorithm itself. We will talk about feature engineering extensively in Chapter 4, Representing Data and Engineering Features.

Don't let naming conventions confuse you! Sometimes feature selection and feature extraction are hard to distinguish, mainly because of how things are named. For example, SURF stands for both the feature extractor and the actual name of the features. The same is true for the scale-invariant feature transform (SIFT), which is a feature extractor that yields what is known as SIFT features.

A last point to be made is that in supervised learning, every data point must have a label. A label identifies a data point as either belonging to a certain class of things (such as cat or dog) or as having a certain value (such as the price of a house). At the end of the day, the goal of a supervised machine learning system is to predict the labels of all data points in the test set (as shown in the previous figure). We do this by learning regularities in the training data, using the labels that come with it, and then testing our performance on the test set.
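
Jumping slightly ahead, here is a minimal sketch of this idea using OpenCV's built-in k-nearest neighbor classifier (which the book introduces in a later chapter); the data points and labels are invented for illustration:

```python
import cv2
import numpy as np

# Training data: one row per data point, one label per row
# (0 might stand for 'clean' water, 1 for 'dirty')
X_train = np.array([[0.1, 0.8],
                    [0.9, 0.2],
                    [0.4, 0.5]], dtype=np.float32)
y_train = np.array([[0], [1], [0]], dtype=np.float32)

# Learn regularities in the training data...
knn = cv2.ml.KNearest_create()
knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)

# ...then predict the label of a never-before-seen data point
X_test = np.array([[0.2, 0.7]], dtype=np.float32)
_, results, _, _ = knn.findNearest(X_test, k=1)
print(results)  # [[0.]]: the new point is predicted to be 'clean'
```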

Therefore, in order to build a functioning machine learning system, we first have to cover how to load, store, and manipulate data. How do you even do that in OpenCV with Python?
