
Components of an ML solution

So far, we've discussed three major classes of machine learning algorithms. However, to solve an ML problem, we'll need a complete system in which the ML algorithm is only one part. The most important aspects of such a system are as follows:

  • Learner: This is the algorithm that is used, together with its learning philosophy. The choice of algorithm is determined by the problem we're trying to solve, since different problems can be better suited to certain machine learning algorithms.
  • Training data: This is the raw dataset that we are interested in. This can be labeled or unlabeled. It's important to have enough sample data for the learner to understand the structure of the problem.
  • Representation: This is how we express the data in terms of the chosen features, so that we can feed it to the learner. For example, to classify handwritten images of digits, we'll represent the image as an array of values, where each cell contains the color value of one pixel (see the short sketch after this list). A good choice of data representation is important for achieving better results.
  • Goal: This represents the reason to learn from the data for the problem at hand. It is strictly related to the target, and helps define what the learner should learn and what representation to use. For example, the goal may be to clean our mailbox of unwanted emails, and this goal defines what the target of our learner is: in this case, the detection of spam emails.
  • Target: This represents what is being learned as well as the final output. The target can be a classification of unlabeled data, a representation of input data according to hidden patterns or characteristics, a simulator for future predictions, or a response to an outside stimulus or strategy (in the case of reinforcement learning).
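
To make the Representation point concrete, here is a minimal sketch of turning a hypothetical 8x8 grayscale digit image into a feature vector. The image values here are randomly generated purely for illustration; in practice they would come from a real dataset.

```python
import numpy as np

# Hypothetical 8x8 grayscale digit image; in a real setting this would
# come from a dataset rather than random values.
image = np.random.randint(0, 256, size=(8, 8))

# Representation: flatten the 2D pixel grid into a 1D feature vector,
# one value per pixel, so it can be fed to a learner.
features = image.reshape(-1).astype(np.float32)

# Scale pixel values to the [0, 1] range, a common preprocessing step.
features /= 255.0

print(features.shape)  # (64,) -> 64 features, one per pixel
```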

It can never be emphasized enough: any machine learning algorithm can only achieve an approximation of the target, not a perfect numerical description. Machine learning algorithms are not exact mathematical solutions to problems; they are just approximations. Previously, we defined learning as a function from the space of features (the input) into a range of classes. We'll later see how certain machine learning algorithms, such as neural networks, can in theory approximate any function to any degree. This result is known as the Universal Approximation Theorem, but it does not imply that we can get a precise solution to our problem. In addition, solutions to the problem can be better achieved by better understanding the training data.

Typically, a problem that is solvable with classic machine learning techniques may require a thorough understanding and processing of the training data before deployment. The steps to solve an ML problem are as follows:

  • Data collection: This implies the gathering of as much data as possible. In the case of supervised learning, this also includes correct labeling.
  • Data processing: This implies cleaning the data, such as removing redundant or highly correlated features, or even filling missing data, and understanding the features that define the training data.
  • Creation of the test case: Usually, the data can be divided into three sets (a minimal splitting sketch follows this list):
    • Training set: We use this set to train the ML algorithm.
    • Validation set: We use this set to evaluate the accuracy of the algorithm with unknown data during training. We'll train the algorithm for some time on the training set and then we'll use the validation set to check its performance. If we are not satisfied with the result, we can tune the hyperparameters of the algorithm and repeat the process again. The validation set can also help us to determine when to stop the training. We'll learn more about this later in this section.
    • Test set: When we finish tuning the algorithm with the training and validation cycle, we'll use the test set only once, for a final evaluation. The test set is similar to the validation set in the sense that the algorithm hasn't used it during training. However, when we strive to improve the algorithm on the validation data, we may inadvertently introduce bias, which can skew the results in favor of the validation set and not reflect the actual performance. Because we use the test set only once, this provides a more objective measurement of the algorithm.
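
The following is a minimal sketch of such a three-way split, using scikit-learn's train_test_split on a toy dataset. The 70/15/15 proportions and the random_state values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 1,000 samples with 10 features each and binary labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First split off the test set (15% of the data), kept aside until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Then carve a validation set out of the remaining data
# (0.15 / 0.85 of what is left, i.e. 15% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700, 150, 150
```
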
One of the reasons for the success of deep learning algorithms is that they usually require less data processing than classic methods. For a classic algorithm, you would have to apply different data processing and extract different features for each problem. With DL, you can apply the same data-processing pipeline to most tasks, which makes you more productive and requires less domain knowledge of the task at hand compared to classic ML algorithms.

There are many valid reasons to create testing and validation datasets. As mentioned, machine learning techniques can only produce an approximation of the desired result. Often, we can only include a finite and limited number of variables, and there may be many variables that are outside of our control. If we used only a single dataset, our model might end up memorizing the data and producing an extremely high accuracy on the data it has memorized. However, this result may not be reproducible on other similar but unknown datasets. One of the key goals of machine learning algorithms is their ability to generalize. This is why we create both a validation set, used for tuning our model selection during training, and a final test set, used only at the end of the process to confirm the validity of the selected algorithm.

To understand the importance of selecting valid features and of avoiding memorizing the data (also referred to as overfitting in the literature, a term we'll use from now on), let's use a joke taken from an xkcd comic as an example (http://xkcd.com/1122):

 "Up until 1996, no democratic US presidential candidate who was an incumbent and with no combat experience had ever beaten anyone whose first name was worth more in Scrabble."

It's apparent that such a rule is meaningless: how could the question of how much a name is worth in Scrabble bear any relevance to the selection of a US president? But it underscores the importance of selecting valid features. The rule also has no predictive power over unknown data. We'll call this overfitting, which refers to making predictions that fit the data at hand perfectly but don't generalize to larger datasets. Overfitting is the process of trying to make sense of what we'll call noise (information that does not have any real meaning) and of fitting the model to small perturbations.

To further explain this, let's try to use machine learning to predict the trajectory of a ball thrown from the ground up into the air (not perpendicularly) until it reaches the ground again. Physics teaches us that the trajectory is shaped like a parabola, and we expect that a good machine learning algorithm observing thousands of such throws would come up with a parabola as a solution. However, if we were to zoom in on the ball and observe the smallest fluctuations in the air due to turbulence, we might notice that the ball does not hold a steady trajectory but is subject to small perturbations, which in this case are the noise. A machine learning algorithm that tries to model these small perturbations would fail to see the big picture and produce a result that is not satisfactory. In other words, overfitting is the process that makes the machine learning algorithm see the trees but forget about the forest:

A good prediction model versus a bad (overfitted) prediction model, with the trajectory of a ball thrown from the ground 
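
As a rough sketch of this example, the following snippet fits both a degree-2 polynomial (the underlying parabola) and a needlessly high-degree polynomial to a noisy synthetic trajectory. The physical constants, noise level, and polynomial degrees are arbitrary illustration choices.

```python
import numpy as np

# Synthetic trajectory: height as a parabola of time, plus small "turbulence" noise.
t = np.linspace(0, 2, 20)
true_height = -4.9 * t ** 2 + 9.8 * t
observed = true_height + np.random.normal(scale=0.1, size=t.shape)

# A good model: a degree-2 polynomial, matching the shape physics tells us to expect.
good_coeffs = np.polyfit(t, observed, deg=2)

# An overfitted model: a high-degree polynomial that also chases the noise.
overfit_coeffs = np.polyfit(t, observed, deg=12)

# Compare both fits against the noise-free trajectory on unseen time points.
t_new = np.linspace(0, 2, 200)
true_new = -4.9 * t_new ** 2 + 9.8 * t_new
good_error = np.mean((np.polyval(good_coeffs, t_new) - true_new) ** 2)
overfit_error = np.mean((np.polyval(overfit_coeffs, t_new) - true_new) ** 2)

# The high-degree fit typically shows the larger error on points it hasn't seen,
# because it models the perturbations (noise) rather than the overall parabola.
print(good_error, overfit_error)
```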

This is why we separate the training data from the validation and test data; if the accuracy on the test data is not similar to the accuracy on the training data, that is a good indication that the model overfits. We also need to make sure that we don't make the opposite error, that is, underfitting the model. In practice, though, since we aim to make our prediction model as accurate as possible on our training data, underfitting is much less of a risk, and most of our care goes into avoiding overfitting.
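
As a hedged illustration of this check (not an example from the original text), the snippet below trains decision trees of different depths on synthetic data and compares training and test accuracy. A training accuracy much higher than the test accuracy suggests overfitting, while low accuracy on both suggests underfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification dataset, split into training and test sets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A depth-1 tree may be too simple (risk of underfitting), while an
# unconstrained tree can memorize the training data (risk of overfitting).
for max_depth in (1, 5, None):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(max_depth, round(train_acc, 3), round(test_acc, 3))
```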

The following image depicts underfitting:

Underfitting can be a problem as well