- Hands-On Machine Learning with JavaScript
- Burak Kanber
An overview
One misconception I would like to dispel early on is that implementing the ML algorithm itself is the bulk of the work you'll need to do to accomplish some task. If you're new to this, you may be under the impression that 95% of your time should be spent on implementing a neural network, and that the neural network is solely responsible for the results you get. Build a neural network, put data in, magically get results out. What could be easier?
The reality of ML is that the algorithm you use is only as good as the data you put into it. Furthermore, the results you get are only as good as your ability to process and interpret them. The age-old computer science acronym GIGO fits well here: Garbage In, Garbage Out.
When implementing ML techniques, you must also pay close attention to the preprocessing and postprocessing of your data. Data preprocessing is required for many reasons, and is the focus of this chapter. Postprocessing relates to your interpretation of the algorithm's output: whether your confidence in the algorithm's result is high enough to take action on it, and your ability to apply the results to your business problem. Since postprocessing of results strongly depends on the algorithm in question, we'll address postprocessing considerations as they come up in our specific examples throughout this book.
Preprocessing of data, like postprocessing of data, often depends on the algorithm used, as different algorithms have different requirements. One straightforward example is image processing with Convolutional Neural Networks (CNNs), covered in a later chapter. All images processed by a single CNN are expected to have the same dimensions, or at least the same number of pixels and the same number of color channels (RGB versus RGBA versus grayscale, and so on). A CNN is configured to expect a specific number of inputs, so every image you give it must be preprocessed to make sure it complies with the network's expectations. You may need to resize, scale, crop, or pad input images before feeding them to the network. You may need to convert color images to grayscale. You may need to detect and remove corrupted images from your dataset.
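As a rough, hypothetical sketch of that kind of preprocessing (not code from a specific library), the function below collapses a flat RGBA pixel buffer, such as the data array of a canvas ImageData object, into grayscale intensities scaled to the 0–1 range. The luminance weights are a common convention, assumed here purely for illustration:

```js
// Minimal sketch: convert a flat RGBA buffer (4 bytes per pixel) into
// grayscale intensities in [0, 1]. The 0.299/0.587/0.114 luminance
// weights are a common convention, assumed here for illustration.
function toGrayscale(rgbaBuffer) {
    const pixels = [];
    for (let i = 0; i < rgbaBuffer.length; i += 4) {
        const r = rgbaBuffer[i];
        const g = rgbaBuffer[i + 1];
        const b = rgbaBuffer[i + 2];
        pixels.push((0.299 * r + 0.587 * g + 0.114 * b) / 255);
    }
    return pixels;
}
```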
Some algorithms simply won't work if you attempt to give them the wrong input. If a CNN expects 10,000 grayscale pixel intensity inputs (namely an image that's 100 x 100 pixels), there's no way you can give it an image that's sized 150 x 200. This is a best-case scenario for us: the algorithm fails loudly, and we are able to change our approach before attempting to use our network.
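Staying with the hypothetical 100 x 100 grayscale network, a simple guard like the one below makes that loud failure explicit before the data ever reaches the model; the expected length is an assumption for this example:

```js
// Minimal sketch: refuse input whose size does not match what the
// (hypothetical) network was configured for, so the failure is loud
// and happens before training or prediction.
function assertInputSize(pixels, expectedLength = 100 * 100) {
    if (pixels.length !== expectedLength) {
        throw new Error(
            `Expected ${expectedLength} pixel intensities, got ${pixels.length}`
        );
    }
    return pixels;
}
```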
Other algorithms, however, will fail silently if you give them bad input. The algorithm will appear to be working, and even give you results that look reasonable but are actually wholly inaccurate. This is our worst-case scenario: we think the algorithm is working as expected, but in reality we're in a GIGO situation. Just think about how long it will take you to discover that the algorithm is actually giving you nonsensical results. How many bad business decisions have you made based on incorrect analysis or poor data? These are the types of situations we must avoid, and it all starts at the beginning: making sure the data we use is appropriate for the application.
Most ML algorithms make assumptions about the data they process. Some algorithms expect data to be of a given size and shape (as in neural networks), some expect data to be bucketed, some expect data to be normalized over a range (between 0 and 1 or between -1 and +1), and some are resilient to missing values while others aren't. It is ultimately your responsibility to understand what assumptions the algorithm makes about your data and to align your data with those expectations.
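As one concrete example of meeting such an expectation, the sketch below applies generic min-max scaling to squash a numeric feature into the 0 to 1 range; it illustrates the idea rather than any particular library's API:

```js
// Minimal sketch: min-max scaling of a numeric array into [0, 1].
// Each value is shifted by the minimum and divided by the range.
function minMaxNormalize(values) {
    const min = Math.min(...values);
    const max = Math.max(...values);
    const range = max - min || 1; // avoid division by zero for constant data
    return values.map(v => (v - min) / range);
}

// Example: [5, 10, 15] becomes [0, 0.5, 1]
console.log(minMaxNormalize([5, 10, 15]));
```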
For the most part, the aforementioned relates to the format, shape, and size of data. There is another consideration: the quality of the data. A data point may be perfectly formatted and aligned with the expectations of an algorithm, but still be wrong. Perhaps someone wrote down the wrong value for a measurement, maybe there was an instrumentation failure, or maybe some environmental effect has contaminated or tainted your data. In these cases, the format, shape, and size may be correct, but the data itself may harm your model and prevent it from converging on a stable or accurate result. In many of these cases, the data point in question is an outlier: a data point that doesn't seem to fit within the rest of the set.
Outliers exist in real life, and are often valid data. It's not always apparent by looking at the data by itself whether an outlier is valid or not, and we must also consider the context and algorithm when determining how to handle the data. For instance, let's say you're running a meta-analysis that relates patients' height to their heart performance and you've got 100 medical records available to analyze. One of the patients is listed with a height of 7'3" (221 cm). Is this a typo? Did the person who recorded the data actually mean 6'3" (190 cm)? What are the odds that, of only 100 random individuals, one of them is actually that tall? Should you still use this data point in your analysis, even though it will skew your otherwise very clean-looking results? What if the sample size were 1 million records instead of only 100? In that case, it's much more likely that you did actually select a very tall person. What if the sample size were only 100, but they were all NBA players?
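One common way to surface candidates like that 7'3" record, offered here only as an illustrative sketch rather than a prescribed method, is a z-score test that flags values far from the mean. Note that it merely flags points for review; it does not decide whether they are legitimate:

```js
// Minimal sketch: flag values more than `threshold` standard deviations
// from the mean. Flagged values are candidates for review, not
// automatic deletions.
function flagOutliers(values, threshold = 3) {
    const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
    const variance =
        values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
    const stdDev = Math.sqrt(variance);
    if (stdDev === 0) return []; // all values identical; nothing to flag
    return values.filter(v => Math.abs(v - mean) / stdDev > threshold);
}
```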
As you can see, dealing with outliers is not straightforward. You should always be hesitant to discard data, especially if in doubt. By discarding data, you run the risk of creating a self-fulfilling prophecy by which you've consciously or subconsciously selected only the data that will support your hypothesis, even if your hypothesis is wrong. On the other hand, using legitimately bad data can ruin your results and prevent progress.
In this chapter, we will discuss a number of different considerations you must make when preprocessing data, including data transformations, handling missing data, selecting the correct parameters, handling outliers, and other forms of analysis that will be helpful in the data preprocessing stage.