
The curse of dimensionality

In ML applications, we often have high-dimensional data. If we're recording 50 different metrics for each of our shoppers, we're working in a space with 50 dimensions. If we're analyzing grayscale images sized 100 x 100, we're working in a space with 10,000 dimensions. If the images are RGB-colored, the dimensionality increases to 30,000 dimensions (one dimension for each color channel in each pixel in the image)!
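As a quick check on those numbers, here is a minimal Python sketch (using only the sizes from the example above) that computes the length of the flattened feature vector in each case:

```python
# Each sample is flattened into one long feature vector;
# the vector's length is the dimensionality we work in.
shopper_dimensions = 50              # 50 metrics per shopper
grayscale_dimensions = 100 * 100     # one intensity value per pixel
rgb_dimensions = 100 * 100 * 3       # three color channels per pixel

print(shopper_dimensions, grayscale_dimensions, rgb_dimensions)
# 50 10000 30000
```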

This problem is called the curse of dimensionality. On one hand, ML excels at analyzing data with many dimensions; humans are not good at finding patterns that are spread across so many dimensions, especially when those dimensions are interrelated in counter-intuitive ways. On the other hand, as we add more dimensions we increase both the processing power needed to analyze the data and the amount of training data required to build meaningful models.

One area that clearly demonstrates the curse of dimensionality is natural language processing (NLP). Imagine you are using a Bayesian classifier to perform sentiment analysis of tweets relating to brands or other topics. As you will learn in a later chapter, part of data preprocessing for NLP is tokenization of input strings into n-grams, or groups of words. Those n-grams are the features that are given to the Bayesian classifier algorithm.

Consider a few input strings: I love cheese, I like cheese, I hate cheese, I don't love cheese, I don't really like cheese. These examples are straightforward to us, since we've been using natural language our entire lives. How would an algorithm view these examples, though? If we are doing a 1-gram or unigram analysis, meaning that we split the input string into individual words, we see love in the first example, like in the second, hate in the third, love in the fourth, and like in the fifth. Our unigram analysis may be accurate for the first three examples, but it fails on the fourth and fifth because it never learns that don't love and don't really like are coherent statements; the algorithm is only looking at the effects of individual words. This algorithm runs very quickly and requires little storage space, because there are only seven unique words across the five phrases above (I, love, cheese, like, hate, don't, and really).
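As a rough illustration (a minimal sketch, not taken from any particular NLP library; the ngrams helper and the examples list are names chosen here for illustration), the unigram view of those five strings looks like this:

```python
def ngrams(text, n=1):
    """Split a lowercased string into overlapping groups of n words (n-grams)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

examples = [
    "I love cheese",
    "I like cheese",
    "I hate cheese",
    "I don't love cheese",
    "I don't really like cheese",
]

# Unigram view: every word stands alone, so "don't" and "love" lose their connection.
for text in examples:
    print(ngrams(text, n=1))
# e.g. the fourth line prints ['i', "don't", 'love', 'cheese']
```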

You may then modify the tokenization preprocessing to use bigrams, or 2-grams: groups of two words at a time. This increases the dimensionality of our data, requiring more storage space and processing time, but also yields better results. The algorithm now sees dimensions like I love and love cheese, and can recognize that don't love is different from I love. Using the bigram approach, the algorithm may correctly identify the sentiment of the first four examples but still fail on the fifth, which is parsed as I don't, don't really, really like, and like cheese. The classification algorithm will see really like and like cheese and incorrectly relate them to the positive sentiment in the second example. Still, the bigram approach works for 80% of our examples.
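Continuing the sketch above, calling the same hypothetical ngrams helper with n=2 makes the difference visible:

```python
# Bigram view: "don't love" becomes its own feature in the fourth example,
# but the fifth example still ends in "really like" and "like cheese".
print(ngrams("I don't love cheese", n=2))
# ["i don't", "don't love", 'love cheese']
print(ngrams("I don't really like cheese", n=2))
# ["i don't", "don't really", 'really like', 'like cheese']
```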

You might now be tempted to upgrade the tokenization once more to capture trigrams, or groups of three words at a time. Instead of getting an increase in accuracy, the algorithm takes a nosedive and is unable to correctly identify anything. We now have too many dimensions in our data. The algorithm learns what I love cheese means, but no other training example includes the phrase I love cheese, so that knowledge can't be applied in any way. The fifth example parses into the trigrams I don't really, don't really like, and really like cheese, none of which have ever been encountered before! This algorithm ends up giving you a 50% sentiment for every example, because there simply isn't enough data in the training set to capture all of the relevant combinations of trigrams.
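Continuing the same sketch, we can check directly that none of the fifth example's trigrams appear anywhere in the first four training strings:

```python
# Trigram view: gather every trigram seen in the first four examples ...
seen = {gram for text in examples[:4] for gram in ngrams(text, n=3)}

# ... and check whether the fifth example's trigrams were ever encountered.
for gram in ngrams("I don't really like cheese", n=3):
    print(gram, gram in seen)
# i don't really False
# don't really like False
# really like cheese False
```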

This is the curse of dimensionality at play: the trigram approach may indeed give you better accuracy than the bigram approach, but only if you have a huge training set that provides data on all the different possible combinations of three words at a time. You also need a tremendous amount of storage space, because there are many more possible combinations of three words than of two words. The choice of preprocessing approach will therefore depend on the context of the problem, the computing resources available, and the training data available to you. If you have a lot of training data and tons of resources, the trigram approach may be more accurate, but in more realistic conditions the bigram approach may be better overall, even if it does misclassify some tweets.
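The storage argument can also be made concrete: with a vocabulary of V distinct words there are V^n possible n-grams, so even the toy vocabulary of seven words from the sketch above yields far more possible trigrams than bigrams:

```python
# With V distinct words there are V**n possible n-grams, so the feature space
# a model might need to cover grows exponentially with n.
vocabulary_size = len({word for text in examples for word in ngrams(text, n=1)})
for n in (1, 2, 3):
    print(n, vocabulary_size ** n)
# 1 7
# 2 49
# 3 343
```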

The preceding discussion relates to the concepts of feature selection, feature extraction, and dimensionality reduction. In general, our goal is to select only relevant features (ignoring shopper trends that aren't interesting to us), extract or derive features that better represent our data (by using facial measurements rather than raw photograph pixels), and ultimately reduce dimensionality so that we use the fewest, most relevant dimensions we can.
