
Supervised learning

Supervised learning algorithms are a class of machine learning algorithms that learn from previously labeled data, so that they can then classify similar but unlabeled data. Let's use an example to understand this concept better.

Let's assume that a user receives a large number of emails every day, some of which are important business emails and some of which are unsolicited junk emails, also known as spam. A supervised machine learning algorithm will be presented with a large body of emails that have already been labeled by a teacher as spam or not spam (this is called training data). For each sample, the machine will try to predict whether the email is spam or not, and it will compare the prediction with the original target label. If the prediction differs from the target, the machine will adjust its internal parameters in such a way that the next time it encounters this sample, it will classify it correctly. Conversely, if the prediction was correct, the parameters will stay the same. The more training data we feed to the algorithm, the better it becomes (this rule has caveats, as we'll see next).
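The training loop described above can be sketched with a perceptron-style update rule. This is a minimal sketch under stated assumptions, not the only way to train a spam classifier: the emails, the labels, and the names predict and train are made up for illustration, and each email is reduced to the set of words it contains.

```python
# Toy training data (made up): each email is a set of words plus a
# teacher-provided label, where 1 = spam and 0 = not spam.
training_data = [
    ({"free", "offer", "winner"}, 1),
    ({"meeting", "agenda", "monday"}, 0),
    ({"free", "prize", "click"}, 1),
    ({"project", "meeting", "report"}, 0),
]


def predict(weights, bias, words):
    """Return 1 (spam) if the summed word weights plus bias are positive."""
    score = bias + sum(weights.get(word, 0.0) for word in words)
    return 1 if score > 0 else 0


def train(data, epochs=10, learning_rate=1.0):
    """Adjust the parameters only when a prediction differs from its target."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for words, target in data:
            error = target - predict(weights, bias, words)
            if error == 0:
                continue  # correct prediction: parameters stay the same
            for word in words:
                weights[word] = weights.get(word, 0.0) + learning_rate * error
            bias += learning_rate * error
    return weights, bias


weights, bias = train(training_data)
predictions = [predict(weights, bias, words) for words, _ in training_data]
print(predictions)
```

On this tiny dataset, the parameters stop changing after the first pass, because every training email is then classified correctly. Real email data is rarely this cleanly separable.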

In the example we used, the emails had only two classes (spam or not spam), but the same principles apply for tasks with arbitrary numbers of classes. For example, we could train the software on a set of labeled emails where the classes are Personal, Business/Work, Social, or Spam.

In fact, Gmail, the free email service by Google, allows the user to select up to five categories, which are labeled as the following:

  • Primary: Includes person-to-person conversations
  • Social: Includes messages from social networks and media-sharing sites
  • Promotions: Includes marketing emails, offers, and discounts
  • Updates: Includes bills, bank statements, and receipts
  • Forums: Includes messages from online groups and mailing lists

In some cases, the outcome may not necessarily be discrete, and we may not have a finite number of classes to classify our data into. For example, we may try to predict the life expectancy of a group of people based on their predetermined health parameters. In this case, the outcome is a continuous value, that is, the number of years the person is expected to live, and we don't talk about classification but rather regression.
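To make the contrast with classification concrete, here is a minimal regression sketch that fits a straight line by ordinary least squares. The numbers are invented purely for illustration and carry no real-world meaning; health_score, years, and predict_years are hypothetical names.

```python
# Made-up data: a single hypothetical health score per person and the
# number of years that person lived. These values are NOT real measurements.
health_score = [0.0, 1.0, 2.0, 3.0, 4.0]
years = [70.0, 72.0, 74.0, 76.0, 78.0]

n = len(health_score)
mean_x = sum(health_score) / n
mean_y = sum(years) / n

# Closed-form least-squares estimates for the line y = slope * x + intercept.
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(health_score, years))
variance = sum((x - mean_x) ** 2 for x in health_score)
slope = covariance / variance
intercept = mean_y - slope * mean_x


def predict_years(score):
    """Return a continuous prediction rather than a class label."""
    return slope * score + intercept


print(slope, intercept, predict_years(5.0))
```

The output of predict_years is a real number, so there is no finite set of classes to choose from, which is what distinguishes regression from classification.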

One way to think of supervised learning is to imagine we are building a function, f, defined over a dataset, which comprises information organized by features. In the case of email classification, the features can be specific words that may appear more frequently than others in spam emails. The use of explicit sex-related words will most likely identify a spam email rather than a business/work email. In contrast, words such as meeting, business, or presentation are more likely to describe a work email. If we have access to metadata, we may also use the sender's information as a feature. Each email will then have an associated set of features, and each feature will have a value (in this case, how many times the specific word is present in the email body). The machine learning algorithm will then seek to map those values to a discrete range that represents the set of classes, or a real value in the case of regression. The definition of the f function is as follows:

  f: space of features → classes (discrete values) or real values
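The word-count features described above can be sketched as follows. This is a minimal illustration, assuming a small hand-picked vocabulary and a very simple tokenizer; VOCABULARY and extract_features are hypothetical names, not part of any library.

```python
import re
from collections import Counter

# A hypothetical vocabulary: each word is one feature.
VOCABULARY = ["meeting", "business", "presentation", "free", "offer"]


def extract_features(email_body, vocabulary):
    """Return how many times each vocabulary word appears in the email body."""
    tokens = re.findall(r"[a-z]+", email_body.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


features = extract_features("Free offer! Claim your FREE prize", VOCABULARY)
print(features)  # [0, 0, 0, 2, 1]
```

Each email becomes a fixed-length list of numbers, one value per feature, which is the kind of input the function f maps to a class or to a real value.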

In later chapters, we'll see several examples of both classification and regression problems. One such problem we'll discuss is the classification of handwritten digits (the famous Modified National Institute of Standards and Technology, or MNIST, database). When given a set of images representing the digits 0 to 9, the machine learning algorithm will try to classify each image into one of 10 classes, wherein each class corresponds to one of the 10 digits. Each image is 28x28 (= 784) pixels in size. If we think of each pixel as one feature, then the algorithm will use a 784-dimensional feature space to classify the digits.
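To see where the 784 features come from, we can flatten a 28x28 grid of pixel values into a single list. The image below is a blank placeholder created for illustration, not an actual MNIST sample.

```python
# A placeholder 28x28 grayscale image (all black), stored as rows of pixels.
image = [[0 for _ in range(28)] for _ in range(28)]
image[2][3] = 255  # set one pixel to white so we can track where it ends up

# Flatten row by row: each pixel becomes one feature.
features = [pixel for row in image for pixel in row]

print(len(features))  # 784
print(features[2 * 28 + 3])  # 255
```

The pixel at row 2, column 3 ends up at index 2 * 28 + 3 = 59 of the flattened list, so every pixel position corresponds to exactly one dimension of the feature space.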

The following screenshot depicts the handwritten digits from the MNIST dataset:

Example of handwritten digits from the MNIST dataset

In the next sections, we'll talk about some of the most popular classical supervised algorithms. The following is by no means an exhaustive list or a thorough description of each machine learning method; it's a simple review meant to provide the reader with a flavor of the different techniques. For a more thorough treatment, we can refer to the book Python Machine Learning by Sebastian Raschka (https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning). Also, at the end of this chapter, in the Neural networks section, we'll introduce neural networks and talk about how deep learning differs from the classical machine learning techniques.
