官术网_书友最值得收藏!

Semi-supervised learning algorithms

A semi-supervised scenario can be considered as a standard supervised one that exploits some features belonging to unsupervised learning techniques. A very common problem, in fact, arises when it's easy to obtain large unlabeled datasets but the cost of labeling is very high. Hence, it's reasonable to label only a fraction of the samples and to propagate the labels to all unlabeled ones whose distance from a labeled sample is below a predefined threshold. If the dataset has been drawn from a single data generating process and the labeled samples are uniformly distributed, a semi-supervised algorithm can achieve an accuracy comparable with a supervised one. In this book, we are not discussing these algorithms; however, it's helpful to briefly introduce two very important models:

  • Label propagation
  • Semi-supervised Support Vector Machines

The first one is called label propagation and its goal is to propagate the labels of a few samples to a larger population. This goal is achieved by considering a graph where each vertex represents a sample and every edge is weighted using a distance function. Through an iterative procedure, all labeled samples will send a fraction of their label values to all their neighbors and the process is repeated until the labels stop changing. This system has a stable point (that is, a configuration that cannot evolve anymore) and the algorithm can easily reach it with a limited number of iterations.

Label propagation is extremely helpful in all those contexts where some samples can be labeled according to a similarity measure. For example, an online store could have a large base of customers, but only 10% have disclosed their gender. If the feature vectors are rich enough to represent the common behavior of male and female users, it's possible to employ the label propagation algorithm to guess the gender of customers who haven't disclosed it. Of course, it's important to remember that all the assignments are based on the assumption that similar samples have the same label. This can be true in many situations, but it can also be misleading when the complexity of the feature vectors increases.

Another important family of semi-supervised algorithms is based on the extension of standard SVM, (short for Support Vector Machine) to datasets containing unlabeled samples. In this case, we don't want to propagate existing labels, but rather the classification criterion. In other words, we want to train the classifier using the labeled dataset and extend the discriminative rule to the unlabeled samples as well.

Contrary to the standard procedure that can only evaluate unlabeled samples, a semi-supervised SVM uses them to correct the separating hyperplane. The assumption is always based on the similarity: if A has label 1 and the unlabeled sample B has d(A, B) < ε (where ε is a predefined threshold), it's reasonable to assume that the label of B is also 1. In this way, the classifier can achieve high accuracy on the whole dataset even if only a subset has been manually labeled. Similar to label propagation, these kinds of model are reliable only when the structure of the dataset is not extremely complex and, in particular, when the similarity assumption holds (unfortunately there are some cases where it's extremely difficult to find a suitable distance metric, hence many similar samples are indeed dissimilar and vice versa).

主站蜘蛛池模板: 梧州市| 舟曲县| 明光市| 东兰县| 建始县| 余干县| 永清县| 玉门市| 竹溪县| 临清市| 明溪县| 长白| 岑巩县| 宁安市| 荃湾区| 南澳县| 开鲁县| 砀山县| 惠州市| 新津县| 婺源县| 吉隆县| 郎溪县| 遵义县| 读书| 东莞市| 屏东县| 中西区| 当涂县| 花莲县| 南郑县| 河北省| 鲁甸县| 绿春县| 镇巴县| 宣恩县| 永吉县| 安化县| 黑水县| 信阳市| 云和县|