
Cluster analysis

Cluster analysis (usually just called clustering) is an example of a task where we want to find common features among large sets of samples. In this case, we always suppose the existence of a data-generating process p_data, and we define the dataset X as a set of N samples drawn from it:

X = {x_1, x_2, ..., x_N}, where x_i ∈ ℝ^m

A clustering algorithm is based on the implicit assumption that samples can be grouped according to their similarities. In particular, given two vectors a and b, a similarity function s(a, b) is defined as the reciprocal of a metric (distance) function d(a, b). For example, if we are working in a Euclidean space, we have:

d(a, b) = ‖a − b‖_2 and s(a, b) = 1 / (d(a, b) + ε)
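As a minimal sketch of a distance-based similarity in a Euclidean space, the following Python snippet assumes the common form s(a, b) = 1 / (d(a, b) + ε). The function names and the value chosen for ε are illustrative assumptions, not something fixed by the text.

```python
import math

# Assumed small constant; the text does not prescribe a specific value.
EPSILON = 1e-8


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def similarity(a, b, eps=EPSILON):
    """Similarity as the reciprocal of the distance, shifted by eps."""
    return 1.0 / (euclidean_distance(a, b) + eps)


a = [0.0, 0.0]
b = [3.0, 4.0]  # farther from a
c = [1.0, 0.0]  # closer to a

# The closer vector receives the higher similarity.
print(similarity(a, c) > similarity(a, b))

# Thanks to eps, comparing a vector with itself does not divide by zero.
print(math.isfinite(similarity(a, a)))
```

Because the similarity is a strictly decreasing function of the distance, ranking candidates by increasing distance or by decreasing similarity gives the same order.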

In the similarity definition, the constant ε has been introduced to avoid division by zero. It follows immediately that d(a, c) < d(a, b) ⇒ s(a, c) > s(a, b). Therefore, given a representative μ_i of each cluster, we can create the set of assigned vectors considering the rule:

C_i = {x ∈ X : d(x, μ_i) < d(x, μ_j) for all j ≠ i}

In other words, a cluster contains all the samples that are closer to its representative than to any other representative. Equivalently, each sample in a cluster is more similar to that cluster's representative than to the representative of any other cluster. Moreover, after the assignment, a sample gains the right to share its features with the other members of the same cluster.
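The nearest-representative rule described above can be sketched as follows. This is only an illustration under stated assumptions: the representatives are given in advance, the distance is Euclidean, ties are broken in favor of the first representative, and the toy data and the helper name assign_to_clusters are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def assign_to_clusters(samples, representatives):
    """Group samples by their nearest representative.

    Returns one list of samples per representative, in the same order.
    """
    clusters = [[] for _ in representatives]
    for x in samples:
        distances = [euclidean_distance(x, mu) for mu in representatives]
        # Index of the closest representative (ties go to the lowest index).
        nearest = distances.index(min(distances))
        clusters[nearest].append(x)
    return clusters


# Hypothetical toy data: two representatives and four samples.
representatives = [[0.0, 0.0], [10.0, 10.0]]
samples = [[1.0, 1.0], [9.0, 8.0], [0.5, -0.5], [11.0, 10.0]]

clusters = assign_to_clusters(samples, representatives)
for mu, members in zip(representatives, clusters):
    print(mu, members)
```

Minimizing the distance and maximizing a reciprocal similarity select the same representative, so the sketch works directly with distances.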

In fact, one of the most important applications of cluster analysis is to exploit the homogeneity of samples that are recognized as similar. For example, a recommendation engine could be based on the clustering of user vectors (containing information about their interests and purchased products). Once the groups have been defined, all the elements belonging to the same cluster are considered similar, hence we are implicitly authorized to share their differences among them. If user A has bought product P and rated it positively, we can suggest this item to user B, who belongs to the same cluster but has not bought it, and vice versa. The process can appear arbitrary, but it turns out to be extremely effective when the number of elements is large and the feature vectors contain many discriminative elements (for example, ratings).
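The recommendation idea can be illustrated with a small sketch. It assumes that the clusters have already been computed, that ratings are stored as a dictionary of dictionaries, and that a rating of 4 or more on a 1 to 5 scale counts as positive. All of these, together with the function name suggest_items, are hypothetical choices made only for the example.

```python
def suggest_items(user, cluster_members, ratings, threshold=4):
    """Suggest items that cluster-mates rated positively and `user` lacks.

    ratings maps each user to a dict of {item: rating}; threshold is the
    assumed minimum rating that counts as a positive evaluation.
    """
    owned = set(ratings.get(user, {}))
    suggestions = set()
    for other in cluster_members:
        if other == user:
            continue
        for item, rating in ratings.get(other, {}).items():
            if rating >= threshold and item not in owned:
                suggestions.add(item)
    return sorted(suggestions)


# Hypothetical data: users A and B were assigned to the same cluster.
ratings = {
    "A": {"P": 5, "Q": 2},
    "B": {"R": 4},
}
cluster = ["A", "B"]

print(suggest_items("B", cluster, ratings))  # items A liked that B lacks
print(suggest_items("A", cluster, ratings))  # items B liked that A lacks
```

The sharing is symmetric: each member of the cluster can receive the positively rated items of every other member.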
