Homogeneity score

The homogeneity score is complementary to the previous one and is based on the assumption that a cluster must contain only samples having the same true label. It is defined as:

h = 1 − H(Ytrue|Ypred) / H(Ytrue)

Analogously to the completeness score, when H(Ytrue|Ypred) → H(Ytrue), the assignments have no impact on the conditional entropy, hence the uncertainty is not reduced after the clustering (for example, every cluster contains samples belonging to all classes) and h → 0. Conversely, when H(Ytrue|Ypred) → 0, h → 1, because knowledge of the predictions has reduced the uncertainty about the true assignments and the clusters contain almost exclusively samples with the same label. It's important to remember that this score alone is not enough, because it doesn't guarantee that a single cluster contains all samples xi ∈ X with the same true label. That's why the homogeneity score is always evaluated together with the completeness score.
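The limit cases described above can be checked on a toy example. The following is only a minimal sketch that computes h directly from the label counts using the standard library; the helper names (entropy, conditional_entropy, homogeneity) and the toy label lists are hypothetical and are not part of the original example. Returning 1 when H(Ytrue) = 0 is a convention assumed here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(Y) of a label sequence (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(y_true, y_pred):
    """H(Ytrue|Ypred) = -sum over (class, cluster) of n_ck/n * log(n_ck/n_k)."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))   # n_ck: samples of class c in cluster k
    cluster_sizes = Counter(y_pred)        # n_k: samples in cluster k
    return -sum((n_ck / n) * log(n_ck / cluster_sizes[k])
                for (_, k), n_ck in joint.items())


def homogeneity(y_true, y_pred):
    """h = 1 - H(Ytrue|Ypred) / H(Ytrue)."""
    h_true = entropy(y_true)
    if h_true == 0.0:
        return 1.0  # assumed convention: a single true class is trivially homogeneous
    return 1.0 - conditional_entropy(y_true, y_pred) / h_true


y_true = [0, 0, 0, 1, 1, 1]

# Every cluster contains a single class: h should be 1
print(homogeneity(y_true, [0, 0, 0, 1, 1, 1]))

# One cluster per sample: still perfectly homogeneous, although useless
print(homogeneity(y_true, [0, 1, 2, 3, 4, 5]))

# A single cluster mixing both classes: no uncertainty is removed, h should be 0
print(homogeneity(y_true, [0, 0, 0, 0, 0, 0]))
```

The second case illustrates why h alone is not sufficient: assigning every sample to its own cluster yields perfectly pure clusters, yet samples sharing the same true label are scattered across many clusters.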

For the Breast Cancer Wisconsin dataset and K=2, we obtain the following:

from sklearn.metrics import homogeneity_score

print('Homogeneity: {}'.format(homogeneity_score(kmdff['diagnosis'], kmdff['prediction'])))

The corresponding output is as follows:

Homogeneity: 0.42229071246999117

This value (in particular, for K=2) confirms our initial analysis. At least one cluster (the one with the majority of benign samples) is not completely homogeneous, because it contains samples belonging to both classes. However, as the value is not very close to 0, we can conclude that the assignments are partially correct. Considering both values, h and c, we can deduce that K-means is not performing extremely well (probably because of non-convexity), but, at the same time, it's able to correctly separate all those samples whose nearest cluster distance is above a specific threshold. It goes without saying that, knowing the ground truth, we cannot easily accept K-means and we should look for another algorithm that is able to yield both h → 1 and c → 1.
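As a small illustration of why h and c are read together, the following sketch evaluates both scores on hand-made assignments. The label lists are hypothetical toy data and are unrelated to the Breast Cancer Wisconsin results above.

```python
from sklearn.metrics import completeness_score, homogeneity_score

# Hypothetical ground truth with two classes
y_true = [0, 0, 0, 1, 1, 1]

# Hypothetical assignments: every cluster is pure,
# but class 0 is split between clusters 0 and 2
y_pred = [0, 0, 2, 1, 1, 1]

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)

print('Homogeneity: {}'.format(h))
print('Completeness: {}'.format(c))
```

In this toy case, the homogeneity is maximal because no cluster mixes the two classes, while the completeness is penalized because the samples of one class are spread over two clusters.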
