
K-means++

Finding the optimal initial configuration is equivalent to minimizing the inertia; however, Arthur and Vassilvitskii (in K-means++: The Advantages of Careful Seeding, Arthur D., Vassilvitskii S., Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007) have proposed an alternative initialization method (called K-means++), which can dramatically improve the convergence speed by choosing initial centroids that have a higher probability of being close to the final ones. The complete proof is quite complex and can be found in the aforementioned paper. In this context, we present the final results directly, together with some important consequences.

Let's consider the function D(x) defined as:

$$D(x) = \min_{j=1,\ldots,m} \lVert x - \mu_j \rVert$$

where μ1, ..., μm are the centroids already selected. D(x) therefore represents the shortest distance between a sample x ∈ X and its closest already-selected centroid. Once the function has been computed, it's possible to determine a probability distribution G(x) as follows:

$$G(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$
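As a concrete illustration, here is a minimal NumPy sketch computing D(x) for every sample and the resulting distribution G(x), given the centroids selected so far (the function name seeding_distribution is mine, not from the paper):

```python
import numpy as np

def seeding_distribution(X, centroids):
    # Pairwise distances between each sample (rows of X) and each
    # already-selected centroid; shape (n_samples, n_centroids)
    distances = np.linalg.norm(
        X[:, np.newaxis, :] - np.asarray(centroids)[np.newaxis, :, :], axis=2)

    # D(x): distance from each sample to its closest selected centroid
    D = distances.min(axis=1)

    # G(x): probability proportional to the squared distance D(x)^2
    return D ** 2 / np.sum(D ** 2)
```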

The first centroid μ1 is sampled from a uniform distribution over X. At this point, it's possible to compute D(x) for all samples x ∈ X and, therefore, the distribution G(x). As G(x) is proportional to D(x)², sampling from it favors points that are far from the centroids already selected; since dense regions contain many such points, the probability of picking the next centroid inside a dense, still-uncovered region is much higher than with uniform sampling, while points lying in the sparse separating regions receive very little probability mass. Hence, we continue by sampling μ2 from G(x), recomputing D(x) and G(x) after each selection, until all K centroids have been determined. Of course, as this is a probabilistic approach, we have no guarantee that the final configuration is optimal. However, K-means++ is O(log K)-competitive. In fact, if Sopt is the theoretical optimal value for S, the authors proved that the following inequality holds:

$$E[S] \leq 8(\ln K + 2)\,S_{opt}$$
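Putting the steps together, here is a self-contained sketch of the seeding procedure (illustrative only; the function kmeanspp_init is hypothetical, and production implementations such as scikit-learn's add refinements like evaluating several candidate points per step):

```python
import numpy as np

def kmeanspp_init(X, K, seed=None):
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # First centroid: sampled uniformly from the dataset
    centroids = [X[rng.integers(n)]]

    for _ in range(1, K):
        # D(x): shortest distance between each sample and the selected centroids
        distances = np.linalg.norm(
            X[:, np.newaxis, :] - np.array(centroids)[np.newaxis, :, :], axis=2)
        D = distances.min(axis=1)

        # G(x) = D(x)^2 / sum D(x)^2; sample the next centroid from it
        G = D ** 2 / np.sum(D ** 2)
        centroids.append(X[rng.choice(n, p=G)])

    return np.array(centroids)
```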

As S is decreased by a better initial choice, the previous formula sets an upper bound for the expected value E[S] that grows roughly logarithmically with K. For example, E[S] ≤ 34.42·Sopt for K=10 and E[S] ≤ 24.79·Sopt for K=3. This result reveals two important elements. The first is that K-means++ performs better when K is not extremely large; the second, probably also the most important, is that a single K-means++ initialization cannot be enough to obtain the optimal configuration. For this reason, common implementations (for example, scikit-learn) perform a variable number of independent initializations and keep the run whose final inertia is the smallest.
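In scikit-learn, this corresponds to the default init='k-means++' of KMeans, with the n_init parameter controlling the number of independent initializations; a minimal usage sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset for illustration
X, _ = make_blobs(n_samples=1000, centers=3, random_state=1000)

# K-means++ seeding (the default) with 10 independent restarts;
# the run achieving the lowest final inertia is kept
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=1000)
km.fit(X)

print(km.inertia_)
```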
