- Hands-On Machine Learning with Microsoft Excel 2019
- Julio Cesar Rodriguez Martino
- 2021-06-24 15:11:03
Understanding unsupervised learning with clustering
Clustering is a statistical method that attempts to group the points in a dataset according to a distance measure, usually the Euclidean distance, which is the square root of the sum of the squared differences between the coordinates of a pair of points. Put simply, points assigned to the same cluster are closer to each other (in terms of the chosen distance) than they are to points belonging to other clusters. At the same time, the larger the distance between two clusters, the better we can distinguish them. In other words, we try to build groups whose members are as alike as possible and as different as possible from the members of other groups.
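To make the distance definition concrete, here is a minimal Python sketch of the Euclidean distance between two points. The function name `euclidean_distance` and the sample points are illustrative choices, not something taken from the book.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Illustrative points: a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The same function works for any number of coordinates, as long as both points have the same dimension.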
The core of a clustering algorithm is therefore to define and calculate the distance between any two points, and then to iteratively assign the points to clusters until the cluster composition no longer changes.
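The assign-and-repeat loop described above can be sketched as follows. The text does not name a specific algorithm at this point, so this sketch assumes a k-means-style procedure: each point goes to its nearest cluster center, each center is recomputed as the mean of its members, and the loop stops when the assignments stop changing. The function names, the starting centers, and the sample points are all hypothetical.

```python
import math


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]


def update(points, labels, centers):
    """Recompute each center as the mean of the points assigned to it."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old_center)  # empty cluster: keep its center
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def cluster(points, centers, max_iter=100):
    """Repeat assignment and update until the cluster composition is stable."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:  # no point changed cluster: stop
            break
        labels = new_labels
        centers = update(points, labels, centers)
    return labels, centers


# Hypothetical data: two visually separate groups and two starting centers.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centers = cluster(points, centers=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result of a procedure like this generally depends on the starting centers, so in practice it is often run several times with different starting values.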
There are a few points to consider before attempting a clustering analysis. Not every type of data is suitable for clustering. For example, binary data is a poor fit because it is hard to define a meaningful distance: the values are either 1 or 0, with nothing in between. This excludes the kind of data generated by one-hot encoding. Only data that shows some ordering or scale is useful for clustering. Even when the values are real numbers (for example, a client's expenditure amounts or annual income), it is better to group them into a scale of ranges.
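As a small illustration of grouping real values into a scale of ranges, the following Python sketch maps annual incomes to range indices. The boundaries and the income figures are hypothetical, chosen only for the example, and the helper name `to_range_index` is likewise an assumption.

```python
from bisect import bisect_right


def to_range_index(value, boundaries):
    """Return the index of the range that value falls into.

    boundaries must be sorted in ascending order. Index 0 means "below the
    first boundary"; a value equal to a boundary goes to the upper range.
    """
    return bisect_right(boundaries, value)


# Hypothetical income ranges: <20k, 20k-50k, 50k-100k, 100k and above.
income_boundaries = [20_000, 50_000, 100_000]
incomes = [12_500, 20_000, 48_000, 75_000, 250_000]

print([to_range_index(x, income_boundaries) for x in incomes])
# [0, 1, 1, 2, 3]
```

The resulting indices keep the ordering of the original values, which is the property the text asks of data used for clustering.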
Some examples of clustering use cases are as follows:
- Automatic grouping of IT alerts to assign priorities and solve them accordingly
- Analysis of customer communication through different channels (segmentation in time periods)
- Criminal profiling
- Urban mobility analysis
- Fraud detection (looking for outliers)
- Analysis of athletes' performances
- Crime analysis by geography
- Delivery logistics
- Classification of documents
Now, let's go through some examples that explain the concept of clustering algorithms.