Unsupervised learning: the agent learns patterns from the input without explicit feedback (unlabeled data)
Clustering
Input: Unlabeled dataset
Output: Sets of similar data (based on defined criteria)
Useful for discovering segments in data and applying different business strategies for each segment
Clustering learns patterns and derives groups (clusters) of similar instances
Clustering methods
K-Means clustering
Hierarchical clustering
Gaussian Mixture Models
Spectral clustering
K-Means Clustering
1. Specify k: target number of clusters
2. Select k random instances as initial centroids
3. Repeat: Assign each instance to closest centroid, update centroids
4. Stop when cluster memberships no longer change (see the sketch below)
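A minimal sketch of running K-Means, assuming scikit-learn is available; X is a hypothetical data array, not from the source:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 2)                    # placeholder data: 100 instances, 2 features

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)    # k = 3
    labels = kmeans.fit_predict(X)                # repeats assign/update until memberships stabilize
    print(kmeans.cluster_centers_)                # final centroids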
K-Means Clustering
Instances in the same cluster should be close to each other
Hierarchical Clustering
Dendrogram: A tree diagram depicting closeness through its branches
Clusters are derived by generating a dendrogram and then cutting it
Branches in a hierarchical clustering dendrogram show which instances/groups are close to each other
Instances are represented as terminal nodes in a hierarchical clustering dendrogram
Branches connect nodes/subgroups at different levels in a hierarchical clustering dendrogram
Clusters can be derived from a hierarchical clustering dendrogram
Dendrogram
Allows deriving a clustering for any desired number of clusters k
Deriving k clusters from a dendrogram
Draw a horizontal line across the dendrogram at a height where it intersects exactly k vertical branches; each intersected branch becomes one cluster (see the sketch below)
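A sketch of this cut using SciPy (assumed available); X, k, and the centroid linkage choice are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(50, 2)                         # placeholder data
    Z = linkage(X, method="centroid")                 # encodes the dendrogram
    k = 3
    labels = fcluster(Z, t=k, criterion="maxclust")   # cut so that at most k clusters remain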
Agglomerative Clustering
Bottom-up approach
Start with treating each instance as one cluster
Repeatedly merge the two closest clusters
When a cluster contains more than one instance, use its centroid when measuring distances between clusters (centroid linkage), as in the sketch below
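A rough pure-NumPy sketch of this bottom-up procedure with centroid linkage; the function name and data are hypothetical:

    import numpy as np

    def agglomerate(X, k):
        clusters = [[i] for i in range(len(X))]             # start: each instance is its own cluster
        while len(clusters) > k:
            cents = [X[c].mean(axis=0) for c in clusters]   # centroid of each current cluster
            best = None
            for i in range(len(clusters)):                  # find the two closest clusters
                for j in range(i + 1, len(clusters)):
                    d = np.linalg.norm(cents[i] - cents[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] += clusters[j]                      # merge the closest pair
            del clusters[j]
        return clusters

    X = np.random.rand(20, 2)
    print(agglomerate(X, 3))                                # three lists of instance indices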
Other Clustering Methods
K-Means produces spherical cluster boundaries
Other methods can handle non-spherical boundaries
Other Clustering Methods
Gaussian Mixture Models
Spectral Clustering
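A usage sketch for both methods via scikit-learn (assumed available); X and the parameter values are placeholders:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.cluster import SpectralClustering

    X = np.random.rand(100, 2)                              # placeholder data
    gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
    spec_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)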
Homogeneity
How similar the instances are within a cluster
Heterogeneity
How different the instances are across different clusters
Clustering Evaluation
Inertia: Measures homogeneity (within-cluster sum of squares)
Silhouette score: Measures and weighs both homogeneity and heterogeneity
Computing Inertia
1. For each cluster, determine the cluster centroid
2. For each instance, square its distance from its cluster centroid
3. Sum all these squares
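The three steps above in NumPy; the arrays are illustrative, and the result can be checked against scikit-learn's KMeans.inertia_:

    import numpy as np

    X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
    labels = np.array([0, 0, 1, 1])

    centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])  # step 1
    sq_dists = np.sum((X - centroids[labels]) ** 2, axis=1)              # step 2
    inertia = sq_dists.sum()                                             # step 3 -> 1.0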
Inertia should improve (decrease) at each iteration of K-Means clustering
So the clustering should improve as we run more iterations, though K-Means can still converge to a local optimum rather than the best possible clustering
K-Means
A clustering algorithm that partitions the data into k clusters
K-Means algorithm (one iteration)
1. Assign each instance to the cluster of the closest centroid
2. For each cluster, get the mean for each feature and set the new centroid
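One iteration of those two steps in NumPy; the function name is hypothetical, and empty clusters are ignored for brevity:

    import numpy as np

    def kmeans_iteration(X, centroids):
        # 1. assign each instance to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. new centroid = per-feature mean of each cluster's instances
        new_centroids = np.array([X[labels == c].mean(axis=0)
                                  for c in range(len(centroids))])
        return labels, new_centroids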
Inertia
A measure of the compactness of the clusters, calculated as the sum of squared distances between each instance and its assigned cluster centroid
Inertia
Ideally, the smaller the inertia value, the better the clustering
Inertia (extreme case)
Number of clusters equals the number of instances: each instance is its own centroid, so the inertia is 0, yet this is not a useful clustering. This is why inertia alone cannot be used to pick k
Elbow method
Used to determine the optimal number of clusters (k) by plotting the inertia scores for different values of k and identifying the "elbow" point where the inertia starts to level off
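A sketch of the elbow plot, assuming scikit-learn and matplotlib are available; X and the range of k are placeholders:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)                               # placeholder data
    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, marker="o")                       # look for the bend (the "elbow")
    plt.xlabel("k (number of clusters)")
    plt.ylabel("Inertia")
    plt.show()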
Silhouette score
A measure that takes into account both the homogeneity (how close instances are within a cluster) and the heterogeneity (how far instances are from other clusters)
Calculating silhouette score
1. For each instance i, compute the homogeneity score a(i) (mean distance to the other instances in its own cluster) and the heterogeneity score b(i) (mean distance to the instances in the nearest other cluster), then combine them as s(i) = (b(i) - a(i)) / max(a(i), b(i))
2. The average of all the s(i) scores is the overall silhouette score for the clustering
When picking between multiple clusterings, choose the one with the highest silhouette score (closest to 1)
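A sketch of that comparison with scikit-learn's silhouette_score (assumed available); X and the candidate values of k are placeholders:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 2)                               # placeholder data
    for k in (2, 3, 4, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))                # keep the k with the highest score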
Time complexity
K-Means: O(nkm) per iteration, where n is the number of instances, m the number of features, and k the number of clusters
Agglomerative clustering: O(n^3m) in a naive implementation, which scales much worse with n
Clustering is an unsupervised learning technique that groups similar instances together without using labeled data
Clustering methods
K-Means
Hierarchical (agglomerative) clustering
Other methods that allow for more flexible cluster boundaries