Cluster Analysis

Cards (56)

  • Cluster
    A collection of data objects
  • Cluster analysis
    Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.
  • Unsupervised learning
    • no predefined classes
    • learning by observations
  • Supervised learning
    learning by examples
  • A good clustering method will produce high-quality clusters:
    • High intra-class similarity: cohesive within clusters
    • Low inter-class similarity: distinctive between clusters
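These two criteria can be checked numerically. A minimal sketch, with toy points and labels made up for illustration:

```python
import numpy as np

# Toy 2-D points in two well-separated groups (made up for illustration).
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

def mean_pairwise_distance(xs):
    """Average Euclidean distance over all distinct pairs of rows."""
    diffs = xs[:, None, :] - xs[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(xs)
    return d.sum() / (n * (n - 1))   # the zero diagonal contributes nothing

# Cohesion: average distance *within* each cluster (should be small).
cohesion = np.mean([mean_pairwise_distance(points[labels == k]) for k in (0, 1)])

# Separation: distance *between* the cluster centroids (should be large).
separation = np.linalg.norm(points[labels == 0].mean(axis=0)
                            - points[labels == 1].mean(axis=0))

print(cohesion, separation)   # good clustering: cohesion well below separation
```

High intra-class similarity corresponds to a small `cohesion` value; low inter-class similarity corresponds to a large `separation` value.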
  • similar (related)
    to one another within the same group
  • dissimilar (unrelated)
    to the objects in other groups
  • What is the two-step process of cluster analysis?
    1. Finding similarities between the data
    2. Grouping similar data objects into clusters
  • What are the typical applications of cluster analysis?
    • stand-alone tool
    • preprocessing step
  • stand-alone tool
    to get insight into data distribution
  • preprocessing step
    for other algorithms
  • When do we use cluster analysis?
    Biology
    Information retrieval
    Land use
    Marketing
    City-planning
    Earth-quake studies
    Climate
    Economic Science
  • What are the four preprocessing uses of clustering?
    1. Summarization
    2. Compression
    3. Finding K-nearest Neighbors
    4. Outlier detection
  • Preprocessing for regression, PCA, classification, and association analysis
    Summarization
  • Localizing search to one or a small number of clusters
    Finding K-nearest Neighbors
  • Outliers are often viewed as those “far away” from any cluster
    Outlier detection
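Clustering-based outlier detection can be sketched as flagging points whose distance to the nearest cluster centroid is unusually large. The data, centroids, and threshold rule below are illustrative assumptions, not from the source:

```python
import numpy as np

# Toy clustered data plus one point "far away" from any cluster.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.2],
              [20.0, 20.0]])                       # the outlier
centroids = np.array([[0.1, 0.1], [5.0, 5.1]])     # assume clusters were found already

# Distance from each point to its nearest cluster centroid.
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).min(axis=1)

# Flag points whose nearest-centroid distance is far above the typical one
# (mean + 2 standard deviations is an arbitrary but common rule of thumb).
threshold = d.mean() + 2 * d.std()
outliers = np.where(d > threshold)[0]
print(outliers)
```

Only the last point lies far from both centroids, so only its index is flagged.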
  • What Is Good Clustering?
    A good clustering method will produce high-quality clusters
  • High intra-class similarity
    cohesive within clusters
  • Low inter-class similarity
    distinctive between clusters
  • What Is Good Clustering?
    cohesive within clusters and distinctive between clusters
  • Requirements and Challenges
    1. Scalability
    2. Ability to deal with different types of attributes
    3. Constraint-based clustering
    4. Interpretability and usability
  • Major Clustering Approaches
    1. Partitioning approach
    2. Hierarchical approach
    3. Density-based approach
    4. Grid-based approach
  • What are typical methods of the partitioning approach?
    1. k-means
    2. CLARANS
  • What are typical methods of the hierarchical approach?
    1. DIANA
    2. AGNES
    3. BIRCH
    4. CHAMELEON
  • What are typical methods of user-guided or constraint-based clustering?
    1. COD (obstacles)
    2. constrained clustering
  • In link-based clustering, which methods use massive links to cluster objects?
    1. SimRank
    2. LinkClus
  • Partitioning methods offer several benefits:
    1. Speed, scalability, and simplicity.
    2. They are relatively easy to implement and can handle large datasets.
    3. They are effective at identifying natural clusters within data and can be used in applications such as customer segmentation, image segmentation, and anomaly detection.
  • K-means
    is the most popular partitioning algorithm for clustering
  • K-means
    It partitions a dataset into K clusters, where K is a user-defined parameter
  • K-means definition
    K-means partitions the data into K clusters by assigning each object to the cluster with the nearest centroid (the mean of the cluster's members) and repeatedly recomputing the centroids until the assignments no longer change, thereby minimizing the within-cluster sum of squared distances.
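The K-means procedure can be sketched as a plain Lloyd's-algorithm loop; the toy data, function name, and parameters below are illustrative, not from the source:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious blobs: the algorithm should separate them into two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Each iteration performs the two steps named in the cards above: finding similarities (distances to centroids) and grouping similar objects (assignment to the nearest centroid).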
  • Flowchart of K-Means clustering
    (diagram not included)
  • Advantages of K-Means
    1. Scalability
    K-means is a scalable algorithm that can handle large datasets with high dimensionality. This is because it only requires calculating the distances between data points and their assigned cluster centroids.
  • Advantages of K-Means
    2. Speed
    K-means is a relatively fast algorithm, making it suitable for real-time or near-real-time applications. It can handle datasets with millions of data points and converge to a solution in a few iterations.
  • Advantages of K-Means
    3. Simplicity
    K-means is a simple algorithm to implement and understand. It only requires specifying the number of clusters and the initial centroids, and it iteratively refines the clusters' centroids until convergence
  • Advantages of K-Means
    4. Interpretability
    K-means provides interpretable results, as the cluster centroids represent the center points of the clusters. This makes the clustering results easy to interpret and understand.
  • Disadvantages of K-Means
    1. Curse of dimensionality
    K-means is prone to the curse of dimensionality, which refers to the problem of high-dimensional data spaces. In high-dimensional spaces, the distance between any two data points becomes almost the same, making it difficult to differentiate between clusters
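The distance-concentration effect can be demonstrated directly: for random points, the relative gap between the nearest and farthest pairwise distance shrinks as the dimension grows. A small sketch (the point count, dimensions, and spread measure are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=200):
    """Relative contrast (max - min) / min over all pairwise distances."""
    X = rng.random((n, dim))
    # Pairwise squared distances via the dot-product identity (memory-friendly).
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    d = np.sqrt(d2[np.triu_indices(n, k=1)])   # distinct pairs only
    return (d.max() - d.min()) / d.min()

# In low dimensions nearest and farthest pairs differ enormously;
# in high dimensions all pairwise distances become almost the same.
low, high = distance_spread(2), distance_spread(1000)
print(low, high)
```

The shrinking contrast in high dimensions is exactly what makes nearest-centroid assignment (and hence K-means) less discriminative there.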
  • Disadvantages of K-Means
    2. User-defined K
    K-means requires the user to specify the number of clusters (K) beforehand. This can be challenging if the user does not have prior knowledge of the data or if the optimal number of clusters is unknown.
  • Disadvantages of K-Means
    3. Non-convex shape clusters
    K-means assumes that the clusters are spherical, which means it cannot handle datasets with non-convex shape clusters. In such cases, other clustering algorithms, such as hierarchical clustering or DBSCAN, may be more suitable
  • Disadvantages of K-Means
    4. Unable to handle noisy data
    K-means is sensitive to noisy data and outliers, which can significantly affect the clustering results. Preprocessing techniques, such as outlier detection or noise reduction, may be required to address this issue.
  • Difference between K-medoids and K-means
    • K-means uses centroids: the mean of a cluster's points, which need not be an actual data point.
    • K-medoids uses medoids: actual data points chosen as cluster representatives, which makes it more robust to outliers.
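The difference is easy to see on a tiny cluster containing one extreme value (the points below are made up): the centroid is dragged toward the extreme, while the medoid must be an actual data point and stays with the majority.

```python
import numpy as np

# One small cluster plus one extreme value (illustrative data).
cluster = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 0.0]])

# K-means representative: the centroid (mean), which need not be a data
# point and is pulled toward the extreme value.
centroid = cluster.mean(axis=0)

# K-medoids representative: the actual data point minimizing the total
# distance to all other points in the cluster.
d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid = cluster[d.sum(axis=1).argmin()]

print(centroid, medoid)
```

Here the centroid lands at (3.4, 0), far from all three points, while the medoid stays at (0.2, 0), near the two typical points.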