Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
Unsupervised learning
no predefined classes
learning by observations
Supervised learning
learning by examples
A good clustering method will produce high-quality clusters:
High intra-class similarity: cohesive within clusters
Low inter-class similarity: distinctive between clusters
similar (related)
to one another within the same group
dissimilar (unrelated)
to the objects in other groups
What is the two-step process of cluster analysis?
Finding similarities between the data
Grouping similar data objects into clusters
What are the typical applications in Clustering Analysis?
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
When do we use cluster analysis?
•Biology
•Information retrieval
•Land use
•Marketing
•City planning
•Earthquake studies
•Climate
•Economic science
What are the four uses of clustering as a preprocessing tool?
Summarization
Compression
Finding K-nearest Neighbors
Outlier detection
Summarization: preprocessing for regression, PCA, classification, and association analysis
Finding K-nearest Neighbors: localizing search to one or a small number of clusters
Outlier detection: outliers are often viewed as those “far away” from any cluster
What Is Good Clustering?
A good clustering method will produce high quality clusters
High intra-class similarity
cohesive within clusters
Low inter-class similarity
distinctive between clusters
What Is Good Clustering?
cohesive within clusters and distinctive between clusters
Requirements and Challenges
Scalability
Ability to deal with different types of attributes
Constraint-based clustering
Interpretability and usability
Major Clustering Approaches
Partitioning approach
Hierarchical approach
Density-based approach
Grid-based approach
What are typical methods of partitioning approach?
k-means
CLARANS
What are typical methods of Hierarchical approach?
DIANA
AGNES
BIRCH
CHAMELEON
What are typical methods of User-guided or constraint-based?
COD (obstacles)
constrained clustering
What methods use massive links to cluster objects in link-based clustering?
SimRank
LinkClus
Partitioning methods offer several benefits:
speed, scalability, and simplicity.
They are relatively easy to implement and can handle large datasets.
Partitioning methods are also effective at identifying natural clusters within data and can be used in various applications, such as customer segmentation, image segmentation, and anomaly detection.
K-means
is the most popular algorithm in partitioning methods for clustering
K-means
It partitions a dataset into K clusters, where K is a user-defined parameter
K-means definition
It partitions the data into K clusters by assigning each object to the cluster with the nearest centroid (mean), recomputing the centroids from the new assignments, and repeating until the assignments no longer change. The objective is to minimize the within-cluster sum of squared distances between each point and its cluster centroid.
Flowchart of K-Means clustering:
1. Choose K initial centroids (e.g., K random data points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
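The K-means loop (choose centroids, assign points, recompute, repeat) can be sketched in pure Python. This is an illustrative sketch of Lloyd's algorithm, not a production implementation; the function names are my own:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means (Lloyd's algorithm): returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. choose K initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # 2. assign each point to its nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # 3. recompute each centroid as the mean of its assigned points
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append(tuple(sum(xs) / len(members) for xs in zip(*members))
                       if members else centroids[c])
        # 4. stop once the centroids no longer move
        if new == centroids:
            break
        centroids = new
    return centroids, labels
```

For two well-separated groups of points, the loop converges in a few iterations and assigns each group its own label regardless of the random initialization.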
Advantages of K-Means
1. Scalability
K-means is a scalable algorithm that can handle large datasets with high dimensionality. This is because it only requires calculating the distances between data points and their assigned cluster centroids.
Advantages of K-Means
2. Speed
K-means is a relatively fast algorithm, making it suitable for real-time or near-real-time applications. It can handle datasets with millions of data points and converge to a solution in a few iterations.
Advantages of K-Means
3. Simplicity
K-means is a simple algorithm to implement and understand. It only requires specifying the number of clusters and the initial centroids, and it iteratively refines the clusters' centroids until convergence
Advantages of K-Means
4. Interpretability
K-means provides interpretable results, as the clusters' centroids represent the center points of the clusters. This makes the clustering results easy to interpret and understand.
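A small sketch of this interpretability, using hypothetical customer-segment data: because a centroid is just the per-feature mean, it reads directly as a "typical member" of the cluster.

```python
# Hypothetical customer segment: each point is (age, monthly_spend)
segment = [(25, 200.0), (30, 250.0), (28, 220.0)]

# The centroid is the per-feature mean, so it can be read as a profile
# of the segment (a "typical" age and spend for this cluster)
centroid = tuple(sum(feature) / len(segment) for feature in zip(*segment))
print(centroid)  # an average age of ~27.7 and spend of ~223.3
```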
Disadvantages of K-Means
1. Curse of dimensionality
K-means is prone to the curse of dimensionality, which refers to the problem of high-dimensional data spaces. In high-dimensional spaces, the distance between any two data points becomes almost the same, making it difficult to differentiate between clusters
Disadvantages of K-Means
2. User-defined K
K-means requires the user to specify the number of clusters (K) beforehand. This can be challenging if the user does not have prior knowledge of the data or if the optimal number of clusters is unknown.
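One common heuristic for this problem is the elbow method: run K-means for several values of K and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. Below is a minimal 1-D sketch with illustrative data and a deliberately simplified initialization (evenly spaced starting centroids), not a general-purpose implementation:

```python
def inertia(points, k, iters=50):
    """Within-cluster sum of squares after a basic 1-D K-means run."""
    cent = points[::max(1, len(points) // k)][:k]  # spread-out initial centroids
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: (p - cent[c]) ** 2)].append(p)
        cent = [sum(g) / len(g) if g else cent[i] for i, g in enumerate(groups)]
    return sum(min((p - c) ** 2 for c in cent) for p in points)

# Three well-separated 1-D groups: WCSS drops sharply up to K = 3
# (the "elbow"), then levels off for larger K
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
wcss = {k: inertia(data, k) for k in range(1, 6)}
print(wcss)
```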
Disadvantages of K-Means
3. Non-convex cluster shapes
K-means assumes that the clusters are spherical, which means it cannot handle datasets with non-convex shape clusters. In such cases, other clustering algorithms, such as hierarchical clustering or DBSCAN, may be more suitable
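A quick way to see why the spherical assumption fails: take two concentric rings, a classic non-convex case. The mean of each ring sits at the origin, so the centroids coincide and distance-to-centroid cannot separate the rings (illustrative data; helper names are my own):

```python
import math

def ring(radius, n=100):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

# Two distinct, non-convex clusters whose means coincide at the origin:
# a centroid-based method like K-means cannot tell them apart
inner, outer = ring(1.0), ring(5.0)
ci, co = mean(inner), mean(outer)
print(ci, co)  # both ~ (0, 0)
```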
Disadvantages of K-Means
4. Unable to handle noisy data
K-means is sensitive to noisy data and outliers, which can significantly affect the clustering results. Preprocessing techniques, such as outlier detection or noise reduction, may be required to address this issue.
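A tiny illustration of this sensitivity (with made-up numbers): a centroid is a mean, so a single outlier drags it far away from every genuine cluster member.

```python
def centroid(points):
    """Centroid of a 1-D cluster: the arithmetic mean."""
    return sum(points) / len(points)

cluster = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8]   # a tight cluster around 1.0
with_outlier = cluster + [100.0]            # the same cluster plus one outlier

print(centroid(cluster))       # 1.0: a faithful cluster center
print(centroid(with_outlier))  # ~15.1: dragged far from every real member
```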