Cluster Analysis

Cards (56)

  • Cluster
    A collection of data objects
  • Cluster analysis
    Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.
  • Unsupervised learning
    • no predefined classes
    • learning by observations
  • Supervised learning
    learning by examples
  • A good clustering method will produce high-quality clusters:
    • High intra-class similarity: cohesive within clusters
    • Low inter-class similarity: distinctive between clusters
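These two criteria can be checked numerically. A minimal sketch, with toy points and labels made up for illustration:

```python
import numpy as np

# Toy 2-D points in two well-separated groups (made up for illustration).
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

def mean_pairwise_distance(xs):
    """Average Euclidean distance over all distinct pairs of rows."""
    diffs = xs[:, None, :] - xs[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(xs)
    return d.sum() / (n * (n - 1))   # the zero diagonal contributes nothing

# Cohesion: average distance *within* each cluster (should be small).
cohesion = np.mean([mean_pairwise_distance(points[labels == k]) for k in (0, 1)])

# Separation: distance *between* the cluster centroids (should be large).
separation = np.linalg.norm(points[labels == 0].mean(axis=0)
                            - points[labels == 1].mean(axis=0))

print(cohesion, separation)   # good clustering: cohesion well below separation
```

High intra-class similarity corresponds to a small `cohesion` value; low inter-class similarity corresponds to a large `separation` value.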
  • similar (related)
    to one another within the same group
  • dissimilar (unrelated)
    to the objects in other groups
  • What is the two-step process of cluster analysis?
    1. Finding similarities between the data
    2. Grouping similar data objects into clusters
  • What are the typical applications of cluster analysis?
    • stand-alone tool
    • preprocessing step
  • stand-alone tool
    to get insight into data distribution
  • preprocessing step
    for other algorithms
  • When do we use cluster analysis?
    Biology
    Information retrieval
    Land use
    Marketing
    City-planning
    Earth-quake studies
    Climate
    Economic Science
  • What are the four preprocessing uses of clustering?
    1. Summarization
    2. Compression
    3. Finding K-nearest Neighbors
    4. Outlier detection
  • Preprocessing for regression, PCA, classification, and association analysis
    Summarization
  • Localizing search to one or a small number of clusters
    Finding K-nearest Neighbors
  • Outliers are often viewed as those “far away” from any cluster
    Outlier detection
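Clustering-based outlier detection can be sketched as flagging points whose distance to the nearest cluster centroid is unusually large. The data, centroids, and threshold rule below are illustrative assumptions, not from the source:

```python
import numpy as np

# Toy clustered data plus one point "far away" from any cluster.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.2],
              [20.0, 20.0]])                       # the outlier
centroids = np.array([[0.1, 0.1], [5.0, 5.1]])     # assume clusters were found already

# Distance from each point to its nearest cluster centroid.
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).min(axis=1)

# Flag points whose nearest-centroid distance is far above the typical one
# (mean + 2 standard deviations is an arbitrary but common rule of thumb).
threshold = d.mean() + 2 * d.std()
outliers = np.where(d > threshold)[0]
print(outliers)
```

Only the last point lies far from both centroids, so only its index is flagged.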
  • What Is Good Clustering?
    A good clustering method will produce high-quality clusters
  • High intra-class similarity
    cohesive within clusters
  • Low inter-class similarity
    distinctive between clusters
  • What Is Good Clustering?
    cohesive within clusters and distinctive between clusters
  • Requirements and Challenges
    1. Scalability
    2. Ability to deal with different types of attributes
    3. Constraint-based clustering
    4. Interpretability and usability
  • Major Clustering Approaches
    1. Partitioning approach
    2. Hierarchical approach
    3. Density-based approach
    4. Grid-based approach
  • What are typical methods of the partitioning approach?
    1. k-means
    2. CLARANS
  • What are typical methods of the hierarchical approach?
    1. DIANA
    2. AGNES
    3. BIRCH
    4. CHAMELEON
  • What are typical methods of user-guided or constraint-based clustering?
    1. COD (obstacles)
    2. constrained clustering
  • In link-based clustering, which methods use massive links to cluster objects?
    1. SimRank
    2. LinkClus
  • Partitioning methods offer several benefits:
    1. Speed, scalability, and simplicity.
    2. They are relatively easy to implement and can handle large datasets.
    3. They are effective at identifying natural clusters within data and can be used in applications such as customer segmentation, image segmentation, and anomaly detection.
  • K-means
    is the most popular partitioning algorithm for clustering
  • K-means
    It partitions a dataset into K clusters, where K is a user-defined parameter
  • K-means definition
    K-means partitions the data into K clusters by assigning each object to the cluster with the nearest centroid (the mean of the cluster's members) and repeatedly recomputing the centroids until the assignments no longer change, thereby minimizing the within-cluster sum of squared distances.
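The K-means procedure can be sketched as a plain Lloyd's-algorithm loop; the toy data, function name, and parameters below are illustrative, not from the source:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious blobs: the algorithm should separate them into two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Each iteration performs the two steps named in the cards above: finding similarities (distances to centroids) and grouping similar objects (assignment to the nearest centroid).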
  • Flowchart of K-Means clustering
    (diagram not included)
  • Advantages of K-Means
    1. Scalability
    K-means is a scalable algorithm that can handle large datasets with high dimensionality. This is because it only requires calculating the distances between data points and their assigned cluster centroids.
  • Advantages of K-Means
    2. Speed
    K-means is a relatively fast algorithm, making it suitable for real-time or near-real-time applications. It can handle datasets with millions of data points and converge to a solution in a few iterations.
  • Advantages of K-Means
    3. Simplicity
    K-means is a simple algorithm to implement and understand. It only requires specifying the number of clusters and the initial centroids, and it iteratively refines the clusters' centroids until convergence
  • Advantages of K-Means
    4. Interpretability
    K-means provides interpretable results, as the cluster centroids represent the center points of the clusters. This makes the clustering results easy to interpret and understand.
  • Disadvantages of K-Means
    1. Curse of dimensionality
    K-means is prone to the curse of dimensionality, which refers to the problem of high-dimensional data spaces. In high-dimensional spaces, the distance between any two data points becomes almost the same, making it difficult to differentiate between clusters
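The distance-concentration effect can be demonstrated directly: for random points, the relative gap between the nearest and farthest pairwise distance shrinks as the dimension grows. A small sketch (the point count, dimensions, and spread measure are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=200):
    """Relative contrast (max - min) / min over all pairwise distances."""
    X = rng.random((n, dim))
    # Pairwise squared distances via the dot-product identity (memory-friendly).
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    d = np.sqrt(d2[np.triu_indices(n, k=1)])   # distinct pairs only
    return (d.max() - d.min()) / d.min()

# In low dimensions nearest and farthest pairs differ enormously;
# in high dimensions all pairwise distances become almost the same.
low, high = distance_spread(2), distance_spread(1000)
print(low, high)
```

The shrinking contrast in high dimensions is exactly what makes nearest-centroid assignment (and hence K-means) less discriminative there.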
  • Disadvantages of K-Means
    2. User-defined K
    K-means requires the user to specify the number of clusters (K) beforehand. This can be challenging if the user does not have prior knowledge of the data or if the optimal number of clusters is unknown.
  • Disadvantages of K-Means
    3. Non-convex shape clusters
    K-means assumes that the clusters are spherical, which means it cannot handle datasets with non-convex shape clusters. In such cases, other clustering algorithms, such as hierarchical clustering or DBSCAN, may be more suitable
  • Disadvantages of K-Means
    4. Unable to handle noisy data
    K-means is sensitive to noisy data and outliers, which can significantly affect the clustering results. Preprocessing techniques, such as outlier detection or noise reduction, may be required to address this issue.
  • Difference between K-medoids and K-means
    • K-means uses centroids: the mean of a cluster's points, which need not be an actual data point.
    • K-medoids uses medoids: actual data points chosen as cluster representatives, which makes it more robust to outliers.
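The difference is easy to see on a tiny cluster containing one extreme value (the points below are made up): the centroid is dragged toward the extreme, while the medoid must be an actual data point and stays with the majority.

```python
import numpy as np

# One small cluster plus one extreme value (illustrative data).
cluster = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 0.0]])

# K-means representative: the centroid (mean), which need not be a data
# point and is pulled toward the extreme value.
centroid = cluster.mean(axis=0)

# K-medoids representative: the actual data point minimizing the total
# distance to all other points in the cluster.
d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid = cluster[d.sum(axis=1).argmin()]

print(centroid, medoid)
```

Here the centroid lands at (3.4, 0), far from all three points, while the medoid stays at (0.2, 0), near the two typical points.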