AI Finals

Cards (50)

  • Clustering
    Learning patterns from unlabeled data
  • Types of learning
    • Supervised Learning
    • Unsupervised Learning
    • Reinforcement Learning
  • Unsupervised Learning
    Agent learns patterns from input without feedback (unlabeled data)
  • Clustering
    • Input: Unlabeled dataset
    • Output: Sets of similar data (based on defined criteria)
    • Useful for discovering segments in data and applying different business strategies for each segment
  • Clustering learns patterns and derives groups (clusters) of similar instances
  • Clustering methods
    • K-Means clustering
    • Hierarchical clustering
    • Gaussian Mixture Models
    • Spectral clustering
  • K-Means Clustering
    1. Specify k: target number of clusters
    2. Select k random instances as initial centroids
    3. Repeat: assign each instance to the closest centroid, then update the centroids
    4. Stop when cluster memberships stay the same (see the sketch below)
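A minimal sketch of these four steps, assuming scikit-learn is available (the toy data and parameter choices are made up for illustration):

```python
# Sketch: K-Means on toy 2-D data with scikit-learn (data is made up).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Step 1: specify k. Steps 2-4 (random init, assign, update, stop) run inside fit().
km = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(X)

print(km.labels_)           # cluster membership of each instance
print(km.cluster_centers_)  # final centroids
```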
  • K-Means Clustering
    • Instances in the same cluster should be close to each other
  • Hierarchical Clustering
    • Dendrogram: A tree diagram depicting closeness through its branches
    • Derive clusters from generating a dendrogram
  • Branches in a hierarchical clustering dendrogram show which instances/groups are close to each other
  • Instances are represented as terminal nodes in a hierarchical clustering dendrogram
  • Branches connect nodes/subgroups at different levels in a hierarchical clustering dendrogram
  • Clusters can be derived from a hierarchical clustering dendrogram
  • Instances (in a dendrogram)
    Terminal nodes
  • Hierarchical Clustering
    • Branches connect nodes/subgroups at different levels
    • Clusters can be derived from a dendrogram
  • Branches show which instances/groups are close to each other
  • Dendrogram
    Allows deriving a clustering for any number of clusters k
  • Deriving k clusters from a dendrogram
    Draw a horizontal line that intersects exactly k branches; each branch it cuts becomes one cluster (see the sketch below)
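A sketch of this cut done programmatically, assuming SciPy is available (the toy data and k are made up):

```python
# Sketch: build a dendrogram structure and derive k clusters from it (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])

Z = linkage(X, method="centroid")                # dendrogram structure
labels = fcluster(Z, t=3, criterion="maxclust")  # the "horizontal line": leave k = 3 clusters
print(labels)
```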
  • Agglomerative Clustering
    • Bottom-up approach
    • Start by treating each instance as its own cluster
    • Repeatedly merge the two closest clusters
  • When a cluster contains more than one point, use its centroid when computing distances between clusters (see the sketch below)
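A hand-rolled sketch of the bottom-up merge loop, using centroids for multi-point clusters (the toy data and target number of clusters are made up):

```python
# Sketch: agglomerative clustering with centroid-based distances (toy data).
import numpy as np

points = [np.array(p, dtype=float) for p in [[0, 0], [0, 1], [5, 5], [5, 6]]]
clusters = [[p] for p in points]      # start: each instance is its own cluster

def centroid(cluster):
    return np.mean(cluster, axis=0)   # multi-point clusters are represented by their centroid

while len(clusters) > 2:              # merge until k = 2 clusters remain
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    # find and merge the two closest clusters (by centroid distance)
    i, j = min(pairs, key=lambda p: np.linalg.norm(centroid(clusters[p[0]]) - centroid(clusters[p[1]])))
    clusters[i] += clusters.pop(j)

print([centroid(c) for c in clusters])
```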
  • Other Clustering Methods
    • K-Means produces spherical cluster boundaries
    • Other methods can handle non-spherical boundaries
  • Other Clustering Methods
    • Gaussian Mixture Models
    • Spectral Clustering
  • Homogeneity
    How similar the instances are within a cluster
  • Heterogeneity
    How different the instances are across different clusters
  • Clustering Evaluation
    • Inertia: Measures homogeneity (within-cluster sum of squares)
    • Silhouette score: Measures and weighs both homogeneity and heterogeneity
  • Computing Inertia
    1. For each cluster, determine the cluster centroid
    2. For each instance, square its distance from its cluster centroid
    3. Sum all these squares
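A sketch of these three steps in NumPy (the toy data and cluster labels are made up):

```python
# Sketch: computing inertia by hand (toy data and labels).
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 4.0], [10.0, 2.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

inertia = 0.0
for k in np.unique(labels):
    members = X[labels == k]
    c = members.mean(axis=0)               # step 1: cluster centroid
    inertia += ((members - c) ** 2).sum()  # steps 2-3: sum of squared distances
print(inertia)  # 4.0: each point lies 1 unit from its centroid
```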
  • Inertia should improve (decrease) at each stage of K-Means clustering
  • The clustering at each stage of K-Means should improve as we do more iterations, but this is not always the case
  • K-Means
    A clustering algorithm that partitions the data into k clusters
  • K-Means algorithm (iteration 1)
    1. Assign each instance to the cluster of the closest centroid
    2. For each cluster, get the mean for each feature and set the new centroid (see the sketch below)
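A sketch of one such iteration in NumPy (the toy data and initial centroids are made up):

```python
# Sketch: a single K-Means iteration (assign, then update centroids).
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]])
centroids = X[[0, 2]]  # two instances chosen as initial centroids

# Step 1: assign each instance to the closest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
assign = dists.argmin(axis=1)

# Step 2: new centroid = per-feature mean of each cluster's members
centroids = np.array([X[assign == k].mean(axis=0) for k in range(len(centroids))])
print(assign, centroids)
```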
  • Inertia
    A measure of the compactness of the clusters, calculated as the sum of squared distances between each instance and its assigned cluster centroid
  • Inertia
    Ideally, the smaller the inertia value, the better the clustering
  • Inertia (extreme case)
    • If the number of clusters equals the number of instances, each instance is its own centroid, so the inertia score is 0, yet this is not a good clustering
  • Elbow method
    Used to determine the optimal number of clusters (k) by plotting the inertia scores for different values of k and identifying the "elbow" point where the inertia starts to level off
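A sketch of the elbow plot, assuming scikit-learn and matplotlib are available (the data is randomly generated for illustration):

```python
# Sketch: plot inertia against k and look for the elbow.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)  # made-up data

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")  # the "elbow" is where the curve levels off
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()
```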
  • Silhouette score
    A measure that takes into account both the homogeneity (how close instances are within a cluster) and the heterogeneity (how far instances are from other clusters)
  • Calculating silhouette score
    1. For each instance, calculate the homogeneity score a(i) (mean distance to the other instances in its own cluster) and the heterogeneity score b(i) (mean distance to the instances in the nearest other cluster), then combine them: s(i) = (b(i) - a(i)) / max(a(i), b(i))
    2. The average of all the s(i) scores is the overall silhouette score for the clustering
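A sketch using scikit-learn's silhouette_score, which performs exactly this averaging (the toy data is made up):

```python
# Sketch: compare clusterings for several k by average silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(100, 2)

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # averages s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all instances
    print(k, silhouette_score(X, labels))
```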
  • When picking between multiple clusterings, choose the one with the highest silhouette score (closest to 1)
  • Time complexity
    • K-Means: O(n^2m), where n is the number of instances and m is the number of features
    • Agglomerative clustering: O(n^3m)
  • Clustering is an unsupervised learning technique that groups similar instances together without using labeled data
  • Clustering methods
    • K-means
    • Hierarchical (agglomerative) clustering
    • Other methods that allow for more flexible cluster boundaries