DATA MINING & WAREHOUSING (CHAPTER 6)

Cards (30)

  • Association
    A technique used in data mining to identify the relationships or co-occurrences between items in a dataset
  • Association analysis
    • Based on the idea of finding the most frequent patterns or itemsets in a dataset, where an itemset is a collection of one or more items
    • Can provide valuable insights into consumer behavior and preferences
    • Can help retailers identify the items that are frequently purchased together, which can be used to optimize product placement and promotions
    • Can help e-commerce websites recommend related products to customers based on their purchase history
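As a minimal sketch of the "frequently purchased together" idea, the snippet below counts how many baskets contain each pair of items. The baskets and the helper name `pair_counts` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for joint placement or "related product" recommendations.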
  • Types of Associations
    • Itemset Associations
    • Sequential Associations
    • Graph-based Associations
  • Itemset Associations
    • The most common type of association analysis, used to discover relationships between items in a dataset
    • A collection of one or more items that frequently co-occur is called a frequent itemset
  • Sequential Associations
    • Used to identify patterns that occur in a specific sequence or order
    • Commonly used in applications such as analyzing customer behavior on e-commerce websites or studying weblogs
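One simple way to illustrate order-sensitive patterns is to count consecutive event pairs in clickstream sessions. This is only a sketch of the idea, not a full sequential pattern mining algorithm; the sessions and the helper `transition_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical clickstream sessions; order within each list matters.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product", "cart", "checkout"],
]

def transition_counts(sequences):
    """Count each ordered pair of consecutive events across all sequences."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

transitions = transition_counts(sessions)
print(transitions.most_common(2))
```

Unlike itemset counting, ("product", "cart") and ("cart", "product") are distinct patterns here.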
  • Graph-based Associations
    • A type of association analysis that involves representing the relationships between items in a dataset as a graph
    • Each item is represented as a node in the graph, and the edges between nodes represent the co-occurrence or relationship between items
    • Used in various applications, such as social network analysis, recommendation systems, and fraud detection
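The node-and-edge representation described above can be sketched with a plain dictionary of dictionaries, where each edge weight is the number of baskets in which two items co-occur. The data and the function name `cooccurrence_graph` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(baskets):
    """Build an undirected weighted graph: graph[a][b] = co-occurrence count."""
    graph = defaultdict(dict)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return graph

# Hypothetical baskets.
graph = cooccurrence_graph([
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
])
print(dict(graph["milk"]))
```

The neighbours of a node, ordered by edge weight, give a simple basis for recommendations.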
  • Association Rule Mining Algorithms
    • Apriori Algorithm
    • FP-Growth Algorithm
    • Eclat Algorithm
  • Apriori Algorithm
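    • Generates frequent itemsets level by level (1-itemsets, then 2-itemsets, and so on), pruning infrequent candidates at each iteration
    • Based on the concept that if an itemset is frequent, then all of its subsets must also be frequent; equivalently, any itemset with an infrequent subset can be discarded without counting it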
    • Computationally expensive, especially for large datasets with many items
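The level-wise generate-and-prune loop can be sketched as follows. This is a compact illustration rather than an optimized implementation; the function name `apriori`, the absolute threshold `min_count`, and the toy data are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data per level to count each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {s: n for s, n in count(items).items() if n >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {s: n for s, n in count(candidates).items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
frequent = apriori(baskets, min_count=2)
```

The repeated dataset pass inside `count` is the main source of the cost noted above.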
  • FP-Growth Algorithm
    • Faster than the Apriori algorithm, especially for large datasets
    • Builds a compact representation of the dataset called a frequent pattern tree (FP-tree), which is used to mine frequent itemsets
    • Scans the dataset only twice: once to count item frequencies and once to build the FP-tree; frequent itemsets are then mined from the tree without candidate generation
    • Can handle datasets with both discrete and continuous attributes, provided continuous attributes are first discretized into items
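The FP-tree construction can be sketched as below. Only the tree-building step is shown; the recursive mining step is omitted. The class `FPNode`, the function `build_fp_tree`, and the threshold `min_count` are hypothetical names for this illustration.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, a count, and child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count in how many transactions each item appears.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: n for i, n in freq.items() if n >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction, items ordered by descending frequency,
    # so that common prefixes share the same path in the tree.
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq

# Hypothetical transactions.
root, freq = build_fp_tree(
    [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b", "e"]],
    min_count=2,
)
```

Here all four transactions share the prefix "b", so they are compressed into a single branch at the top of the tree.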
  • Eclat Algorithm
    • A frequent itemset mining algorithm based on the vertical data format
    • First converts the dataset into a vertical data format, where each item and the transaction ID in which it appears are stored
    • Performs a depth-first search over a tree-like structure of itemsets, intersecting transaction ID sets to find the dataset's frequent itemsets
    • Efficient in both memory usage and runtime, especially for sparse datasets
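A minimal sketch of the vertical format and depth-first search, assuming an absolute threshold `min_count` and a hypothetical function name `eclat`:

```python
from collections import defaultdict

def eclat(transactions, min_count):
    """Return {itemset tuple: count} using tid-set intersections."""
    # Vertical format: item -> set of IDs of the transactions containing it.
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets[item].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that keep prefix + item frequent.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Depth-first: intersect with the tid-sets of the later items.
            nxt = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_count:
                    nxt.append((other, common))
            extend(itemset, nxt)

    start = sorted(
        ((i, t) for i, t in tidsets.items() if len(t) >= min_count),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent

# Hypothetical baskets.
result = eclat(
    [{"bread", "milk"}, {"bread", "butter", "milk"},
     {"butter", "milk"}, {"bread", "butter"}],
    min_count=2,
)
```

The count of an itemset is simply the size of its intersected tid-set, so the original transactions are not rescanned after the conversion.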
  • Correlation Analysis
    • A data mining technique used to identify the degree to which two or more variables are related or associated with each other
    • Measures how changes in one variable are related to changes in another variable
    • Correlation can be positive, negative, or zero, depending on the direction and strength of the relationship between the variables
  • Importance of Correlation Analysis
    • Allows us to measure the strength and direction of the relationship between two or more variables
    • Can help identify patterns and trends in the data, make predictions, and select relevant variables for analysis
    • Helps gain valuable insights into complex systems and make informed decisions based on data-driven analysis
  • Types of Correlation Analysis
    • Pearson Correlation Coefficient
    • Kendall Rank Correlation
    • Spearman Rank Correlation
  • Pearson Correlation Coefficient
    • Measures the linear relationship between two continuous variables
    • Ranges from -1 to +1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and +1 indicates a perfect positive correlation
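The coefficient can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The sketch below assumes two equal-length numeric lists with non-zero variance; the function name `pearson` is chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, increasing
r_down = pearson([1, 2, 3], [3, 2, 1])        # perfectly linear, decreasing
print(round(r_up, 6), round(r_down, 6))
```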
  • Kendall Rank Correlation
    • A non-parametric measure of the association between two ordinal variables
    • Measures the degree of correspondence between the ranking of observations on two variables
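A minimal sketch of Kendall's tau based on counting concordant and discordant pairs. It assumes there are no tied values (the simple "tau-a" form); the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables rank the pair the same way
        elif sign < 0:
            discordant += 1   # the variables rank the pair in opposite ways
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

tau = kendall_tau([1, 2, 3], [1, 3, 2])
print(round(tau, 4))
```

With three observations there are three pairs; two are concordant and one is discordant, giving (2 - 1) / 3.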
  • Spearman Rank Correlation
    • Another non-parametric measure of the relationship between two variables
    • Measures the degree of association between the ranks of two variables
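A minimal sketch of Spearman's coefficient using the rank-difference formula 1 - 6 Σd² / (n(n² - 1)), which is valid only when there are no tied values. The helper names `ranks` and `spearman` are assumptions for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho_monotonic = spearman([10, 20, 30, 40], [1, 4, 9, 16])
rho_swapped = spearman([1, 2, 3], [1, 3, 2])
print(rho_monotonic, rho_swapped)
```

The first example is monotonic but not linear, and still scores 1.0 because only the ranks are compared.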
  • Interpreting Correlation Results
    • A score from +0.5 to +1 indicates a strong positive correlation
    • A score from -0.5 to -1 indicates a strong negative correlation
    • A score of 0 indicates no correlation
  • Benefits of Correlation Analysis
    • Identifying Relationships
    • Prediction
    • Feature Selection
    • Quality Control
  • Use Cases for Correlation Analysis and Association Mining
    • Market Basket Analysis
    • Medical Research
    • Financial Analysis
    • Fraud Detection
  • Association and correlation in data mining are two important techniques that can help uncover relationships and patterns in large datasets
  • Association mining is used to find frequent itemsets, sequential patterns, and graph-based patterns, while correlation analysis measures the strength and direction of relationships between variables (linear for Pearson, rank-based for Kendall and Spearman)
  • Association and correlation in data mining have a wide range of applications, including market basket analysis, medical research, fraud detection, recommender systems, climate research, and financial analysis
  • Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal)
  • Spearman correlation is similar to the Kendall correlation in that it measures the strength of the relationship between two variables measured on a ranked scale. However, Spearman correlation uses the actual numerical ranks of the data instead of counting the number of concordant and discordant pairs.
  • Prediction - Correlation analysis can help predict one variable's values based on another variable's values.
  • Feature Selection - Correlation analysis can also help select the most relevant features for a particular analysis or model.
  • Market Basket Analysis - Association mining is commonly used in retail and e-commerce industries to identify patterns in customer purchase behavior.
  • Medical Research - Correlation analysis is often used in medical research to explore relationships between different variables.
  • Financial Analysis - Correlation analysis is frequently used in financial analysis to measure the strength of relationships between different financial variables.
  • Fraud Detection - Association mining can be used to identify behavior patterns associated with fraudulent activity.