DATA MINING & WAREHOUSING (CHAPTER 6)

Created by

Nae

Cards (30)

Association 
A technique used in data mining to identify the relationships or co-occurrences between items in a dataset
Association analysis
Based on the idea of finding the most frequent patterns or itemsets in a dataset, where an itemset is a collection of one or more items
Can provide valuable insights into consumer behavior and preferences
Can help retailers identify the items that are frequently purchased together, which can be used to optimize product placement and promotions
Can help e-commerce websites recommend related products to customers based on their purchase history
Types of Associations
Itemset Associations
Sequential Associations
Graph-based Associations
Itemset Associations 
The most common type of association analysis, used to discover relationships between items in a dataset
A collection of one or more items that frequently co-occur is called a frequent itemset
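As a rough illustration of how co-occurrence can be counted, the sketch below computes the share of transactions that contain each pair of items (usually called the support of the pair). The transactions and the function name `pair_support` are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for t in transactions:
        # Sort so every pair is stored under one canonical key.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

support = pair_support(transactions)
```

A pair would then count as frequent if its support reaches whatever minimum threshold the analyst chooses.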
Sequential Associations 
Used to identify patterns that occur in a specific sequence or order
Commonly used in applications such as analyzing customer behavior on e-commerce websites or studying weblogs
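A minimal sketch of the idea that order matters: it counts in how many sessions one page is visited before another. The sessions and the name `ordered_pair_counts` are hypothetical, and real sequential pattern miners handle longer sequences and time gaps.

```python
from collections import Counter

# Hypothetical clickstream sessions, invented for illustration only.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product", "home"],
]

def ordered_pair_counts(sessions):
    """Count the sessions in which item a appears somewhere before item b."""
    counts = Counter()
    for s in sessions:
        seen = set()  # each ordered pair is counted once per session
        for i, a in enumerate(s):
            for b in s[i + 1:]:
                if a != b:
                    seen.add((a, b))
        counts.update(seen)
    return counts

counts = ordered_pair_counts(sessions)
```

Unlike itemset counting, `("product", "cart")` and `("cart", "product")` are tracked separately here.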
Graph-based Associations 
A type of association analysis that involves representing the relationships between items in a dataset as a graph
Each item is represented as a node in the graph, and the edges between nodes represent the co-occurrence or relationship between items
Used in various applications, such as social network analysis, recommendation systems, and fraud detection
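One possible way to build such a graph is a nested dictionary in which each item is a node and each edge weight is the number of transactions containing both items. The data and the name `cooccurrence_graph` are assumptions for this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def cooccurrence_graph(transactions):
    """Build an undirected weighted graph: graph[a][b] = co-occurrence count."""
    graph = defaultdict(lambda: defaultdict(int))
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            # Store the edge in both directions because it is undirected.
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

graph = cooccurrence_graph(transactions)
neighbors_of_bread = sorted(graph["bread"])
```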
Association Rule Mining Algorithms
Apriori Algorithm
FP-Growth Algorithm
Eclat Algorithm
Apriori Algorithm 
Generates frequent item sets from a given dataset by pruning infrequent item sets iteratively
Based on the concept that if an item set is frequent, then all of its subsets must also be frequent
Computationally expensive, especially for large datasets with many items
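The level-by-level pruning can be sketched roughly as follows. This is a simplified, unoptimized version; the function name, the `min_count` threshold parameter, and the toy data are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
frequent = apriori(transactions, min_count=2)
```

The repeated passes over `transactions`, one per level, are the reason the algorithm becomes expensive on large datasets.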
FP-Growth Algorithm
Faster than the Apriori algorithm, especially for large datasets
Builds a compact representation of the dataset called a frequent pattern tree (FP-tree), which is used to mine frequent item sets
Scans the dataset only twice: once to count item frequencies and once to build the FP-tree; the frequent itemsets are then mined from the tree rather than from the dataset
Works on discrete items; continuous attributes generally need to be discretized (binned) before mining
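The sketch below covers only the tree-building part, assuming a first pass that counts items and a second pass that inserts each transaction into the tree. The class `FPNode`, the function `build_fp_tree`, and the data are hypothetical, and the mining step (conditional pattern bases) and header-table links are omitted.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, a count, and child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep only the frequent ones.
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None)
    # Pass 2: insert each transaction, most frequent items first,
    # so that transactions sharing a prefix share a path in the tree.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, freq

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
root, freq = build_fp_tree(transactions, min_count=2)
```

Three of the four transactions start with `bread`, so they share a single `bread` node, which is where the compactness comes from.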
Eclat Algorithm 
A frequent itemset mining algorithm based on the vertical data format
First converts the dataset into a vertical data format, where each item is stored together with the list of transaction IDs in which it appears
Performs a depth-first search on a tree-like structure, representing the dataset's frequent itemsets
Efficient regarding both memory usage and runtime, especially for sparse datasets
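A small sketch of the vertical approach, assuming transaction-ID sets are kept as Python sets and intersected during a depth-first search. The function name `eclat`, the `min_count` parameter, and the data are made up for illustration.

```python
def eclat(transactions, min_count):
    """Return {itemset tuple: count} using transaction-ID set intersections."""
    # Vertical format: item -> set of IDs of the transactions containing it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Depth-first: grow the current prefix by one item at a time.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            nxt = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids  # support via intersection
                if len(common) >= min_count:
                    nxt.append((other, common))
            if nxt:
                extend(itemset, nxt)

    start = sorted(
        ((i, t) for i, t in tidsets.items() if len(t) >= min_count),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
frequent = eclat(transactions, min_count=2)
```

After the initial conversion the original transactions are never rescanned; every support count comes from intersecting ID sets.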
Correlation Analysis 
A data mining technique used to identify the degree to which two or more variables are related or associated with each other
Measures how changes in one variable are related to changes in another variable
Correlation can be positive, negative, or zero, depending on the direction and strength of the relationship between the variables
Importance of Correlation Analysis
Allows us to measure the strength and direction of the relationship between two or more variables
Can help identify patterns and trends in the data, make predictions, and select relevant variables for analysis
Helps gain valuable insights into complex systems and make informed decisions based on data-driven analysis
Types of Correlation Analysis
Pearson Correlation Coefficient
Kendall Rank Correlation
Spearman Rank Correlation
Pearson Correlation Coefficient
Measures the linear relationship between two continuous variables
Ranges from -1 to +1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and +1 indicates a perfect positive linear correlation
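For reference, the coefficient can be computed from its standard definition, covariance divided by the product of the standard deviations. The function name `pearson` is chosen for this sketch, and it assumes both inputs have the same length and non-zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)  # raises ZeroDivisionError if a variable is constant

r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
```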
Kendall Rank Correlation 
A non-parametric measure of the association between two ordinal variables
Measures the degree of correspondence between the ranking of observations on two variables
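A rough sketch of the simplest variant (often called tau-a), which compares every pair of observations and checks whether the two variables order them the same way. The function name is an assumption, and tied pairs are simply ignored rather than corrected for.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # both variables rank the pair the same way
        elif s < 0:
            discordant += 1   # the variables disagree on the order
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

tau_same = kendall_tau([1, 2, 3], [1, 2, 3])
tau_reversed = kendall_tau([1, 2, 3], [3, 2, 1])
tau_mixed = kendall_tau([1, 2, 3], [1, 3, 2])
```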
Spearman Rank Correlation 
Another non-parametric measure of the relationship between two variables
Measures the degree of association between the ranks of two variables
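One common way to compute it, sketched here under the assumption that tied values receive the average of their ranks, is to rank each variable and then apply the Pearson formula to the ranks. The helper names `ranks` and `spearman` are made up for this example.

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

rho = spearman([1, 2, 3, 4], [1, 4, 9, 16])
```

The second variable above grows non-linearly but always in the same direction as the first, which is the kind of relationship rank-based measures are meant to capture.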
Interpreting Correlation Results
A score from +0.5 to +1 indicates a strong positive correlation
A score from -0.5 to -1 indicates a strong negative correlation
A score of 0 indicates no correlation
Benefits of Correlation Analysis
Identifying Relationships
Prediction
Feature Selection
Quality Control
Use Cases for Correlation Analysis and Association Mining
Market Basket Analysis
Medical Research
Financial Analysis
Fraud Detection
Association and correlation in data mining are two important techniques that can help uncover relationships and patterns in large datasets
Association mining is used to find frequent itemsets, sequential patterns, and graph-based patterns, while correlation analysis measures the strength and direction of relationships between variables (linear for Pearson, rank-based for Kendall and Spearman)
Association and correlation in data mining have a wide range of applications, including market basket analysis, medical research, fraud detection, recommender systems, climate research, and financial analysis
Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal)
Spearman correlation is similar to the Kendall correlation in that it measures the strength of the relationship between two variables measured on a ranked scale. However, Spearman correlation uses the actual numerical ranks of the data instead of counting the number of concordant and discordant pairs.
Prediction - Correlation analysis can help predict one variable's values based on another variable's values.
Feature Selection - Correlation analysis can also help select the most relevant features for a particular analysis or model.
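As a hedged illustration of correlation-based feature selection, the sketch below ranks candidate features by the absolute value of their Pearson correlation with a target. The feature names, values, and the function `rank_features` are invented, and this simple filter ignores relationships between the features themselves.

```python
import math

def pearson(x, y):
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical data, invented for illustration only.
features = {
    "size": [1, 2, 3, 4],
    "noise": [2, 1, 2, 1],
}
target = [2, 4, 6, 8]
ranking = rank_features(features, target)
```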
Market Basket Analysis - Association mining is commonly used in retail and e-commerce industries to identify patterns in customer purchase behavior.
Medical Research - Correlation analysis is often used in medical research to explore relationships between different variables.
Financial Analysis - Correlation analysis is frequently used in financial analysis to measure the strength of relationships between different financial variables.
Fraud Detection - Association mining can be used to identify behavior patterns associated with fraudulent activity.