CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
The CRISP-DM Reference Model is an open standard process for data mining that is industry, tool, and application neutral, defining tasks and outputs.
IBM has further developed CRISP-DM as the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM).
Clustering - groups data points such that points within one cluster are more similar to one another than to points in other clusters.
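A minimal sketch of clustering with scikit-learn's KMeans, using hypothetical 2-D points that form two obvious groups:

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical data: two well-separated groups of 2-D points
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # one cluster index per point
```

Points in the same group receive the same label; which group gets label 0 vs. 1 is arbitrary.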
Association Rule Discovery - produce dependency rules indicating that if the items in the LHS appear in a transaction, then the transaction will likely also contain the RHS item.
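Support and confidence, the two basic measures for such rules, can be computed directly. A sketch over hypothetical grocery transactions, evaluating the rule {bread, butter} → {milk}:

```python
# Hypothetical grocery transactions (each is a set of items)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk", "eggs"},
    {"milk", "eggs"},
]

lhs, rhs = {"bread", "butter"}, {"milk"}

# Support: fraction of transactions containing all items of LHS and RHS
n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
support = n_both / len(transactions)

# Confidence: of the transactions containing the LHS,
# the fraction that also contain the RHS
n_lhs = sum(1 for t in transactions if lhs <= t)
confidence = n_both / n_lhs
```

Here the rule holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions containing the LHS (confidence 2/3).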
Regression is studied in statistics and econometrics. Applications include predicting sales of new products based on advertising expenditure, predicting wind velocities, and time-series prediction of stock market indices.
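The advertising example can be sketched with scikit-learn's LinearRegression; the spend and sales figures below are hypothetical:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical data: advertising spend (in $1000s) vs. units sold
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])  # perfectly linear here: y = 10x

model = LinearRegression().fit(X, y)
predicted = model.predict([[5.0]])[0]  # forecast sales for a $5000 spend
```

Because the toy data is exactly linear, the ordinary-least-squares fit recovers the relationship and predicts 50 units.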
Data Mining Tasks - Methods
Descriptive Methods
Predictive Methods
Knowledge Discovery in Databases (KDD) Process
Selection
Understand domain
Preprocessing
Data normalization
Noise/outliers
Missing data
Transformation
Data dimensionality reduction
Feature engineering
Feature selection
Data Mining
Decide on task & algorithm
Performance?
Interpretation / Evaluation
Understand domain
Classification - Find a model for the class attribute as a function of the values of other attributes/features.
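A sketch of such a model using a decision tree classifier from scikit-learn; the attributes (age, income) and class labels are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical records: attributes [age, income in $1000s];
# class attribute: 1 = buys the product, 0 = does not
X = [[25, 30], [30, 40], [45, 80], [50, 90]]
y = [0, 0, 1, 1]

# The model expresses the class as a function of the other attributes
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict([[48, 85]])[0]  # classify a previously unseen record
```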
Other Data Mining Tasks
Text mining - document clustering, topic models
Graph mining - social networks
Data stream mining / real-time data mining
Mining spatiotemporal data (e.g., moving objects)
Visual data mining
Distributed data mining
Challenges of Data Mining
Scalability
Data ownership and privacy
Dimensionality
Data quality
Complexity and heterogeneous data
Origins of Data Mining
Statistics
Bayes' Theorem (1763)
Regression (1805)
Computer Age
AI
Turing (1936)
Neural Networks (1943)
Evolutionary Computation (1965)
Databases (1970s)
Genetic Algorithms (1975)
Data Mining
KDD (1989)
SVM (1992)
Data Science (2001)
Moneyball (2003)
Today
Big Data
Widespread adoption
DJ Patil (2015)
Chief Data Scientist, White House
Relationship to other Fields
Methods
Artificial Intelligence
Optimization
Statistics
Learning Strategy
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Online Learning
Combining a method with a learning strategy yields
Machine Learning
Data Mining
Statistical Learning
Learning Strategy:
From what data do we learn?
Is a training set with correct answers available? → Supervised learning
Long-term structure of rewards? → Reinforcement learning
No answer and no reward structure? → Unsupervised learning
Do we have to update the model regularly? → Online learning
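The questions above can be read as a small decision procedure. A toy sketch (the function and its flags are hypothetical, and real problems often combine strategies):

```python
def learning_strategy(has_labels, has_rewards, needs_regular_updates):
    """Toy dispatcher mirroring the questions above; a sketch, not a taxonomy."""
    if needs_regular_updates:
        return "online learning"        # model must be updated regularly
    if has_labels:
        return "supervised learning"    # training set with correct answers
    if has_rewards:
        return "reinforcement learning" # long-term structure of rewards
    return "unsupervised learning"      # no answers, no rewards
```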
Machine Learning involves the study of algorithms that can extract information automatically, i.e., without on-line human guidance.
Data Mining Tool Types
Simple Graphical User Interface
Process Oriented
Programming Oriented
Tools: Simple GUI
Weka: Waikato Environment for Knowledge Analysis (Java API)
Rattle: GUI for Data Mining using R
Tools: Process oriented
SAS Enterprise Miner
IBM SPSS Modeler
RapidMiner
Knime
Orange
Tools: Programming oriented
R
Rattle for beginners
RStudio IDE, markdown, shiny
Microsoft Open R
Python
Numpy, scikit-learn, pandas
Jupyter notebook
Both have similar capabilities. Slightly different focus:
R: statistical computing and visualization
Python: Scripting, big data
Interoperability via rpy2 and reticulate
Data Warehouse
Data -> Information -> Knowledge
Data Warehouse
Subject Oriented: Data warehouses are designed to help you analyze data (e.g., sales data is organized by product and customer).
Integrated: Integrates data from disparate sources into a consistent format.
Nonvolatile: Data in the data warehouse are never overwritten or deleted.
Time Variant: maintains both historical and (nearly) current data
ETL: Extract, Transform and Load
Extracting data from outside sources
Transforming data to fit analytical needs. E.g.,
Clean missing data, wrong data, etc.
Normalize and translate (e.g., 1 → "female")
Join from several sources
Calculate and aggregate data
Loading data into the data warehouse
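Two of the transformation steps above (translating codes, cleaning missing data) can be sketched with pandas; the source table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw extract: coded gender values, a missing age
raw = pd.DataFrame({"gender_code": [1, 2, 1],
                    "age": [34.0, None, 51.0]})

clean = raw.copy()
# Normalize and translate: map codes to readable values (1 -> "female")
clean["gender"] = clean["gender_code"].map({1: "female", 2: "male"})
# Clean missing data: here, fill missing ages with the column mean
clean["age"] = clean["age"].fillna(clean["age"].mean())
```

In a real ETL pipeline the cleaned frame would then be loaded into the warehouse.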
On-Line Analytical Processing (OLAP)
Store data in "data cubes" for fast OLAP operations.
Requires a special database structure (snowflake schema).
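The idea of a data cube can be approximated with a pandas pivot table; this is only a sketch over a hypothetical fact table, not a real OLAP engine:

```python
import pandas as pd

# Hypothetical fact table: sales amounts by product and region
sales = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "region":  ["EU", "US", "EU", "US"],
    "amount":  [100, 150, 200, 250],
})

# A two-dimensional "cube" view: total amount per product x region cell
cube = sales.pivot_table(index="product", columns="region",
                         values="amount", aggfunc="sum")
```

Slicing a cell (e.g., product A in the EU) or rolling up along one dimension then becomes a simple lookup or aggregation.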
Big Data
"Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them." Wikipedia
Data Mining is interdisciplinary and overlaps significantly with many fields.
Statistics
CS (machine learning, AI, databases)
Optimization
Data Mining requires a team effort with members who have expertise in several areas:
Data management
Statistics
Programming
Communication
Application domain
Knowledge Discovery in Databases (KDD) Process
Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns/Models → Interpretation/Evaluation → Knowledge
Reasons to mine data from a commercial viewpoint
Businesses collect and warehouse lots of data
Computers are cheaper and more powerful
Competition to provide better services
Mass customization and recommendation systems
Targeted advertising
Improved logistics
Types of Attributes - Scales
Categorical, Qualitative
Nominal
Ordinal
Quantitative
Interval
Ratio
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes (e.g., from a relational database)
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute
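A sketch of this representation with NumPy, using hypothetical measurements:

```python
import numpy as np

# m = 3 data objects, n = 2 numeric attributes (hypothetical values);
# each row is a point in a 2-dimensional attribute space
data_matrix = np.array([[1.5, 2.0],
                        [3.0, 4.5],
                        [5.0, 6.0]])

m, n = data_matrix.shape  # m rows (objects), n columns (attributes)
```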
Types of data sets
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Document Data
Each document becomes a `term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
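A minimal sketch of building such term vectors in plain Python (the two documents are hypothetical):

```python
from collections import Counter

# Hypothetical documents
docs = ["data mining finds patterns in data",
        "mining graph data"]

# Shared vocabulary: each distinct term is one component of every vector
vocab = sorted({term for doc in docs for term in doc.split()})

# Each document becomes a vector of term counts over the vocabulary
vectors = [[Counter(doc.split())[term] for term in vocab] for doc in docs]
```

For the first document the component for "data" is 2, since the term occurs twice.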
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
Graph Data
Examples: Generic graph and HTML Links (webpages), a molecule
Ordered Data
Sequences of transactions
Genomic sequence data
Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
In a classification task, class information is available → supervised learning.