Data Mining

  • CRISP-DM stands for Cross Industry Standard Process for Data Mining.
  • The CRISP-DM Reference Model is an open standard process for data mining that is industry, tool, and application neutral, defining tasks and outputs.
  • IBM has further developed CRISP-DM as the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM).
  • Clustering - groups data points such that points within a cluster are more similar to one another than to points in other clusters.
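    A minimal clustering sketch using scikit-learn's KMeans; the toy points and the choice of k=2 are illustrative assumptions, not from the source.

    ```python
    # Minimal k-means clustering sketch (toy data, k=2 assumed).
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 9.0]])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster assignment for each point
    print(km.cluster_centers_)  # one centroid per cluster
    ```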
  • Association Rule Discovery - produces dependency rules indicating that if the set of items on the LHS appears in a transaction, then the transaction will likely also contain the RHS item(s).
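    A hedged sketch of association rule discovery, assuming the third-party mlxtend library is available; the grocery transactions are made-up toy data.

    ```python
    # Association rule sketch with mlxtend (assumed installed).
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    transactions = [["bread", "milk"], ["bread", "diapers", "beer"],
                    ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)
    itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
    # Rules of the form LHS -> RHS with at least 70% confidence.
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])
    ```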
  • Regression is studied in statistics and econometrics, with applications such as predicting sales of new products from advertising expenditure, predicting wind velocities, and time-series prediction of stock market indices.
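    A minimal regression sketch for the sales-from-advertising example; the five (spend, sales) pairs are invented toy data.

    ```python
    # Predict sales from advertising spend with ordinary least squares.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    ad_spend = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
    sales = np.array([25.0, 41.0, 58.0, 75.0, 92.0])
    model = LinearRegression().fit(ad_spend, sales)
    print(model.coef_, model.intercept_)  # fitted slope and intercept
    print(model.predict([[60.0]]))        # forecast for a new budget
    ```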
  • Data Mining Tasks - Methods
    • Descriptive Methods
    • Predictive Methods
  • Knowledge Discovery in Databases (KDD) Process (see the pipeline sketch after this list)
    • Selection
      • Understand domain
    • Preprocessing
      • Data normalization
      • Noise/outliers
      • Missing data
    • Transformation
      • Data dimensionality reduction
      • Feature engineering
      • Feature selection
    • Data Mining
      • Decide on task & algorithm
      • Performance?
    • Interpretation/Evaluation
      • Understand domain
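    One way to make the KDD steps concrete is a scikit-learn Pipeline: normalization as preprocessing, PCA as transformation, a classifier as the mining step, and cross-validation as evaluation. The iris dataset and all hyperparameters are illustrative assumptions.

    ```python
    # KDD-style pipeline sketch: preprocess -> transform -> mine -> evaluate.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    pipe = Pipeline([
        ("scale", StandardScaler()),      # preprocessing: normalization
        ("reduce", PCA(n_components=2)),  # transformation: dim. reduction
        ("mine", DecisionTreeClassifier(random_state=0)),  # data mining
    ])
    # "Performance?" -> cross-validated accuracy as the evaluation step.
    print(cross_val_score(pipe, X, y, cv=5).mean())
    ```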
  • Classification - Find a model for the class attribute as a function of the values of other attributes/features.
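    A small classification sketch: fit a model that predicts the class attribute from other features. The records, labels, and k=3 are invented for illustration.

    ```python
    # Classification with k-nearest neighbours (toy data, k=3 assumed).
    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[170, 60], [180, 85], [160, 50], [175, 80]]  # height, weight
    y_train = ["A", "B", "A", "B"]                          # class attribute
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(clf.predict([[165, 55]]))  # predicted class for an unseen record
    ```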
  • Other Data Mining Tasks
    • Text mining - document clustering, topic models
    • Graph mining - social networks
    • Data stream mining / real-time data mining
    • Mining spatiotemporal data (e.g., moving objects)
    • Visual data mining
    • Distributed data mining
  • Challenges of Data Mining
    • Scalability
    • Data ownership and privacy
    • Dimensionality
    • Data quality
    • Complexity and heterogeneous data
  • Origins of Data Mining
    • Statistics
      • Bayes' Theorem (1763)
      • Regression (1805)
    • Computer Age
      • AI
        • Turing (1936)
        • Neural Networks (1943)
        • Evolutionary Computation (1965)
      • Databases (1970s)
      • Genetic Algorithms (1975)
    • Data Mining
      • KDD (1989)
      • SVM (1992)
      • Data Science (2001)
      • Moneyball (2003)
    • Today
      • Big Data
      • Widespread adoption
      • DJ Patil (2015)
        • Chief Data Scientist, White House
  • Relationship to other Fields
    • Methods
      • Artificial Intelligence
      • Optimization
      • Statistics
    • Learning Strategy
      • Supervised Learning
      • Unsupervised Learning
      • Reinforcement Learning
      • Online Learning
    • Method and Learning Strategy together create
      • Machine Learning
      • Data Mining
      • Statistical Learning
  • Learning Strategy:
    • From what data do we learn?
    • Is a training set with correct answers available? → Supervised learning
    • Long-term structure of rewards? → Reinforcement learning
    • No answer and no reward structure? → Unsupervised learning
    • Do we have to update the model regularly? → Online learning
  • Machine Learning involves the study of algorithms that can extract information automatically, i.e., without online human guidance.
  • Data Mining Tool Types
    • Simple Graphical User Interface
    • Process Oriented
    • Programming Oriented
  • Tools: Simple GUI
    • Weka: Waikato Environment for Knowledge Analysis (Java API)
    • Rattle: GUI for Data Mining using R
  • Tools: Process oriented
    • SAS Enterprise Miner
    • IBM SPSS Modeler
    • RapidMiner
    • Knime
    • Orange
  • Tools: Programming oriented
    • R
      • Rattle for beginners
      • RStudio IDE, markdown, shiny
      • Microsoft R Open
    • Python
      • NumPy, scikit-learn, pandas
      • Jupyter notebook
    • Both have similar capabilities, with a slightly different focus:
      • R: statistical computing and visualization
      • Python: scripting, big data
    • Interoperability via rpy2 and reticulate
  • Data Warehouse
    • Data -> Information -> Knowledge
  • Data Warehouse
    • Subject Oriented: Data warehouses are designed to help you analyze data (e.g., sales data is organized by product and customer).
    • Integrated: Integrates data from disparate sources into a consistent format.
    • Nonvolatile: Data in the data warehouse are never overwritten or deleted.
    • Time Variant: maintains both historical and (nearly) current data
  • ETL: Extract, Transform and Load (a pandas sketch follows this list)
    • Extracting data from outside sources
    • Transforming data to fit analytical needs, e.g.:
    • Clean missing data, wrong data, etc.
    • Normalize and translate (e.g., 1 → "female")
    • Join from several sources
    • Calculate and aggregate data
    • Loading data into the data warehouse
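    A pandas sketch of the ETL steps above; the file name, column names, and SQLite target are hypothetical stand-ins for a real source system and warehouse.

    ```python
    # ETL sketch: extract from a CSV, transform, load into a warehouse table.
    import sqlite3
    import pandas as pd

    raw = pd.read_csv("sales_raw.csv")                  # Extract (hypothetical file)
    raw = raw.dropna(subset=["amount"])                 # clean missing data
    raw["gender"] = raw["gender"].map({1: "female", 2: "male"})  # translate codes
    monthly = (raw.groupby(["product", "month"], as_index=False)
                  .agg(total=("amount", "sum")))        # calculate and aggregate
    with sqlite3.connect("warehouse.db") as con:        # Load (hypothetical target)
        monthly.to_sql("sales_fact", con, if_exists="append", index=False)
    ```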
  • OnLine Analytical Processing (OLAP)
    • Store data in "data cubes" for fast OLAP operations (a pivot-table sketch follows).
    • Requires a special database structure (snowflake schema).
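    Real OLAP servers precompute cubes over a star/snowflake schema; as a rough analogy, a pandas pivot table aggregates a measure along two dimensions. The data and dimension names are assumptions.

    ```python
    # OLAP-style "cube" sketch: roll up amount by region x quarter.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["N", "N", "S", "S"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "amount":  [100, 120, 90, 130],
    })
    cube = sales.pivot_table(values="amount", index="region",
                             columns="quarter", aggfunc="sum")
    print(cube)
    ```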
  • Big Data
    • "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them." Wikipedia
    • 3 V's (Gartner): volume, velocity, variety; veracity is sometimes added as a fourth
  • Characteristics of Big Data
    • Volume - Scale of Data
    • Variety - Different forms of Data
    • Velocity - Speed at which data is generated and analyzed (streaming data)
    • Veracity - Uncertainty of Data
  • Legal, Privacy and Security Issues
    • Are we allowed to collect the data?
    • Are we allowed to use the data?
    • Is privacy preserved in the process?
    • Is it ethical to use and act on the data?
  • Data Mining is interdisciplinary and overlaps significantly with many fields.
    • Statistics
    • CS (machine learning, AI, databases)
    • Optimization
  • Data Mining requires a team effort with members who have expertise in several areas:
    • Data management
    • Statistics
    • Programming
    • Communication
    • Application domain
  • Knowledge Discovery in Databases (KDD) Process
    • Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns/Models -> Interpretation/Evaluation -> Knowledge
  • Reasons to mine data from a commercial viewpoint
    • Businesses collect and warehouse lots of data
    • Computers are cheaper and more powerful
    • Competition to provide better services
    • Mass customization and recommendation systems
    • Targeted advertising
    • Improved logistics
  • Types of Attributes - Scales (a pandas sketch follows this list)
    • Categorical (Qualitative)
      • Nominal
      • Ordinal
    • Quantitative
      • Interval
      • Ratio
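    A pandas sketch of the scales above: nominal categories have no order, ordinal ones do, and numeric columns carry interval/ratio values. Column names and values are invented.

    ```python
    # Attribute scales in pandas (toy columns).
    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red"],    # nominal
                       "size": ["S", "L", "M"],            # ordinal
                       "temp_c": [20.5, 31.0, 25.2],       # interval
                       "weight_kg": [1.2, 3.4, 2.0]})      # ratio
    df["color"] = df["color"].astype("category")
    df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"],
                                ordered=True)
    print(df["size"] > "S")  # order comparisons only make sense for ordinal
    ```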
  • Record Data
    A collection of records, each of which has a fixed set of attributes (e.g., rows from a relational database)
  • Data Matrix
    • If data objects have the same fixed set of numeric attributes, then they can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
    • Such a data set can be represented by an m-by-n matrix, with m rows, one per object, and n columns, one per attribute (see the NumPy sketch below).
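    A minimal NumPy illustration of the m-by-n view: each row is one object, each column one attribute, so every object is a point in n-dimensional space. The numbers are arbitrary.

    ```python
    # Data matrix: m objects x n numeric attributes.
    import numpy as np

    X = np.array([[1.70, 65.0, 30.0],   # object 1
                  [1.82, 80.0, 42.0],   # object 2
                  [1.60, 55.0, 25.0]])  # object 3
    m, n = X.shape
    print(m, n)   # 3 objects, 3 attributes
    print(X[1])   # object 2 as a point in 3-dimensional attribute space
    ```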
  • Types of data sets
    • Record
      • Data Matrix
      • Document Data
      • Transaction Data
    • Graph
      • World Wide Web
      • Molecular Structures
    • Ordered
      • Spatial Data
      • Temporal Data
      • Sequential Data
      • Genetic Sequence Data
  • Document Data
    • Each document becomes a "term" vector; each term is a component (attribute) of the vector; the value of each component is the number of times the corresponding term occurs in the document (see the sketch below).
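    A sketch of term vectors using scikit-learn's CountVectorizer; the two documents are toy examples.

    ```python
    # Each document becomes a vector of term counts.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["data mining finds patterns in data",
            "mining gold is hard work"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)         # sparse document-term matrix
    print(vec.get_feature_names_out())  # the terms (one per component)
    print(X.toarray())                  # term counts per document
    ```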
  • Transaction Data
    • A special type of record data, where each record (transaction) involves a set of items (see the sketch after this card).
    • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
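    A small sketch of the grocery example: each transaction is a set of items, and a binary item matrix is a common flat encoding. The items are invented.

    ```python
    # Transactions as sets of items, flattened to a boolean item matrix.
    import pandas as pd

    transactions = [{"bread", "milk"},
                    {"bread", "diapers", "beer"},
                    {"milk", "beer"}]
    items = sorted(set().union(*transactions))
    matrix = pd.DataFrame([[item in t for item in items] for t in transactions],
                          columns=items)
    print(matrix)  # one row per transaction, one column per item
    ```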
  • Graph Data
    • Examples: Generic graph and HTML Links (webpages), a molecule
  • Ordered Data
    • Sequences of transactions
    • Genomic sequence data
  • Data Quality
    • Examples of data quality problems (a pandas sketch follows):
      • Noise and outliers
      • Missing values
      • Duplicate data
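    A pandas sketch of handling the three problems above; the toy values and the 0-120 "plausible age" filter are assumptions.

    ```python
    # Data-quality sketch: duplicates, missing values, crude outlier filter.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, 31, np.nan, 31, 420],
                       "city": ["A", "B", "B", "B", "A"]})
    df = df.drop_duplicates()                         # duplicate data
    df["age"] = df["age"].fillna(df["age"].median())  # missing values
    df = df[df["age"].between(0, 120)]                # filter obvious outliers
    print(df)
    ```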
  • In a classification task, class information is available → Supervised Learning.