Lecture 1.2: Introduction to Data Mining

Cards (33)

  • Data mining
    Also called knowledge discovery and data mining (KDD); the extraction of useful patterns from data sources such as databases, text, the web, and images
  • Patterns from data mining
    • Valid, novel, potentially useful, understandable
  • Knowledge Discovery in Data
    1. Data
    2. Patterns
    3. Knowledge
  • Knowledge Discovery in Data: Process
    • Interpretation/Evaluation
  • Data
    Volume (Big Data, Small Data), Variety (Transaction, Temporal, Spatial), Velocity (Data Stream, Static)
  • Transactional Data
    Record data with transactions
  • Temporal Data
    Time Series Data, Sequence Data
  • Spatial & Spatial-Temporal Data
    Spatial Data, Spatial-Temporal Data
  • Data Preprocessing
    Missing Values, Summarization
  • Data come from everywhere (hospitals, weather stations, grocery markets, e-commerce, stock exchanges, social media)
  • Data
    A collection of records and their attributes; an attribute is a characteristic of an object
  • Types of Data
    • Record Data, Temporal Data, Spatial & Spatial-Temporal Data, Graph Data, Unstructured Data, Semi-Structured Data
  • Record Data
    Transactional Data
  • Market-Basket Dataset
    • Bread, Coke, Milk
    • Beer, Bread
    • Beer, Coke, Diaper, Milk
    • Beer, Bread, Diaper, Milk
    • Coke, Diaper, Milk
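A market-basket dataset like the one above is the usual input for the Frequent Patterns and Association Rules tasks. The following is a minimal sketch, not the lecture's method: it counts how many of the five transactions contain each itemset. The helper name `support_counts` and the threshold `min_count = 3` are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the Market-Basket Dataset card.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support_counts(transactions, size):
    """Count how many transactions contain each itemset of the given size."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return counts


# min_count is a hypothetical threshold, not a value from the lecture.
min_count = 3
pair_counts = support_counts(transactions, 2)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

for pair, n in sorted(frequent_pairs.items()):
    print(pair, n)
```

Enumerating every itemset this way only works for tiny datasets; it is meant to show what "support" counts, not how to compute it efficiently.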
  • Data Matrix
    If data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  • Data Matrix Example for Documents
    • Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document
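The term-vector idea can be sketched in a few lines. This is an illustration under stated assumptions: the two documents are made up, terms are split on whitespace only, and the helper name `term_vector` is hypothetical.

```python
from collections import Counter

# Two made-up documents; the text is illustrative only.
documents = [
    "data mining finds patterns in data",
    "patterns in time series data",
]

# Vocabulary: one column (attribute) per distinct term, in sorted order.
vocabulary = sorted({term for doc in documents for term in doc.split()})


def term_vector(doc, vocabulary):
    """Return the term counts of doc, ordered by vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


# Each row is one document; each column is one term.
data_matrix = [term_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for row in data_matrix:
    print(row)
```

Because every document is mapped onto the same vocabulary, all rows have the same fixed set of numeric attributes, which is what makes the result a data matrix.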
  • Distance Matrix
    Represents the distances between data points
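As a rough sketch of a distance matrix: entry (i, j) holds the distance between data point i and data point j. The card does not name a metric, so Euclidean distance is assumed here, and the four points are made up for illustration.

```python
import math

# Four made-up 2-D points; each tuple is one data object.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 5.0)]


def distance_matrix(points):
    """Return the n x n matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


matrix = distance_matrix(points)

for row in matrix:
    print([round(d, 2) for d in row])
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper or lower triangle needs to be stored.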
  • Temporal Data
    Sequence Data, Time Series Data
  • Temporal Data
    • Patient Data, Yahoo Finance Website, Biological Sequence Data, Interval Data
  • Spatial & Spatial-Temporal Data
    Spatial Data, Trajectory Data
  • Spatial & Spatial-Temporal Data
    • Spatial Distribution of Objects, Average Monthly Temperature, Dengue Disease Dataset, Hurricane Trajectories, User Movement Data
  • Graph Data
    Data with graph structure
  • Semi-structured Data
    Data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
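JSON is one common example of semi-structured data. The sketch below uses a made-up record (all field names and values are hypothetical) to show keys acting as markers and nesting giving the hierarchy, while the two visit records do not share the same set of fields as rows in a table would.

```python
import json

# A made-up record: keys act as the tags or markers that separate
# semantic elements; nesting gives the hierarchy of records and fields.
raw = """
{
  "patient": {
    "id": "p-001",
    "visits": [
      {"date": "2024-01-05", "temperature_c": 38.2},
      {"date": "2024-02-11", "notes": "follow-up"}
    ]
  }
}
"""

record = json.loads(raw)
visits = record["patient"]["visits"]

# Unlike rows in a relational table, the visits have different fields.
fields_per_visit = [sorted(visit.keys()) for visit in visits]
print(fields_per_visit)
```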
  • Unstructured Data
    Data with no predefined format or organization, making it much more difficult to collect, process, and analyze
  • Data can help us solve specific problems
  • Data Mining Tasks
    • Clustering
    • Classification
    • Frequent Patterns
    • Association Rules
  • What people do with time series data
    • Clustering
    • Classification
    • Query by Content
    • Rule Discovery
    • Motif Discovery
    • Novelty Detection
    • Visualization
    • Motif Association
  • What people do with trajectory data
    • Clustering
    • Motif Discovery
    • Visualization
    • Frequent Travel Patterns
    • Classification
    • Prediction
  • Types of Data
    • Transactional Data
    • Sequence Data
    • Interval Data
    • Time Series Data
    • Spatial Data
    • Spatio-Temporal Data
    • Data Set with Multiple Kinds of Data
  • Data Mining Methods
    • Frequent Pattern Discovery
    • Classification
    • Clustering
    • Outlier Detection
    • Statistical Analysis
  • Distinctions between statistics, machine learning, and data mining are fuzzy
  • Visualization facilitates human discovery and presents discovered results in a visually "nice" way
  • Summarization describes features of a selected group using natural language and graphics, usually in combination with deviation detection or other methods