Chapter 8

Cards (29)

  • Predictive Data Mining
    Makes predictions about unknown data values by using known values, via classification, regression, time series analysis, prediction, etc.
  • Linear Regression
    A well-known supervised learning approach from classical statistics in which observations of a quantitative outcome and one or more corresponding features are used to create an equation for estimating the value of y.
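    A minimal sketch of fitting a linear regression with scikit-learn; the feature matrix X and outcome y below are made up for illustration:

        # Fit a linear regression and use it to estimate y for a new observation (hypothetical data).
        import numpy as np
        from sklearn.linear_model import LinearRegression

        X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature, four observations (made up)
        y = np.array([2.1, 3.9, 6.2, 7.8])           # quantitative outcome (made up)

        model = LinearRegression().fit(X, y)
        print(model.intercept_, model.coef_)         # estimated intercept and slope
        print(model.predict(np.array([[5.0]])))      # estimated y for a new observation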
  • Data Mining Process
    1. Data Sampling
    2. Data Preparation
    3. Data Partitioning
    4. Model Construction
    5. Model Assessment
  • Data Sampling - Extract a sample of data that is relevant to the business problem under consideration.
  • Data Preparation - Manipulate the data to put it in a form suitable for formal modeling.
  • Data Partitioning - Divide the sample data into three sets for the training, validation, and testing of the data mining algorithm's performance.
  • Training Set - Consists of the data used to build the candidate models; may be used, for example, to estimate the slope coefficients in a multiple regression model.
  • Validation Set - Used to identify which candidate model is likely the most accurate at predicting observations that were not used to build the model.
  • Testing Set - Used to conservatively estimate the selected model's effectiveness when applied to data that have not been used to build or select the model.
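    A minimal sketch of the three-way partition described in the preceding cards, using scikit-learn's train_test_split twice; the 50/30/20 proportions and the random data are assumptions for illustration:

        # Partition hypothetical data into training, validation, and test sets.
        import numpy as np
        from sklearn.model_selection import train_test_split

        X = np.random.rand(100, 3)         # hypothetical feature matrix
        y = np.random.randint(0, 2, 100)   # hypothetical binary outcome

        # Carve off 20% as the test set, then split the remainder 62.5/37.5 so the
        # overall proportions are roughly 50% training, 30% validation, 20% testing.
        X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
        X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.375, random_state=0)

        print(len(X_train), len(X_val), len(X_test))   # 50 30 20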
  • Model Assessment - Evaluate models by comparing their performance on the training and validation data sets; apply the selected model to the test data as a final appraisal of its performance.
  • Classification Error - Commonly displayed in a confusion matrix, which summarizes a model's correct and incorrect classifications.
  • Sensitivity or Recall - Ability to correctly predict class 1 (positive) observations
  • Specificity - Ability to correctly predict class 0 (negative) observations
  • Precision - The proportion of observations predicted to be class 1 that actually are class 1
  • F1 Score - Combines precision and sensitivity (recall) into a single measure (their harmonic mean)
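    A minimal sketch of a confusion matrix and the four measures above, computed from made-up actual and predicted class labels:

        # Confusion matrix and classification measures (hypothetical labels).
        from sklearn.metrics import confusion_matrix

        y_actual = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up actual classes
        y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predicted classes

        tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
        sensitivity = tp / (tp + fn)   # recall: share of actual class 1 predicted correctly
        specificity = tn / (tn + fp)   # share of actual class 0 predicted correctly
        precision   = tp / (tp + fp)   # share of predicted class 1 that actually is class 1
        f1 = 2 * precision * sensitivity / (precision + sensitivity)   # harmonic mean

        print(sensitivity, specificity, precision, f1)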
  • Average Error - The mean of the prediction errors (actual value minus predicted value)
    • If negative - the model tends to overestimate
    • If positive - the model tends to underestimate
  • Root Mean Squared Error - Measures how much the predicted values typically deviate from the actual values; the square root of the average squared prediction error
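    A minimal sketch of both measures, assuming the error is defined as actual minus predicted (values are made up):

        # Average error and root mean squared error (hypothetical values).
        import numpy as np

        actual    = np.array([10.0, 12.0, 15.0, 11.0])
        predicted = np.array([11.0, 12.5, 14.0, 11.5])

        errors = actual - predicted
        average_error = errors.mean()            # negative here -> the model tends to overestimate
        rmse = np.sqrt((errors ** 2).mean())     # typical size of a prediction error

        print(average_error, rmse)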
  • 3 Common Data Mining Techniques
    1. Logistic Regression
    2. K-nearest neighbors
    3. Classification Tree
  • Logistic Regression - Attempts to classify a binary categorical outcome (y = 0 or 1) by modeling the log odds of y = 1 as a linear function of the explanatory variables
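    A minimal sketch of logistic regression for a binary outcome with scikit-learn (data made up for illustration):

        # Logistic regression classifying a binary outcome (hypothetical data).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # one explanatory variable (made up)
        y = np.array([0, 0, 0, 1, 1, 1])                           # binary outcome (made up)

        clf = LogisticRegression().fit(X, y)
        print(clf.predict(np.array([[2.5], [4.5]])))         # predicted classes
        print(clf.predict_proba(np.array([[2.5], [4.5]])))   # estimated probability of each class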
  • Mallows' Cp Statistic - A measure commonly computed by statistical software that can be used to identify models with promising sets of variables
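    One commonly cited form of the statistic, where SSE_p is the sum of squared errors of a candidate model with p coefficients (including the intercept), n is the number of observations, and s^2 is the mean squared error of the model containing all candidate variables:

        C_p = SSE_p / s^2 - n + 2p

    Candidate models whose C_p value is close to p are considered promising.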
  • K-Nearest Neighbors - A method used either to classify a categorical outcome or to estimate a continuous outcome
  • K-Nearest Neighbors - Measures the similarity between observations; this is most appropriate when all features are continuous (quantitative)
  • K-NN (Lazy Learner) - Uses the entire training set to classify observations in the validation and test sets
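    A minimal sketch of k-nearest neighbors classification with scikit-learn, assuming made-up continuous features and k = 3:

        # k-nearest neighbors classification (hypothetical data, k = 3).
        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5], [5.5, 7.5]])
        y_train = np.array([0, 0, 1, 1, 0, 1])   # made-up class labels

        knn = KNeighborsClassifier(n_neighbors=3)   # similarity measured by Euclidean distance by default
        knn.fit(X_train, y_train)                   # "lazy": the model simply stores the training set

        # New observations are classified by a majority vote of their 3 nearest training observations.
        print(knn.predict(np.array([[1.1, 1.0], [5.8, 8.2]])))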
  • Classification and Regression Trees - Successively partition a data set of observations into increasingly smaller and more homogeneous subsets
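    A minimal sketch of a classification tree with scikit-learn (data made up; DecisionTreeRegressor works the same way for a continuous outcome):

        # Classification tree that successively partitions the data (hypothetical data).
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, export_text

        X = np.array([[25, 40000], [35, 60000], [45, 80000], [20, 20000], [50, 90000], [30, 50000]])
        y = np.array([0, 0, 1, 0, 1, 0])   # made-up class labels

        tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
        print(export_text(tree, feature_names=["age", "income"]))   # the successive splits as rules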
  • Ensemble Method - Predictions are made by combining the predictions of a collection of individual models
  • 3 Ways To Construct an Ensemble of Classification or Regression Trees
    1. Bagging Approach
    2. Boosting Approach
    3. Random Forest
  • Bagging Approach - Generates multiple training sets by repeated random sampling of the N observations in the original data with replacement; a base model is trained on each sample and their predictions are combined
  • Boosting Approach - Generates its committee of individual base models by sampling multiple training sets, where each successive training set gives more weight to the observations that earlier models predicted poorly
  • Random Forest - Generates multiple training sets by randomly sampling (with replacement) the N observations in the original data and, when growing each tree, considers only a random subset of the features at each split
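    A minimal sketch of the three approaches using scikit-learn's tree-based ensembles; the data and parameter values are made up, and AdaBoost is used here as one concrete boosting method:

        # Bagging, boosting, and random forest ensembles of trees (hypothetical data).
        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 4))               # made-up features
        y = (X[:, 0] + X[:, 1] > 0).astype(int)     # made-up binary outcome

        bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)       # bootstrap samples of the observations
        boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)     # later models weight earlier mistakes more heavily
        forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)   # bootstrap samples plus random feature subsets at each split

        for name, model in [("bagging", bagging), ("boosting", boosting), ("random forest", forest)]:
            print(name, model.score(X, y))   # accuracy on the same training data, just to show the fitted ensembles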