makes predictions about unknown data values by using known values, via classification, regression, time series analysis, prediction, etc.
Linear Regression
A well-known supervised learning approach from classical statistics in which observations of a quantitative outcome and one or more corresponding features are used to create an equation for estimating the value of y.
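A minimal sketch of fitting such an equation, assuming scikit-learn and a hypothetical data file with made-up column names (the file and columns are illustrative placeholders, not from the notes):

```python
# Minimal sketch: fitting a linear regression with scikit-learn.
# File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv")          # hypothetical data set
X = df[["ad_spend", "store_size"]]     # one or more feature columns (assumed names)
y = df["sales"]                        # quantitative outcome (assumed name)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimated equation: y ~ b0 + b1*x1 + b2*x2
```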
Data Mining Process
Data Sampling
Data Preparation
Data Partitioning
Model Construction
Model Assessment
Data Sampling - Extract a sample of data that is relevant to the business problem under consideration.
Data Preparation - Manipulate the data to put it in a form suitable for formal modeling.
Data Partitioning - Divide the sample data into three sets used to train, validate, and test the performance of the data mining algorithm.
Training Set - consists of the data used to build the candidate models. May be used to estimate the slope coefficients in a multiple regression model.
Validation Set - Used to identify which candidate model may be the most accurate at predicting observations that were not used to build the model.
Testing Set - Used to conservatively estimate the selected model's effectiveness when applied to data that have not been used to build or select the model.
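A sketch of the three-way partition, assuming scikit-learn's train_test_split and hypothetical X and y arrays (the 60/20/20 proportions are an illustrative assumption):

```python
# Minimal sketch: partition data into training, validation, and test sets.
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# Training set: build the candidate models
# Validation set: compare candidate models on held-out observations
# Test set: final, conservative estimate of the chosen model's performance
```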
Model Assessment - Evaluate models by comparing performance on the training and validation data sets. Apply the selected model to the test data as a final appraisal of the model’s performance
Classification Error - commonly displayed in a confusion matrix, which summarizes a model's correct and incorrect classifications.
Sensitivity or Recall - Ability to correctly predict class 1 (positive)
Specificity - Ability to correctly predict class 0 (negative)
Precision - Corresponds to the proportion of observations predicted to be class 1 that actually are class 1
F1 Score - Combines precision and sensitivity into a single measure (their harmonic mean)
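A small sketch of how these measures come out of a confusion matrix, assuming scikit-learn and hypothetical 0/1 arrays of actual and predicted labels:

```python
# Minimal sketch: confusion matrix and the related classification measures.
# y_actual and y_pred are hypothetical 0/1 label arrays.
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score

tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall: ability to catch class 1
specificity = tn / (tn + fp)   # ability to catch class 0
precision   = tp / (tp + fp)   # of those predicted class 1, how many truly are
f1 = 2 * precision * sensitivity / (precision + sensitivity)

# Equivalent scikit-learn helpers:
# recall_score(y_actual, y_pred), precision_score(y_actual, y_pred), f1_score(y_actual, y_pred)
```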
Average Error - The mean of the prediction errors (actual value minus predicted value)
If negative - the model tends to overestimate
If positive - the model tends to underestimate
Root Mean Squared Error - Provides a measure of how much the predicted value varies from the actual value
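A sketch of these two estimation-error measures, assuming errors are computed as actual minus predicted and the label arrays are hypothetical:

```python
# Minimal sketch: average error and root mean squared error.
# y_actual and y_pred are hypothetical numeric arrays; error = actual - predicted.
import numpy as np

errors = np.asarray(y_actual) - np.asarray(y_pred)
average_error = errors.mean()         # negative -> overestimating, positive -> underestimating
rmse = np.sqrt(np.mean(errors ** 2))  # typical size of the prediction error
```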
3 Common Data Mining Methods
Logistic Regression
K-nearest neighbors
Classification Tree
Logistic Regression - attempts to classify a binary categorical outcome (y = 0 or 1) as a linear function of the explanatory variables
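A minimal sketch of fitting a logistic regression classifier, assuming scikit-learn and the hypothetical partitioned data from the sketch above:

```python
# Minimal sketch: logistic regression for a binary outcome (y = 0 or 1).
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression().fit(X_train, y_train)
probs = logit.predict_proba(X_valid)[:, 1]   # estimated probability of class 1
classes = logit.predict(X_valid)             # predicted class using a 0.5 cutoff
```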
Mallow's Statistic - A measure commonly computed by statistical software that can be used to identify models with promising sets of variables
K-Nearest Neighbors - A method used either to classify a categorical outcome or to estimate a continuous outcome
K-Nearest Neighbors - Measures the similarity between observations; most appropriate when all the features are continuous
K-NN (Lazy Learner) - Uses the entire training set to classify observations in the validation and test sets
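A sketch of k-NN classification, assuming scikit-learn and the hypothetical partitioned data above; because similarity is distance-based, the features are standardized first, and the candidate values of k are illustrative:

```python
# Minimal sketch: k-nearest neighbors classification with standardized features.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_s, X_valid_s = scaler.transform(X_train), scaler.transform(X_valid)

# Try several values of k and keep the one most accurate on the validation set
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    print(k, knn.score(X_valid_s, y_valid))
```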
Classification and Regression Tree - successively partitions a data set of observations into increasingly smaller and more homogeneous subsets
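A sketch of a classification tree, assuming scikit-learn; the depth limit is an illustrative choice to keep the partitions from becoming too fine:

```python
# Minimal sketch: classification tree that successively partitions the data.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
print(tree.score(X_valid, y_valid))   # validation accuracy of the depth-limited tree
```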
Ensemble Method - Predictions are made by combining the predictions of a collection of individual models
3 Ways To Construct An Ensemble of Classification or Regression Trees
Bagging Approach
Boosting Approach
Random Forest
Bagging Approach - repeated random sampling of the N observations in the original data with replacement; each sample trains one base model
Boosting Approach - Generates its committee of individual base models by sampling multiple training sets sequentially, giving more weight to observations that earlier models mispredicted
Random Forest - generates multiple training sets by randomly sampling (with replacement) the N observations in the original data, and considers only a random subset of the features at each split of each tree
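A sketch of the three ensemble constructions, assuming scikit-learn's tree-based estimators and the hypothetical partitioned data above; all parameter values are illustrative:

```python
# Minimal sketch: three ways to build an ensemble of classification trees.
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier

# Bagging: repeated random sampling with replacement; default base estimator is a decision tree
bag = BaggingClassifier(n_estimators=100, random_state=1)

# Boosting: base trees built sequentially, each focusing on observations earlier trees mispredicted
boost = GradientBoostingClassifier(n_estimators=100, random_state=1)

# Random forest: bootstrap samples plus a random subset of features considered at each split
forest = RandomForestClassifier(n_estimators=100, random_state=1)

for model in (bag, boost, forest):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_valid, y_valid))
```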