final

Cards (71)

  • Accuracy
    An attractive measure because it is intuitive and widely understood, and naturally extends to multi-class scenarios
  • Problems with accuracy
    • If the outcome is unbalanced, it can be misleading: a model can achieve high accuracy simply by predicting the majority class for all observations
    • If the outcome is unbalanced, selecting among model configurations with accuracy can be biased toward configurations that predict the majority class
    • Regardless of balance, it considers false positives and false negatives equivalent in their costs
  • Unbalanced outcome distributions

    • May start to be considered unbalanced at ratios of 1:5 (20%/80%)
    • In many real-life applications (e.g., fraud detection), imbalance ratios ranging from 1:1000 to 1:5000 are not atypical
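
A minimal sketch (made-up labels, NumPy assumed available) of why accuracy misleads under heavy imbalance: a classifier that always predicts the majority class looks nearly perfect while catching no positive cases.

```python
# Sketch: accuracy can look excellent on an unbalanced outcome
# even when the model never detects the positive class.
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.001).astype(int)   # ~1:1000 imbalance
y_pred = np.zeros_like(y_true)                       # always predict the majority (negative) class

accuracy = (y_pred == y_true).mean()
sensitivity = y_pred[y_true == 1].mean() if (y_true == 1).any() else float("nan")

print(f"accuracy:    {accuracy:.4f}")    # ~0.999, looks great
print(f"sensitivity: {sensitivity:.4f}") # 0.0, every positive case is missed
```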
  • False positives (screening someone as positive when they do not have heart disease)

    Come with some monetary cost and also likely some distress for the patient
  • False negatives (screening someone as negative when they do have heart disease)

    Mean we send the patient home thinking they are healthy and may suffer a heart attack or other bad outcome
  • Confusion Matrix

    • A perfect classifier has all observations in the true positive and true negative cells
    • The two types of error (false negatives and false positives) have different costs
  • Accuracy
    (TN + TP) / (TN + TP + FN + FP)
  • Sensitivity (recall)

    TP / (TP + FN)
  • Specificity

    TN / (TN + FP)
  • Positive Predictive Value (precision)

    TP / (TP + FP)
  • Negative predictive value

    TN / (TN + FN)
  • If prevalence of positive class is low, your classifier's PPV will be lower, even with good sensitivity and specificity
  • If prevalence of negative class is low, your classifier's NPV will be lower, even with good sensitivity and specificity
  • Balanced Accuracy

    (sensitivity + specificity) / 2
  • F1 Score

    2 * (precision*recall)/(precision+recall)
  • Cohen's Kappa
    • Compares observed accuracy to expected accuracy (random chance)
    • Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
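
A sketch pulling the formulas above together: all of these metrics computed directly from confusion-matrix counts. The counts here are made up for illustration.

```python
# Metrics computed from hypothetical confusion-matrix counts
TP, FP, TN, FN = 40, 10, 900, 50

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)             # recall
specificity = TN / (TN + FP)
ppv         = TP / (TP + FP)             # precision
npv         = TN / (TN + FN)
balanced_accuracy = (sensitivity + specificity) / 2
f1 = 2 * (ppv * sensitivity) / (ppv + sensitivity)

# Cohen's kappa: expected accuracy under chance agreement given the marginals
n = TP + TN + FP + FN
p_yes = ((TP + FP) / n) * ((TP + FN) / n)   # chance both say "positive"
p_no  = ((TN + FN) / n) * ((TN + FP) / n)   # chance both say "negative"
expected_accuracy = p_yes + p_no
kappa = (accuracy - expected_accuracy) / (1 - expected_accuracy)

print(accuracy, sensitivity, specificity, ppv, npv, balanced_accuracy, f1, kappa)
```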
  • ROC curve

    • Provides a nice summary visualization and metric when considering classification thresholds other than 50% (also common when classes are unbalanced or the types of error carry different costs)
    • A more formal method to visualize the trade-offs between sensitivity and specificity across all possible thresholds for classification
  • Area under the ROC curve (auROC)

    • Ranges from approximately 0.5 (a random classifier) up to 1 (a perfect classifier)
    • Values between 0.7-0.8 are considered fair
    • Values between 0.8-0.9 are considered good
    • Values above 0.9 are considered excellent
    • Probability that the classifier will rank (i.e., assign a higher predicted score to) a randomly selected positive observation above a randomly selected negative observation
    • Alternatively, it can be thought of as the average sensitivity across all decision thresholds
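
A sketch of the probabilistic interpretation of auROC: the proportion of (positive, negative) pairs in which the positive case receives the higher predicted score. Scores and labels below are made up for illustration.

```python
import numpy as np

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.2, 0.1])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score against every negative score; ties count 0.5
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auroc = (wins + 0.5 * ties) / (len(pos) * len(neg))
print(auroc)  # matches sklearn.metrics.roc_auc_score(y_true, y_score)
```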
  • Aggregate measures are the best choice for model selection
  • For classification: Accuracy, Balanced accuracy, auROC, Kappa
  • For regression: RMSE, R^2, MAE (mean absolute error)
  • Use the performance metric that is best aligned with your problem
  • Decreasing the threshold (probability) for classifying a case as positive

    • Increases sensitivity and decreases specificity
    • Decreases FN but increases FP
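
A small sketch of this trade-off, with made-up predicted probabilities and labels: lowering the threshold raises sensitivity and lowers specificity.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p_hat  = np.array([0.9, 0.6, 0.45, 0.3, 0.55, 0.4, 0.35, 0.2, 0.1, 0.05])

def sens_spec(threshold):
    y_pred = (p_hat >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    return tp / (tp + fn), tn / (tn + fp)

print(sens_spec(0.5))   # default threshold
print(sens_spec(0.3))   # lower threshold: sensitivity up, specificity down
```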
  • Up-Sampling resamples the minority class observations with replacement within the training set to increase the number of total observations of the minority class
  • Down-Sampling resamples majority class observations within the training set to reduce their count to match the number of minority class observations
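
A sketch of both resampling schemes using NumPy only; the class labels and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
y_train = np.array([0] * 95 + [1] * 5)     # 95 majority, 5 minority observations
idx_majority = np.where(y_train == 0)[0]
idx_minority = np.where(y_train == 1)[0]

# Up-sampling: resample the minority class WITH replacement up to the majority size
up_idx = np.concatenate([
    idx_majority,
    rng.choice(idx_minority, size=len(idx_majority), replace=True),
])

# Down-sampling: sample the majority class down to the minority size
down_idx = np.concatenate([
    rng.choice(idx_majority, size=len(idx_minority), replace=False),
    idx_minority,
])

print(len(up_idx), len(down_idx))  # 190 and 10 training indices
```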
  • SMOTE
    • Synthetic minority over-sampling technique
    • Up-samples the minority class by synthesizing new observations
    • An observation is randomly selected from the minority class; its K nearest neighbors are determined; the new synthetic observation keeps the minority outcome but takes a random combination of predictor values from the selected observation and its neighbors (see the sketch below)
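
A rough sketch of that synthesis step, using NumPy only and made-up minority-class data; real implementations (e.g., imbalanced-learn) handle edge cases and categorical features differently.

```python
import numpy as np

rng = np.random.default_rng(2)
X_minority = rng.normal(size=(20, 3))    # hypothetical minority-class predictors
k = 5

# 1. Randomly select a minority observation
i = rng.integers(len(X_minority))
x = X_minority[i]

# 2. Find its k nearest minority-class neighbors (excluding itself)
dists = np.linalg.norm(X_minority - x, axis=1)
neighbors = np.argsort(dists)[1:k + 1]

# 3. Synthesize a new observation between x and one randomly chosen neighbor;
#    it keeps the minority outcome label
neighbor = X_minority[rng.choice(neighbors)]
gap = rng.random()                        # random interpolation weight
x_new = x + gap * (neighbor - x)
print(x_new)
```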
  • Decision Trees
    • Tree based statistical algorithms
    • A class of flexible, non-parametric algorithms
    • Work by partitioning the feature space into a number of smaller non-overlapping regions with similar responses by using a set of splitting rules
    • Make predictions by assigning a single prediction to each of these regions
    • Can produce simple rules that are easy to interpret and visualize with tree diagrams
    • Typically lag behind other common algorithms in predictive performance
    • Serve as base learners for more powerful ensemble approaches
  • Decision Tree Partitioning
    1. Partitions training data into homogeneous subgroups (i.e., groups with similar response values)
    2. Nodes are formed recursively using binary partitions by asking simple yes-or-no questions about each feature
    3. This is repeated until a suitable stopping criterion is satisfied
    4. After all the partitioning has been done, the model predicts a single value for each region
  • Binary Recursive Partitioning

    • Each split (or rule) depends on the splits above it
    • The algorithm first identifies the "best" feature to partition the observations in the root node into one of two new regions
    • For regression problems, the "best" feature is the one that maximizes the reduction in SSE
    • For classification the split is selected to maximize the reduction in cross-entropy
    • The splitting process is then repeated on each of the two new nodes
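
A sketch of one step of binary recursive partitioning for regression: search every feature and split point for the split that maximizes the reduction in SSE. The data are simulated; a real tree repeats this search recursively within each new node.

```python
import numpy as np

def sse(y):
    # Sum of squared errors around the node mean
    return ((y - y.mean()) ** 2).sum() if len(y) else 0.0

def best_split(X, y):
    best = (None, None, -np.inf)          # (feature index, split value, SSE reduction)
    parent_sse = sse(y)
    for j in range(X.shape[1]):
        for value in np.unique(X[:, j]):
            left, right = y[X[:, j] <= value], y[X[:, j] > value]
            reduction = parent_sse - (sse(left) + sse(right))
            if len(left) and len(right) and reduction > best[2]:
                best = (j, value, reduction)
    return best

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = np.where(X[:, 2] > 0, 5.0, 1.0) + rng.normal(scale=0.5, size=100)
print(best_split(X, y))                   # should recover a split on feature 2 near 0
```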
  • Early Stopping

    • Explicitly stop the growth of the tree early
    • A maximum tree depth is reached
    • The node has too few cases to be considered for further splits
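
A sketch of early stopping expressed through scikit-learn's tree hyperparameters (max_depth caps tree depth; min_samples_split keeps small nodes from splitting further). The data are simulated for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```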
  • Pruning
    • We let the tree grow large and then prune it back to an optimal size
    • Apply a penalty to the cost function/impurity
    • Larger values of the cost-complexity penalty result in less complex trees
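
A sketch of cost-complexity pruning using scikit-learn's ccp_alpha parameter: larger penalties yield smaller trees. The data are simulated for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for alpha in [0.0, 0.005, 0.02]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(alpha, pruned.get_n_leaves())   # leaf count shrinks as the penalty grows
```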
  • Feature engineering for decision trees is simpler than for other algorithms because there are very few pre-processing requirements
  • Bagging
    • Bootstrap aggregating
    • Fitting multiple versions of a prediction model
    • Combining them into an aggregated prediction
    • B bootstraps of the original training data are created
    • The model configuration is fit to each bootstrap sample
    • These individual fitted models are called the base learners
    • Final predictions are made by aggregating the predictions across all of the individual base learners
  • Bagging works especially well for flexible, high variance base learners
  • Among the statistical algorithms you know, decision trees are high variance
  • In contrast, bagging a linear model would likely not improve much upon the base learners' performance
  • With decision trees, optimal performance is often found by bagging 50-500 base learner trees
  • Final predictions

    Made by aggregating the predictions across all of the individual base learners
  • For regression
    Final predictions can be the average of base learner predictions
  • For classification
    Final predictions can either be the average of estimated class probabilities or majority class vote across individual base learners
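
A sketch of bagging by hand under these definitions: fit one tree per bootstrap sample, then aggregate by averaging the estimated class probabilities. Data are simulated; in practice scikit-learn's BaggingClassifier or a random forest wraps the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

B = 100                                          # number of bootstrap base learners
base_learners = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
    base_learners.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: average the predicted class probabilities across base learners
avg_proba = np.mean([tree.predict_proba(X) for tree in base_learners], axis=0)
y_hat = avg_proba.argmax(axis=1)
print((y_hat == y).mean())                       # training accuracy of the bagged ensemble
```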