DS_LESSON1

  • Machine learning (ML)

    A type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so
  • Machine learning algorithms
    • Use historical data as input to predict new output values
  • Use cases for machine learning
    • Recommendation engines
    • Fraud detection
    • Spam filtering
    • Malware threat detection
    • Business process automation (BPA)
    • Predictive maintenance
  • Why machine learning is important
    It gives enterprises a view of trends in customer behavior and business operational patterns, and it supports the development of new products
  • Types of machine learning
    • Supervised learning
    • Unsupervised learning
    • Semi-supervised learning
    • Reinforcement learning
  • Supervised learning
    Data scientists supply algorithms with labeled training data and define the variables they want the algorithm to assess for correlations
  • Unsupervised learning
    Algorithms train on unlabeled data, scanning through data sets looking for any meaningful connection
  • Semi-supervised learning
    A mix of supervised and unsupervised learning, involving a small amount of labeled training data that the algorithm can use to explore the data set on its own
  • Reinforcement learning
    Data scientists program an algorithm to complete a task and give it positive or negative cues as it works out how to complete the task
  • Supervised machine learning
    • Requires labeled inputs and desired outputs
    • Good for binary classification, multi-class classification, regression modeling, and ensembling
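    As an illustration of learning from labeled inputs and desired outputs, here is a minimal regression sketch. The data (hours studied and scores) and the function name fit_line are made up for this example; it uses ordinary least squares, which is one of many possible supervised methods.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled training data: inputs (hours) and desired outputs (scores).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

slope, intercept = fit_line(hours, scores)
predicted = slope * 5.0 + intercept  # prediction for an unseen input
```

    The fitted line is learned only from the labeled pairs, then reused to predict an output for an input that was not in the training data.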
  • Unsupervised machine learning
    • Does not require data to be labeled
    • Good for clustering, anomaly detection, association mining, and dimensionality reduction
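    Clustering can be sketched without any labels at all. The example below is a simplified one-dimensional k-means (Lloyd's algorithm); the points, the starting centers, and the name kmeans_1d are assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given starting centers (Lloyd's algorithm)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Unlabeled data: the algorithm is never told which group a point belongs to.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

    The two groups emerge from the distances between points alone, which is the sense in which unsupervised learning looks for structure in unlabeled data.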
  • Semi-supervised learning
    • Involves a small amount of labeled training data that the algorithm can use to explore the data set on its own
    • Used in areas like machine translation, fraud detection, and data labeling
  • Reinforcement learning
    • Involves programming an algorithm with a distinct goal and a set of rules, and rewarding/punishing it as it works to accomplish the goal
    • Used in robotics, video gameplay, and resource management
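    The reward/punish idea can be sketched with a very small value-update loop. This is a simplified, bandit-style illustration rather than a full reinforcement learning algorithm; the two actions, their rewards, and the learning rate are hypothetical.

```python
def train(rewards, episodes=20, learning_rate=0.5):
    """Learn action values from positive/negative reward signals."""
    values = {action: 0.0 for action in rewards}
    actions = list(rewards)
    for episode in range(episodes):
        # Try every action once first, then pick the best-valued one (greedy).
        if episode < len(actions):
            action = actions[episode]
        else:
            action = max(values, key=values.get)
        reward = rewards[action]  # +1 rewards the agent, -1 punishes it
        # Nudge the estimate toward the reward just received.
        values[action] += learning_rate * (reward - values[action])
    return values

# Hypothetical environment: the reward each action yields.
rewards = {"left": -1.0, "right": 1.0}
values = train(rewards)
best_action = max(values, key=values.get)
```

    After being punished for one action and rewarded for the other, the agent settles on the rewarded action.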
  • K-Nearest Neighbor (KNN)

    A simple supervised machine learning algorithm that classifies a new data point by its similarity to the available (stored) data
  • KNN algorithm

    • Stores all available data and classifies new data based on similarity
    • Can be used for both regression and classification, but mostly used for classification
    • Non-parametric and lazy learner algorithm
  • Use case for KNN
    • Identifying whether a creature is a cat or dog based on similar features
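    The cat-or-dog use case can be sketched as a from-scratch KNN classifier. The features (weight and ear length), the sample values, and the name knn_classify are assumptions made for this example.

```python
from collections import Counter
import math

def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Lazy learner: no model is fit; the stored data is searched at query time.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features: (weight in kg, ear length in cm).
training = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((25.0, 12.0), "dog"),
    ((30.0, 13.0), "dog"),
]

label = knn_classify(training, (4.5, 6.8))
```

    Nothing is learned ahead of time: all of the work happens when a new creature is compared against the stored examples, which is why KNN is called a lazy learner.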
  • Decision Tree Classification Algorithm
    • A supervised learning technique that can be used for both classification and regression problems, but mostly for classification
    • Graphical representation of all possible solutions to a problem/decision based on given conditions
  • Decision Tree
    • Contains decision nodes that make decisions and leaf nodes that are the output
    • Uses the CART algorithm to build the tree
    • Asks questions and follows the yes/no answers to split the tree into subtrees
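    To show how a yes/no question can be scored when building the tree, the sketch below computes Gini impurity, the measure CART commonly uses for classification splits. The rows, the salary feature, and the function names are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, threshold):
    """Weighted Gini impurity after asking: is row[feature] <= threshold?"""
    yes = [label for features, label in rows if features[feature] <= threshold]
    no = [label for features, label in rows if features[feature] > threshold]
    total = len(rows)
    return len(yes) / total * gini(yes) + len(no) / total * gini(no)

# Hypothetical rows: (features, label).
rows = [
    ({"salary": 40}, "decline"),
    ({"salary": 45}, "decline"),
    ({"salary": 70}, "accept"),
    ({"salary": 80}, "accept"),
]
good_split = split_impurity(rows, "salary", 50)  # separates the labels cleanly
poor_split = split_impurity(rows, "salary", 40)  # leaves one side mixed
```

    A tree builder would try many candidate questions and keep the one with the lowest weighted impurity, then repeat the process on each subtree.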
  • Decision trees mimic human thinking ability and are easy to understand
  • Decision Tree Terminologies
    • Root Node
    • Leaf Node
    • Splitting
    • Branch/Sub Tree
    • Pruning
    • Parent/Child node
  • Decision trees are useful for solving decision-related problems and require less data cleaning compared to other algorithms
  • Parent node
    A node that is split into sub-nodes; the root node is the topmost parent node
  • Child nodes
    The sub-nodes produced when a parent node is split
  • Example
    • Candidate with a job offer deciding whether to accept or not
  • Decision tree
    1. Root node (Salary attribute)
    2. Next decision node (distance from the office)
    3. Leaf node
    4. Next decision node (Cab facility)
    5. Leaf node
    6. Leaf nodes (Accepted offer, Declined offer)
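    The job-offer tree above can be sketched as nested conditions. The thresholds (50,000 for salary, 10 km for distance) and the direction of each branch are assumptions for illustration; the cards only name the attributes.

```python
def decide_offer(salary, distance_km, has_cab):
    """Walk the job-offer tree from the root node down to a leaf node."""
    # Root node: salary attribute (50_000 is a made-up threshold).
    if salary < 50_000:
        return "Declined"  # leaf node
    # Decision node: distance from the office (10 km is a made-up threshold).
    if distance_km > 10:
        return "Declined"  # leaf node
    # Decision node: cab facility; both outcomes are leaf nodes.
    return "Accepted" if has_cab else "Declined"

decision = decide_offer(salary=60_000, distance_km=5, has_cab=True)
```

    Each call follows exactly one path from the root to a leaf, which is what makes a decision tree easy to read and explain.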
  • Advantages of Decision Tree
    • Simple to understand
    • Useful for solving decision-related problems
    • Helps to think about all possible outcomes
    • Less requirement of data cleaning
  • Disadvantages of Decision Tree
    • Contains lots of layers, making it complex
    • May have overfitting issue
    • Computational complexity increases as the number of class labels grows
  • Random Forest is a popular machine learning algorithm that uses the supervised learning technique
  • Random Forest
    A classifier that builds a number of decision trees on various subsets of the given dataset and combines their predictions (by majority vote or averaging) to improve predictive accuracy
  • A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting
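    The idea of many trees trained on different subsets can be sketched as follows. The one-threshold "stump" is a hypothetical stand-in for a full decision tree, and the data, the seed, and the function names are assumptions for this example. Votes are combined by majority here, which is the usual choice for classification; regression forests average instead.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one tree's training subset."""
    return [rng.choice(rows) for _ in rows]

def train_stump(sample):
    """Hypothetical stand-in for a decision tree: a one-threshold classifier."""
    lows = [x for x, label in sample if label == "low"]
    highs = [x for x, label in sample if label == "high"]
    if not lows or not highs:
        only = sample[0][1]  # the subset held a single class
        return lambda x: only
    # Put the threshold halfway between the two class means.
    threshold = (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2
    return lambda x: "high" if x > threshold else "low"

def forest_predict(trees, x):
    """Combine the trees' predictions by majority vote."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
trees = [train_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
prediction = forest_predict(trees, 8.5)
```

    An individual stump trained on an unlucky subset can be wrong, but the vote across many stumps smooths those errors out.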
  • Why use Random Forest
    • Takes less training time
    • Predicts output with high accuracy, even for large datasets
    • Maintains accuracy when a large proportion of data is missing
  • Applications of Random Forest
    • Banking: Identification of loan risk
    • Medicine: Identify disease trends and risks
    • Land Use: Identify areas of similar land use
    • Marketing: Identify marketing trends
  • Advantages of Random Forest
    • Capable of performing both Classification and Regression tasks
    • Capable of handling large datasets with high dimensionality
    • Enhances the accuracy of the model and reduces the risk of overfitting
  • Disadvantages of Random Forest
    • Less suitable for regression tasks than for classification tasks
  • Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems
  • Goal of SVM algorithm
    To create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can easily be put in the correct category in the future
  • Hyperplane
    The best decision boundary
  • Support vectors
    The extreme points/vectors that help in creating the hyperplane
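    The hyperplane and support vectors can be illustrated with a small sketch. The weights w and bias b are assumed to come from an already trained linear SVM; they and the sample points are made up for this example, and no training is performed.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical, already trained hyperplane: x1 + x2 - 4 = 0.
w, b = (1.0, 1.0), -4.0

# Hypothetical points with their true class (-1 or +1).
points = [((1.0, 1.0), -1), ((1.5, 1.5), -1), ((2.5, 2.5), 1), ((4.0, 3.0), 1)]

# Classify by the side of the hyperplane each point falls on.
predictions = [1 if decision_value(w, b, x) >= 0 else -1 for x, _ in points]

# Support vectors are the points closest to the boundary, where |w.x + b| = 1.
support_vectors = [
    x for x, _ in points
    if math.isclose(abs(decision_value(w, b, x)), 1.0)
]

# Width of the margin between the two classes.
margin = 2.0 / math.hypot(*w)
```

    Only the two support vectors determine where the boundary sits; moving the other points farther away would not change it.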
  • Linear SVM

    Used for linearly separable data
  • Non-linear SVM

    Used for data that is not linearly separable
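    One way to picture the non-linear case is to lift the data into a higher-dimensional space where a straight boundary works. The explicit feature map below is an illustrative assumption (practical SVMs usually rely on kernel functions to the same effect), and the data and boundary are made up.

```python
def feature_map(x):
    """Lift a 1-D input to 2-D so a straight line can separate the classes."""
    return (x, x * x)

# 1-D data that no single threshold separates: class +1 sits on both sides of class -1.
data = [(-3.0, 1), (-2.0, 1), (-0.5, -1), (0.0, -1), (0.5, -1), (2.0, 1), (3.0, 1)]

# Hypothetical linear boundary in the lifted space: x^2 - 2 = 0.
w, b = (0.0, 1.0), -2.0

def classify(x):
    """Apply the linear boundary to the lifted version of x."""
    z = feature_map(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score >= 0 else -1

predictions = [classify(x) for x, _ in data]
```

    In the original one-dimensional space the classes cannot be split by a single cut, but after the lift a linear boundary separates them.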