Machine learning
A type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so
Machine learning algorithms
Use historical data as input to predict new output values
Use cases for machine learning
Recommendation engines
Fraud detection
Spam filtering
Malware threat detection
Business process automation (BPA)
Predictive maintenance
Why machine learning is important
It gives enterprises a view of trends in customer behavior and business operational patterns, and it supports the development of new products
Types of machine learning
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Supervised learning
Data scientists supply algorithms with labeled training data and define the variables they want the algorithm to assess for correlations
Unsupervised learning
Algorithms train on unlabeled data, scanning through data sets looking for any meaningful connection
Semi-supervised learning
A mix of supervised and unsupervised learning, involving a small amount of labeled training data that the algorithm can use to explore the data set on its own
Reinforcement learning
Data scientists program an algorithm to complete a task and give it positive or negative cues as it works out how to complete the task
Supervised machine learning
Requires labeled inputs and desired outputs
Good for binary classification, multi-class classification, regression modeling, and ensembling
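A minimal supervised-learning sketch in Python with scikit-learn; the breast-cancer dataset, logistic-regression model, and train/test split are illustrative assumptions, not part of these notes:

```python
# Supervised learning: train on labeled data, then predict labels for new data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled data: X holds the input features, y holds the known output labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)   # the data scientist chooses the model and variables
model.fit(X_train, y_train)                 # learn correlations from labeled examples

predictions = model.predict(X_test)         # predict labels for unseen data
print("Accuracy:", accuracy_score(y_test, predictions))
```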
Unsupervised machine learning
Does not require data to be labeled
Good for clustering, anomaly detection, association mining, and dimensionality reduction
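A minimal unsupervised-learning sketch (clustering); the synthetic blob data and the choice of three clusters are illustrative assumptions:

```python
# Unsupervised learning: only the features are supplied, no target labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # the algorithm finds groupings on its own

print(labels[:10])                 # cluster assignment for the first 10 points
```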
Semi-supervised learning
Involves a small amount of labeled training data that the algorithm can use to explore the data set on its own
Used in areas like machine translation, fraud detection, and data labeling
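A minimal semi-supervised sketch using scikit-learn's LabelPropagation; the digits dataset and the roughly 10% labeled split are illustrative assumptions:

```python
# Semi-supervised learning: a small labeled set plus many unlabeled points (marked -1).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelPropagation

X, y = load_digits(return_X_y=True)

rng = np.random.RandomState(0)
y_partial = np.copy(y)
unlabeled_mask = rng.rand(len(y)) > 0.1     # keep labels for ~10% of the data
y_partial[unlabeled_mask] = -1              # -1 marks an unlabeled sample

model = LabelPropagation()
model.fit(X, y_partial)                     # learns from labeled and unlabeled points together

print("Accuracy on originally unlabeled points:",
      (model.transduction_[unlabeled_mask] == y[unlabeled_mask]).mean())
```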
Reinforcement learning
Involves programming an algorithm with a distinct goal and a set of rules, and rewarding/punishing it as it works to accomplish the goal
Used in robotics, video gameplay, and resource management
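A minimal reinforcement-learning sketch (tabular Q-learning) on a toy corridor world; the environment, the +1 reward at the goal, and the hyperparameters are all hypothetical, chosen only to illustrate the positive/negative-cue loop:

```python
# Tabular Q-learning: the agent learns which action to take in each state
# by repeatedly acting and receiving a reward signal.
import random

random.seed(0)
N_STATES = 6            # states 0..5; the goal is state 5
ACTIONS = [-1, +1]      # move left or right
q_table = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.3

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally (or on ties), otherwise exploit the best-known action.
        if random.random() < epsilon or q_table[state][0] == q_table[state][1]:
            action = random.randrange(2)
        else:
            action = 0 if q_table[state][0] > q_table[state][1] else 1
        next_state = max(0, min(N_STATES - 1, state + ACTIONS[action]))
        reward = 1.0 if next_state == N_STATES - 1 else 0.0   # positive cue at the goal
        # Q-learning update: move the estimate toward reward + discounted future value.
        q_table[state][action] += alpha * (
            reward + gamma * max(q_table[next_state]) - q_table[state][action]
        )
        state = next_state

print("Learned action per state (0=left, 1=right):",
      [q.index(max(q)) for q in q_table])
```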
K-Nearest Neighbor (KNN)
A simple machine learning algorithm based on supervised learning that assumes similarity between new data and available data to classify the new data
KNN algorithm
Stores all available data and classifies new data based on similarity
Can be used for both regression and classification, but mostly used for classification
A non-parametric, lazy-learner algorithm: it does not learn from the training set immediately but stores the data and classifies new points at prediction time
Use case for KNN
Identifying whether a creature is a cat or dog based on similar features
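A minimal KNN sketch with scikit-learn; the iris dataset and k=5 are illustrative stand-ins for the cat/dog feature example above:

```python
# K-Nearest Neighbors: classify a new point by the majority class of its k closest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # a lazy learner: fit mostly just stores the training data

print("Test accuracy:", knn.score(X_test, y_test))
```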
Decision Tree Classification Algorithm
A supervised learning technique that can be used for both classification and regression problems, but mostly for classification
Graphical representation of all possible solutions to a problem/decision based on given conditions
Decision Tree
Contains decision nodes, which are used to make decisions and have multiple branches, and leaf nodes, which are the final outputs and have no further branches
Uses the CART (Classification and Regression Tree) algorithm to build the tree
Asks a yes/no question at each decision node and splits the tree into subtrees based on the answer
Decision trees mimic human thinking ability and are easy to understand
Decision Tree Terminologies
Root Node
Leaf Node
Splitting
Branch/Sub Tree
Pruning
Parent/Child node
Decision trees are useful for solving decision-related problems and require less data cleaning compared to other algorithms
Parent node
A node that splits into one or more sub-nodes
Child nodes
The sub-nodes produced when a parent node is split
Example
Candidate with a job offer deciding whether to accept or not
Decision tree
1. Root node (Salary attribute)
2. Leaf node (Declined offer, if the salary condition is not met)
3. Next decision node (Distance from the office)
4. Leaf node (Declined offer, if the office is too far)
5. Next decision node (Cab facility)
6. Final leaf nodes (Accepted offer, Declined offer)
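A minimal sketch of this job-offer tree with scikit-learn's CART-based DecisionTreeClassifier; the tiny hand-made dataset and feature names below are hypothetical, chosen only to mirror the salary/distance/cab attributes above:

```python
# Decision tree classification: learn yes/no splits that separate the two outcomes.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_ok, office_nearby, cab_provided] (1 = yes, 0 = no)
X = [
    [0, 1, 1],  # low salary           -> decline
    [1, 0, 0],  # far office, no cab   -> decline
    [1, 0, 1],  # far office, has cab  -> accept
    [1, 1, 0],  # good salary, nearby  -> accept
]
y = ["Declined", "Declined", "Accepted", "Accepted"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Print the learned splits, analogous to the root/decision/leaf nodes above.
print(export_text(tree, feature_names=["salary_ok", "office_nearby", "cab"]))
print(tree.predict([[1, 0, 1]]))    # a new candidate scenario
```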
Advantages of Decision Tree
Simple to understand
Useful for solving decision-related problems
Helps to think about all possible outcomes
Less requirement of data cleaning
Disadvantages of Decision Tree
Contains lots of layers, making it complex
May have overfitting issue
Computational complexity increases as the number of class labels grows
Random Forest is a popular machine learning algorithm that belongs to the supervised learning category
Random Forest
A classifier that contains a number of decision trees built on various subsets of the given dataset and combines their predictions (majority vote for classification, average for regression) to improve predictive accuracy
A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting
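A minimal random-forest sketch with scikit-learn; the dataset, the choice of 200 trees, and the train/test split are illustrative assumptions:

```python
# Random Forest: an ensemble of decision trees, each trained on a random subset
# of rows and features; the forest combines their votes for the final prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```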
Why use Random Forest
Takes less training time
Predicts output with high accuracy, even for large datasets
Maintains accuracy when a large proportion of data is missing
Applications of Random Forest
Banking: Identification of loan risk
Medicine: Identify disease trends and risks
Land Use: Identify areas of similar land use
Marketing: Identify marketing trends
Advantages of Random Forest
Capable of performing both Classification and Regression tasks
Capable of handling large datasets with high dimensionality
Enhances the accuracy of the model and prevents the overfitting issue
Disadvantages of Random Forest
Although it can perform regression, it is less suitable for regression tasks than for classification
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms; it is used for classification as well as regression problems
Goal of SVM algorithm
To create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future
Hyperplane
The best decision boundary
Support vectors
The extreme points/vectors that help in creating the hyperplane
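A minimal SVM sketch with scikit-learn's SVC; the iris dataset, linear kernel, and scaling step are illustrative assumptions:

```python
# SVM: find the hyperplane (decision boundary) with the largest margin;
# the training points closest to that boundary are the support vectors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm.fit(X_train, y_train)

print("Support vectors per class:", svm.named_steps["svc"].n_support_)
print("Test accuracy:", svm.score(X_test, y_test))
```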