Pattern recognition is a strong reason to use ML. However, the ability to recognize patterns only works when all of the following conditions apply:
The source data is untainted
The training and testing data are unbiased
The correct algorithms are selected
The model is created correctly
Any goals are clearly defined and verified against the training and test data
Classification
Classification uses of ML rely on patterns to distinguish between types of objects or data. For example, a classification system can detect the difference between a car, a pedestrian, and a street sign (at least, much of the time).
Many articles now demonstrate the ways in which an adversarial attack on a deep learning or ML application can render it nearly useless. So, ML works best in an environment where nothing unexpected happens, but in that environment, it works extremely well.
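To make that fragility concrete, here is a minimal sketch (the dataset, model choice, and step size are illustrative assumptions, not a method from this chapter) that trains a simple linear classifier and then nudges one sample along the model's weight vector, the basic mechanism behind many evasion-style adversarial examples:

```python
# A minimal sketch of an evasion-style adversarial nudge against a linear
# classifier; the dataset and step size are invented for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

sample = X[0].copy()
original = model.predict([sample])[0]

# Move the sample along the weight vector, toward the opposite class.
direction = np.sign(model.coef_[0]) * (1 - 2 * original)
adversarial = sample + 0.5 * direction  # 0.5 is an arbitrary step size

print("original:", original, "perturbed:", model.predict([adversarial])[0])
```

Whether a given step size flips the prediction depends on the sample's margin, but the point stands: a change too small for a person to notice can change the model's answer.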
Recommender systems are another form of ML that try to predict something based on past data. For example, recommender systems play a huge role in online stores, where they suggest items to go with other items a person has purchased. If you shop online often, you know from experience that recommender systems are wrong about the additional items more often than not.
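To make the idea concrete, here is a minimal item-to-item recommender sketch; the purchase matrix and item names are invented for illustration:

```python
# A minimal item-to-item recommender sketch based on cosine similarity.
# The purchase matrix and item names are invented for illustration.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

items = ["camera", "tripod", "memory card", "novel"]
# Rows are users, columns are items; 1 means the user bought the item.
purchases = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
])

# Similarity between item columns: items bought together score high.
similarity = cosine_similarity(purchases.T)

bought = 0  # the user just bought a camera
scores = similarity[bought].copy()
scores[bought] = 0  # don't recommend what was just purchased
print("Suggested item:", items[int(np.argmax(scores))])
```

Everything here rides on the purchase matrix, which is exactly why the data issues discussed next matter so much for this kind of system.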
To ensure that data remains secure, an organization must monitor and validate it prior to use, checking for problems such as data corruption, bias, and errors. When securing an ML application, it's also essential to review issues such as these:
Data bias: The data somehow favors a particular group or is skewed in a manner that produces an inaccurate analysis.
Model errors: Errors in the model give hackers a wedge into gaining access to the application, its model, or the underlying data.
Data corruption: The data may be complete, but some values are damaged, poorly formatted, or appear in an inconsistent form. For example, even when adding the correct state name to a dataset, Wisconsin could appear as WI, Wis, Wisc, Wisconsin, or some other legitimate but different form (see the validation sketch after this list).
Missing critical data: Some data is simply absent from the dataset, or has been replaced with a random value or a placeholder such as N/A or Null for numeric entries.
Errors in the data: The data is apparently present, but is incorrect in a manner that could cause the application to perform badly and lead the user to make bad decisions. Data errors are often the result of human data-entry problems rather than corruption from other sources, such as network errors. Hackers often introduce data errors with a purpose, such as entering scripts in place of values.
Algorithm correctness: Using the incorrect algorithm will create output that doesn’t meet analysis goals, even when the underlying data is correct in every possible manner.
Algorithm bias: The algorithm is designed in such a manner that it performs analysis incorrectly. This problem can also appear when weighting values are incorrect or the algorithm handles feedback values inappropriately. The bottom line is that the algorithm produces a result, but the result favors a particular group or outputs values that are skewed in some way.
Repeatable and verifiable results: ML applications aren’t useful unless they can produce the same results on different systems and it’s possible to verify those results in some way (even if verification requires the use of manual methods).
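Repeatability is partly an engineering discipline. A minimal sketch of one common safeguard, pinning every source of randomness to a fixed, recorded seed (the seed value and model choice here are arbitrary), looks like this:

```python
# A minimal reproducibility sketch: pin the random seeds so the same
# training run produces the same results on a different system.
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

SEED = 42  # arbitrary; what matters is that it is fixed and recorded
random.seed(SEED)
np.random.seed(SEED)

X, y = make_classification(n_samples=100, random_state=SEED)
model = RandomForestClassifier(random_state=SEED).fit(X, y)

# Rerunning this script elsewhere yields the same predictions, so the
# results can be verified independently.
print(model.predict(X[:5]))
```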
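Several of the data issues above are mechanical enough to check in code. The following minimal pandas sketch, with invented column names and cleanup rules, normalizes the inconsistent Wisconsin spellings (data corruption), treats placeholders such as N/A as missing critical data, and coerces an injected script to a missing value (errors in the data):

```python
# A minimal data-validation sketch; the column names and cleanup rules
# are invented for illustration.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "state": ["WI", "Wis", "Wisconsin", "Wisc", None],
    "amount": [10.5, None, "N/A", 42.0, "<script>alert(1)</script>"],
})

# Data corruption: map legitimate-but-different forms to one canonical form.
raw["state"] = raw["state"].replace(
    {"WI": "Wisconsin", "Wis": "Wisconsin", "Wisc": "Wisconsin"}
)

# Missing critical data: treat placeholder strings such as "N/A" as missing.
raw["amount"] = raw["amount"].replace("N/A", np.nan)

# Errors in the data: coercion turns values that are not numbers at all,
# such as an injected script, into NaN instead of letting them through.
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# Reject any record that still lacks a required value.
clean = raw.dropna(subset=["state", "amount"])
print(clean)
```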
Attacks:
Evasion: Bypassing the information security functionality built into a system.
Poisoning: Injecting false information into the application's data stream (see the sketch after this list).
Inference: Using data mining and analysis techniques to gain knowledge about the underlying dataset, and then using that knowledge to infer vulnerabilities in the associated application.
Trojans: Employing various techniques to create code or data that looks legitimate, but is really designed to take over the application or manipulate specific components of it.
Backdoors: Using system, application, or data stream vulnerabilities to gain access to the underlying system or application without providing the required security credentials.
Espionage: Stealing classified or sensitive data or intellectual property to gain an advantage over a person, group, or organization, or as part of a personal attack.
Sabotage: Performing deliberate and malicious actions to disrupt normal processes, so that even if the data isn’t corrupted, biased, or damaged in some way, the underlying processes don’t interact with it correctly.
Fraud: Relying on various techniques, such as phishing or communications from unknown sources, to undermine the system, application, or data security in a secretive manner. This level of access can allow for unauthorized or unpaid use of the application and influence ways in which the results are used, such as providing false election projections.
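Of these, poisoning is especially easy to demonstrate. This minimal sketch (the dataset and the 30% flip rate are invented for illustration) flips a fraction of the training labels and measures the damage:

```python
# A minimal label-flipping poisoning sketch; the dataset and the 30%
# flip rate are invented for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clean_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# The attacker injects false information by flipping 30% of the labels.
rng = np.random.default_rng(0)
poisoned = y_train.copy()
flip = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[flip] = 1 - poisoned[flip]

poisoned_acc = LogisticRegression().fit(X_train, poisoned).score(X_test, y_test)
print(f"clean: {clean_acc:.2f}, poisoned: {poisoned_acc:.2f}")
```

Even a crude attack such as random label flipping typically produces a measurable drop in accuracy; a targeted attack can do worse while remaining harder to spot.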
The target of such an attack may not even know that the attack compromised the ML application until the results demonstrate it. In fact, issues such as bias triggered by external dataset corruption can prove so subtle that the ML application continues to function in a compromised state without anyone noticing at all. Many attacks, such as privacy attacks, have a direct monetary motive rather than simple disruption.
Supervised learning
Supervised learning is the most popular and easiest-to-use ML paradigm. In this case, data takes the form of an example and label pair. The algorithm builds a mapping function between the example and its label so that when it sees other examples, it can identify them based on this function. Figure 1.1 provides you with an overview of how this process works:
Supervised learning is often used for certain types of classification, such as facial recognition, and for prediction, such as estimating how well an advertisement will perform based on past examples. This paradigm is susceptible to many attack vectors, including someone sending data with the wrong labels or supplying data that falls outside the model's intended usage.
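A minimal supervised learning sketch (the toy measurements, labels, and model choice are invented for illustration) shows the example-and-label pairing in action:

```python
# A minimal supervised learning sketch: examples paired with labels.
# The toy measurements and labels are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

# Each example is (height_cm, weight_kg); each label names the class.
examples = [[30, 4], [32, 5], [90, 300], [95, 320]]
labels = ["cat", "cat", "lion", "lion"]

# Build the mapping function between examples and their labels.
model = KNeighborsClassifier(n_neighbors=1).fit(examples, labels)

# The model identifies a new, unseen example based on that mapping.
print(model.predict([[33, 6]]))  # -> ['cat']
```

The classifier never saw [[33, 6]] during training; it identifies the new example purely through the mapping function it built from the labeled pairs.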
Unsupervised learning
When working with unsupervised learning, the algorithm is fed a large amount of data (usually more than is required for supervised learning) and the algorithm uses various techniques to organize, group, or cluster the data. An advantage of unsupervised learning is that it doesn’t require labels: the majority of the data in the world is unlabeled. Most people consider unsupervised learning as data-driven, contrasted with supervised learning, which is task-driven. The underlying strategy is to look for patterns, as shown in Figure 1.2:
Unsupervised learning is often used for recommender systems because such systems receive a constant stream of unlabeled data. You also find it used for tracking buying habits and grouping users into various categories. This paradigm is susceptible to a broad range of attack vectors, but data bias, data corruption, data errors, and missing data are at the top of the list.
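Here is a minimal unsupervised learning sketch (the synthetic data and cluster count are invented for illustration), in which k-means groups unlabeled points without ever seeing a label:

```python
# A minimal unsupervised learning sketch: clustering unlabeled data.
# The blob parameters are invented for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 unlabeled points drawn from three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The algorithm organizes the data into clusters with no labels at all.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])          # cluster assignment for each point
print(kmeans.cluster_centers_)      # the discovered group centers
```

Because nothing anchors the clusters to ground truth, quietly corrupted or biased input shifts the discovered groups without raising any obvious error, which is what makes the attack vectors listed above so effective here.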
Reinforcement learning
Reinforcement learning has a feedback loop built into it. The best way to view reinforcement learning is as a methodology where ML can learn from mistakes. To produce this effect, an agent, the algorithm performing the task, has a specific list of actions that it can take to affect an environment. The environment, in turn, can produce one of two signals as a result of the action. The first signals successful task completion, which reinforces a behavior in the agent. The second provides an environment state so that the agent can detect where errors have occurred.
You often see reinforcement learning used for video games, simulations, and industrial processes. Because you're linking two algorithms together, algorithm choice is a significant priority, and anything that affects the relationship between the two algorithms has the potential to provide an attack vector.
Feeding the agent incorrect state information will also cause this paradigm to fail.
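To make the loop concrete, here is a minimal tabular Q-learning sketch over an invented corridor environment (the states, rewards, and hyperparameters are all illustrative assumptions). The reward signal reinforces successful task completion, and the returned state lets the agent see the consequence of each action:

```python
# A minimal tabular Q-learning sketch in an invented corridor environment:
# states 0..4, the goal is state 4, and the agent moves left or right.
import random

import numpy as np

n_states = 5
actions = [-1, +1]                       # move left or move right
q_table = np.zeros((n_states, len(actions)))
alpha, gamma, epsilon = 0.5, 0.9, 0.3    # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # The agent chooses an action: mostly greedy, sometimes exploratory.
        if random.random() < epsilon:
            action = random.randrange(len(actions))
        else:
            action = int(np.argmax(q_table[state]))
        # The environment responds with a new state and a reward signal.
        new_state = max(0, min(n_states - 1, state + actions[action]))
        reward = 1.0 if new_state == n_states - 1 else 0.0
        # Feedback loop: the reward reinforces the behavior that produced it.
        q_table[state, action] += alpha * (
            reward + gamma * q_table[new_state].max() - q_table[state, action]
        )
        state = new_state

# Learned policy: action 1 (move right) at every non-terminal state.
print(np.argmax(q_table, axis=1))
```

Feeding the agent a wrong new_state value where the environment responds would corrupt exactly this feedback loop, which is why incorrect state information causes the paradigm to fail.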