Usually, everything starts at the SOC (Security Operations Centre). Here, a team of analysts monitors the security of the organisation by watching events in the organisation's estate. If an event is anomalous or unexpected, an alert is generated. Alerts can be false positives, so analysts investigate them further. If the alert is real, the team performs a triage process to determine its severity. If the severity is sufficient, an incident is raised.
The SOC can therefore be seen as the filter. Not all events make it to incidents. For example, organisations often receive thousands of phishing emails every day. Most of these are automatically blocked by preventative controls such as the spam filter. Even if a user were to interact with one of these emails and execute malware, for example, the Antivirus (AV) or Endpoint Detection and Response (EDR) software would usually block it automatically. In these cases, an alert is generated and the SOC team deals with it, for example by updating mail filtering rules or the signatures of the AV or EDR.
An incident is raised when, during triage, we discover that the alert may still have further impact and that we don't have all of the information required to deal with it. For example, say an alert was generated for an anomalous logon to one of our servers. Several questions still need answering:
Whose account was used?
Where did the logon occur from?
Where was that account being used before the logon?
Has there been any other potentially anomalous activity seen with that account?
Incident Response and Management
When an alert's severity is high enough to become an incident, that is where Incident Response and Incident Management usually kick in. Often, these two are combined and simply called Incident Response. However, there are distinct features to both of these that are worth discussing.
Incident Response
Incident Response covers the technical aspect of dealing with an incident. This is the portion that is responsible for answering the primary question: What happened?
IR:
EDR or AV Alert - Usually these tools would create an alert for anomalous activity that has occurred on a specific host. For example, the EDR could alert that there were attempts made to monitor the keystrokes of a user.
Network Tap Alert - Network taps provide alerts for anomalous network activity. For example, there could be an alert that a host is scanning other hosts in the estate.
SIEM Alert - The Security Information and Event Management (SIEM) system could alert on a custom rule that was created by the analysts. For example, an impossible travel rule that fires when a user's account logs in from two different countries at almost the same time.
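As an illustration of the impossible travel rule, the sketch below flags a user's account logging on from two different countries within a short window. It is a minimal sketch rather than a real SIEM rule: the Logon record, the find_impossible_travel function, and the one-hour window are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Logon:
    """Hypothetical logon event; a real SIEM would expose many more fields."""
    user: str
    country: str
    timestamp: datetime


def find_impossible_travel(logons, window=timedelta(hours=1)):
    """Return (earlier, later) pairs of consecutive logons for the same user
    that come from different countries within `window` of each other."""
    findings = []
    last_by_user = {}
    for logon in sorted(logons, key=lambda event: event.timestamp):
        previous = last_by_user.get(logon.user)
        if (
            previous is not None
            and previous.country != logon.country
            and logon.timestamp - previous.timestamp <= window
        ):
            findings.append((previous, logon))
        last_by_user[logon.user] = logon
    return findings


start = datetime(2024, 1, 1, 9, 0)
sample_logons = [
    Logon("alice", "GB", start),
    Logon("alice", "BR", start + timedelta(minutes=10)),  # suspicious
    Logon("bob", "GB", start),
    Logon("bob", "GB", start + timedelta(minutes=5)),     # same country
    Logon("carol", "GB", start),
    Logon("carol", "US", start + timedelta(hours=12)),    # plausible travel
]

for earlier, later in find_impossible_travel(sample_logons):
    print(f"{later.user}: {earlier.country} -> {later.country}")
```

Only consecutive logons per user are compared, which keeps the rule simple; a production rule would likely also account for VPN exit nodes and the actual distance between locations.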
When an alert is created, the analyst is given a lot of information, and the first step is to investigate it to better understand what is happening. These systems attach other key pieces of information to the alert. For example, in the case of the SIEM alert, the analyst would be able to review not only the latest logon events for the user's account, but also the history of its logon events for the last couple of months.
Sometimes the alert information is not sufficient and we have to gather more information than what is currently provided. This process is usually referred to as Digital Forensics. We perform a more hands-on investigation that can include the following:
Recovering the hard disk from the infected host to investigate how the malware got on there in the first place.
Recovering the data from volatile memory (such as from the computer's RAM) from the infected host to investigate how the malware works.
Recovering system and network logs from several devices to uncover how the malware spread.
Incident Management
Incident Management covers the process aspect of dealing with an incident. This is the portion that is responsible for answering the primary question:
How do we respond to what happened?
Once we understand the scope of the incident, the next question is how we will manage the incident. Incident Management has to take care of several things, such as:
Triaging the incident to accurately update the severity of the incident as new information becomes available and getting more stakeholders involved to help deal with the incident, such as Subject Matter Experts (SMEs).
Guiding the incident actions through the use of playbooks.
Deciding which containment, eradication, and recovery actions will be taken to deal with the incident.
Deciding the communication that will be sent internally and externally while the team deals with the incident.
Documenting the information about the incident, such as the actions taken and the effect that they had on dealing with the incident.
Closing the incident and taking the information to learn from the incident and improve future processes and procedures.
Level 1: SOC Incident
These are often not even classified as incidents. Usually, these require a purely technical approach. At this level, upon investigation of our phishing example, the analyst finds that it is an isolated event and therefore simply updates the mail filtering rules to block the sender. Incidents at this level can happen several times a day, are usually quick to deal with, and are handled by the analyst alone.
However, in our example, a Computer Emergency Readiness Team (CERT) Incident may be invoked if the investigation found that several users received the email.
Level 2: CERT Incident
At level two, several analysts in the SOC may be involved in the investigation. A CERT Incident is one where we don't yet have enough to raise the alarm bells. Still, we are concerned and therefore perform additional investigation to determine the scope of the incident. Usually, the analyst would request assistance and more members of the SOC team would get involved. In our example, at this point, we would be investigating whether any of those users interacted with the email. We would also like to better understand what the email does.
If we were able to stop the incident before any of the users interacted with the email, we would usually stop at this level. However, if we discover that the email contains malware and that some of the users actually interacted with the email, we would invoke a Computer Security Incident Response Team (CSIRT) incident.
Level 3: CSIRT Incident
At level three, the entire SOC is placed on high alert and focuses on resolving this single incident. Analysts and the forensic team work to uncover the full scope of the incident, while the management team takes action against the threat actor to contain the spread of the malware, eradicate it from hosts where it is discovered, and recover affected systems.
If the team is able to stop the spread of the attack before any disruptions can occur or the threat actor can escalate their privileges within the estate, the CSIRT will close the incident. However, if investigation determines that the scope is larger, we would invoke a Crisis Management Team (CMT) Incident.
Level 4: CMT Incident
At level four, it is all hands on deck: this is officially a full-scale cyber crisis. The CMT would usually consist of several key business stakeholders such as the entire executive suite, members from the legal and communication teams, as well as other external parties, such as the regulator or police. At this level, we start to move into the territory of what is called "nuclear" actions. Rather than simple actions to contain, eradicate, and recover, this team can authorise the use of nuclear actions, such as taking the entire organisation offline to limit the incident's damage.
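The escalation path in the phishing example can be summarised in a short sketch. The function name, its parameters, and the exact thresholds (for instance, treating more than one recipient as "several users") are assumptions for illustration only; an organisation would define its own criteria.

```python
from enum import IntEnum


class IncidentLevel(IntEnum):
    SOC = 1    # isolated event handled by a single analyst
    CERT = 2   # several analysts investigate the scope
    CSIRT = 3  # the entire SOC works on the incident
    CMT = 4    # full-scale crisis with business stakeholders


def classify_phishing_incident(recipients, malware_confirmed,
                               users_interacted,
                               disruption_or_privilege_escalation):
    """Map findings from the phishing example to an incident level.

    Checks run from most to least severe so the highest matching level wins.
    """
    if disruption_or_privilege_escalation:
        return IncidentLevel.CMT
    if malware_confirmed and users_interacted > 0:
        return IncidentLevel.CSIRT
    if recipients > 1:  # assumption: "several users" means more than one
        return IncidentLevel.CERT
    return IncidentLevel.SOC


level = classify_phishing_incident(
    recipients=12,
    malware_confirmed=True,
    users_interacted=3,
    disruption_or_privilege_escalation=False,
)
print(level.name)
```

Re-running the classification as new information arrives mirrors how the severity of an incident is updated during triage.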
Incident Management Process:
IM: Preparation
Preparation is key to dealing with an incident effectively. Incidents are often stressful, and every minute counts: the faster the incident is dealt with, the less damage it causes. In these stressful environments, it is easy to forget things, which could have severe consequences.
In order to prevent this, a team has to prepare to deal with an incident. The better the team is prepared, the less likely it is that simple mistakes will be made during the incident.
In order to prepare, there are several things that the team can do, such as:
Identify and document key stakeholders and call trees that will be used during an incident
Create and update playbooks that aid the team in following a set process for incidents with a known nature
Exercise the team's ability to deal with an incident through tabletop exercises and cyber war games
Continuously perform threat hunting to help create new alert rules based on modern attacker techniques
IM: Detection and Analysis
Often organisations will split the detection and analysis phases into two. This is to introduce a middle step called triaging. As mentioned before, not all alerts will be classified as an incident, and even if an incident occurs, there are different levels of incidents. The triage step is responsible for determining the severity of the incident. However, in the NIST framework, triage is incorporated into the detection and analysis phase.
This is the primary phase for incident response, where we aim to answer the question of what has happened. During this phase, the blue team works to better understand the scope of the incident and provide this information to the incident manager. This can include actions such as the following:
Reviewing alerts in the AV, EDR, and SIEM dashboards
Performing a forensic investigation of artefacts both on systems and the network
Analysing malware that is discovered to better understand how it works and create new signatures that can be used to identify it
IM: Containment, Eradication, and Recovery
Once the scope of the incident is better understood, the team will start with containment, eradication, and recovery. This is the primary phase of incident management, where we try to deal with the incident:
Containment - Actions taken to "stop the bleed". These are actions meant to stop the incident from growing larger.
Eradication - Actions taken to eradicate the threat actor from the estate.
Recovery - Actions taken to recover the environment and allow the organisation to go back to Business as Usual (BAU).
The order matters. If you start eradication or recovery before containment, the threat actor will be able to persist. For example, if the threat actor compromised Active Directory and we simply changed each account's password (an eradication action), the threat actor could leverage their current permissions to recover the credentials again. We would first have to ensure that we have closed off the threat actor's access before taking other actions.
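One way to avoid this mistake is to sort the planned actions by phase before executing them, so that containment always comes first. The sketch below assumes a hypothetical plan of (phase, description) pairs; the action descriptions are made up for the Active Directory example.

```python
# Lower numbers are executed first.
PHASE_ORDER = {"containment": 0, "eradication": 1, "recovery": 2}


def order_actions(actions):
    """Sort (phase, description) pairs so containment precedes eradication,
    which precedes recovery. Actions within a phase keep their given order."""
    return sorted(actions, key=lambda action: PHASE_ORDER[action[0]])


plan = [
    ("eradication", "Reset every account's password"),
    ("recovery", "Restore affected servers"),
    ("containment", "Cut off the threat actor's access to Active Directory"),
]

for phase, description in order_actions(plan):
    print(f"{phase}: {description}")
```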
IM: Post-Incident Activity
Once an incident has been closed, that isn't the end of the incident management process. As a last step, we want to evaluate what happened during the incident in order to learn lessons and improve how we deal with incidents in the future. As such, we learn from these incidents to better prepare ourselves to deal with the next one.
IR/IM Pitfalls:
Insufficient Hardening
Insufficient Hardening is something that happens even before the incident. Once a solution has been deployed, the organisation simply moves on to the next one. However, in security engineering, there is an important step called Hardening. Once a solution is deployed, there may still be some configurations that did not adhere to security best practices but were performed to get the solution up and running faster. The hardening process reverses these configurations to bring them back in line with security best practices.
IR/IM Pitfalls: Insufficient Logging
In order for the blue team to be alerted to incidents, they first have to receive the relevant information that can result in events and alerts. Organisations often do not log enough of this information. This can be seen as "flying blind", since the blue team would not even know that an incident is occurring.
One problem is the cost of ingesting log information. Often SIEM providers will charge clients based on data throughput. This results in organisations limiting the volume of logs that are ingested. It is also often costly to have remote devices, such as ATMs, send their log information over a mobile network. All of this can lead to reduced visibility for the blue team. Although some of this log information will be available on the device itself, retention is often reduced and, in the worst cases, a threat actor might have removed these local logs.
IR/IM Pitfalls:
Insufficient- and Over-Alerting
If an alert generates too much noise by having too many false positives, it can lead to the team ignoring the alert. This is similar to the "cry wolf" situation. When an actual incident does raise the alert, the team could simply ignore it until the impact is already severe.
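A simple way to spot alerts at risk of being ignored is to track how often each rule turns out to be a false positive. The sketch below assumes triage outcomes are available as hypothetical (rule name, was true positive) pairs; the 90% threshold and the minimum of 20 alerts are arbitrary values chosen for illustration.

```python
from collections import Counter


def noisy_rules(triaged_alerts, max_false_positive_ratio=0.9, min_alerts=20):
    """Return {rule: false_positive_ratio} for rules that have fired at least
    `min_alerts` times and whose ratio exceeds `max_false_positive_ratio`."""
    totals = Counter()
    false_positives = Counter()
    for rule, was_true_positive in triaged_alerts:
        totals[rule] += 1
        if not was_true_positive:
            false_positives[rule] += 1

    report = {}
    for rule, total in totals.items():
        ratio = false_positives[rule] / total
        if total >= min_alerts and ratio > max_false_positive_ratio:
            report[rule] = ratio
    return report


# Made-up triage history: one very noisy rule and one reasonably tuned rule.
history = (
    [("usb_inserted", False)] * 48
    + [("usb_inserted", True)] * 2
    + [("impossible_travel", True)] * 15
    + [("impossible_travel", False)] * 10
)

for rule, ratio in noisy_rules(history).items():
    print(f"{rule}: {ratio:.0%} false positives, consider tuning")
```

Rules that appear in this report are candidates for tuning rather than removal, since the underlying activity may still matter.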
IR/IM Pitfalls: Insufficient Determination of Incident Scope
In cases where the incident scope is underestimated, the actions taken against the threat actor would not be sufficient to eradicate them from the system. In cases where the incident scope is overestimated, drastic actions could be taken by the team that would result in unnecessary business disruptions.
Continuous preparation for incidents is required to upskill the team and help address this issue.
IR/IM Pitfalls: Insufficient Accountability
Another problem during incidents is inaction. It is incredibly important to understand that there is a difference between discussing containment, eradication, and recovery actions and performing them. Often during incidents, actions will be discussed, but no one person will be made responsible for actually performing the action. This then often leads to the incident growing as everyone thinks something has already been performed, when in fact, it hasn't.
Effective Incident Management and note-taking can help address this issue. The incident manager can document the actions that are taken and ensure that a responsible individual is nominated to not only perform the action, but provide the manager with feedback once the action has been taken.
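The note-taking described above can be as simple as a list of actions, each with a named owner and a place for feedback. The sketch below is one possible shape for such a record; the Action class, its fields, and the status names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """One containment, eradication, or recovery action in the incident log."""
    description: str
    owner: Optional[str] = None      # the individual nominated to perform it
    feedback: Optional[str] = None   # what the owner reported back

    @property
    def status(self):
        if self.owner is None:
            return "unassigned"
        if self.feedback is None:
            return "awaiting feedback"
        return "done"


def outstanding(actions):
    """Return every action that was discussed but not confirmed as performed."""
    return [action for action in actions if action.status != "done"]


incident_log = [
    Action("Isolate the infected host", owner="Analyst A",
           feedback="Host isolated via EDR"),
    Action("Block the sender on the mail gateway", owner="Analyst B"),
    Action("Reset the affected user's password"),
]

for action in outstanding(incident_log):
    print(f"{action.status}: {action.description}")
```

Reviewing the outstanding list at each incident call makes it visible when an action has only been discussed.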
IR/IM Pitfalls:
Insufficient Backups
In the event that an incident results in disruptive actions such as ransomware being deployed, the only saving grace is backups that can be used to recover the estate. However, if backup processes and policies were not clearly established and followed, it would not be possible to recover from the incident.
Furthermore, backups are sometimes not sufficiently isolated. In modern times, where the primary focus is on availability, legacy backups are often removed in favour of new High Availability Disaster Recovery (DR) environments. The issue with this is that if ransomware executes on the main system, its effects are replicated to the DR environment as well. Therefore, offline and remote backups are just as important today.
Volatility of Evidence
The biggest mistake made during incidents is shutting the host down. This is wrong for the following two main reasons:
A significant amount of important evidence is found in volatile spaces, meaning it is lost as soon as the device loses power
It immediately alerts the threat actor that we might be on to them, meaning they might start a more disruptive attack
The second reason also means that, as a first step, we should not even disable network access on the host, as this can have the same effect.
We want to make sure that evidence is preserved. We also want to ensure that we preserve evidence in order of volatility. While a digital forensics analyst will usually be involved in capturing most of this evidence, it is important to be aware of the different types of evidence and why we must do everything in our power to preserve it.
Registers and Cache
Registers and cache are extremely volatile and constantly changing as the host executes different applications. This data can change in a split second. While we would never be fast enough to capture this evidence at the exact moment of becoming aware of the incident, we should capture it as soon as possible if we intend to use it. This evidence can be vital for malware analysis to understand what the malware did on the host. In practice, most incidents proceed without this information, as it is simply too volatile.
Routing Table, ARP Cache, Process Table, Kernel Statistics and Memory
We must be aware that more hosts might have been infected. We want an understanding not just of this host, but also of whether it communicated with any other hosts in the network. Therefore, we need to capture information such as the routing and ARP tables. Routes and ARP entries have a specific time-to-live, meaning that if we are unable to capture this data in time, we might not have the full picture of what network communication took place at the time the incident occurred. These can be captured from the host itself.
Regarding the actual suspected host, we want to better understand what applications were running and what they were doing at the time of the incident. Therefore, we have to capture information about the processes that were executing at the time of the incident.
If we want to truly understand what a program is and does, we will have to collect it from memory. This means that we will need to capture evidence from the Random Access Memory (RAM). However, the information located here can be lost if there is a brownout or if the power is turned off. Malware has become incredibly advanced and can stage its different payloads, meaning that even if we have a sample of the malware to execute in a sandbox, we cannot truly understand what it was doing on the host without analysing it directly in memory.
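Preserving evidence in order of volatility can be expressed as a simple sort. The sketch below ranks the evidence types discussed above, with the hard disk added as the least volatile source; the source names and numeric ranks are assumptions for illustration, and a real collection plan would be driven by the forensic team's procedures.

```python
# Lower rank means more volatile, so it should be collected earlier.
VOLATILITY_RANK = {
    "registers": 0,
    "cache": 0,
    "routing_table": 1,
    "arp_cache": 1,
    "process_table": 1,
    "kernel_statistics": 1,
    "memory": 1,
    "disk": 2,
}


def collection_order(sources):
    """Return the given evidence sources sorted from most to least volatile.

    Sources with the same rank keep the order in which they were given.
    """
    unknown = [source for source in sources if source not in VOLATILITY_RANK]
    if unknown:
        raise ValueError(f"No volatility rank defined for: {unknown}")
    return sorted(sources, key=lambda source: VOLATILITY_RANK[source])


available = ["disk", "memory", "arp_cache", "registers"]
for step, source in enumerate(collection_order(available), start=1):
    print(f"{step}. capture {source}")
```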