While incident response is reactive by nature, there are steps DFIR teams can take proactively to ensure that, if the worst happens, they will be prepared to respond. In this blog post, we provide three key recommendations to help organizations respond more efficiently and effectively when an incident occurs. Specifically, we will discuss:
- Configuring cloud logging
- Creating accounts and resources for responders
- Understanding the environment
Although this is by no means an exhaustive list, it is a starting point that can dramatically improve how your organization handles incidents. You should also take time to identify and implement proactive controls aimed at defending against threats, but our focus in this post is on preparing for when those controls fail.
1. Configuring cloud logging
When it comes to the cloud, our biggest source of evidence is logs. Whether it's sign-in logs, audit logs, resource logs, or any of the other logs available, logging is what provides visibility into activity in the environment. Without logging, investigating incidents becomes very difficult. While the specific settings and configurations for each cloud service provider (CSP) are outside the scope of this post, the high-level guidance here can help your organization understand the importance of cloud logging and ultimately implement it properly, based on a few key considerations. Specifically, we will discuss the following two topics related to logging:
- Enabling non-default events
- Storing and centralizing logs
Before diving into each of these categories, it is important to recognize that enabling additional services and/or turning on additional service features (e.g., logging additional events, increasing log retention periods, or even simply enabling logging for a service) may incur additional charges from the service provider. Most CSPs provide pricing details and calculators if you want to estimate roughly how much investment these actions require, so we won't go in-depth on incurred charges in this post.
Enabling Non-Default Events
Most cloud providers have a set of logs for each service that are enabled by default yet may offer additional, often valuable, logs that are disabled by default. There is considerable variation between which CSP services have which logs enabled by default, but generally the pattern is that events related to managing or administering the environment are enabled by default, while events related to specific resource activity are likely disabled by default.
Some examples of management logs enabled by default:
- Sign-in activity
- Creating new users
- Assigning roles or policies
- Creation or deletion of a virtual machine (or other resources)
Some examples of resource activity logs disabled by default:
- Data read or write activity
- Flow logs
- Application or OS logs from VMs
It is important to understand which logging events are enabled vs. disabled, because a disabled event can leave you with a major visibility gap during an incident. For example, say your organization keeps sensitive data in an AWS S3 bucket, and you're tasked with determining whether that data was exposed as part of a breach. If you haven't proactively enabled S3 data event logging or server access logging - neither of which is on by default - you will have a significant gap in your visibility and will be unable to conclusively determine whether or not data was exposed.
That’s not to say that all non-default events should be turned on. That is unrealistic from a data processing and cost perspective. Instead, your organization needs to evaluate what resources need to be monitored at what logging level and apply policies based on your requirements. Start by focusing on enabling additional logging for sensitive data and resources.
Storing and Centralizing Logs
On the topic of storing and, preferably, centralizing logs, there are a few aspects to discuss. The first is how you are going to store your logs. Many CSPs provide multiple methods for accessing logs. For example, Azure allows you to view logs in the Azure Portal, send them to a Log Analytics Workspace or Storage Account, or export them via Event Hubs or the Graph API. Each method has its pros and cons, and it's up to your organization to decide which is best for you.
Location, location, location. Ideally, logs from all data sources will be centralized into a single location. This is particularly critical when investigating incidents, as it allows for quicker correlation and identification of related events and significantly reduces response time. Having to go to multiple locations and/or services to find logs creates gaps in visibility and increases analyst overhead.
There are in-cloud options for this provided by each CSP, as well as the option to leverage APIs or other cloud services to export CSP logs to an external service such as a SIEM or log aggregation tool. Keep in mind that exporting logs and leveraging cloud-native log aggregation services both incur additional charges, once again emphasizing the importance of identifying which logs are of value to your organization and what your retention policies should be.
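As one illustration of rerouting logs to centralized long-term storage, assuming AWS (the log group and bucket names are hypothetical), the parameters for a CloudWatch Logs export to S3 can be assembled like this:

```python
# Sketch: exporting a CloudWatch log group to an S3 archive bucket for
# centralization. Assumes AWS; names are hypothetical, and a real export
# also needs a bucket policy granting CloudWatch Logs write access.
from datetime import datetime, timedelta, timezone

def build_export_task(log_group: str, bucket: str, days: int = 30) -> dict:
    """Parameters for logs.create_export_task covering the last `days` days."""
    now = datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    return {
        "taskName": f"export-{log_group.strip('/').replace('/', '-')}",
        "logGroupName": log_group,
        "fromTime": int(start.timestamp() * 1000),  # epoch milliseconds
        "to": int(now.timestamp() * 1000),
        "destination": bucket,
        "destinationPrefix": "cloudwatch-exports",
    }

if __name__ == "__main__":
    params = build_export_task("/aws/lambda/payments", "central-log-archive")
    # boto3.client("logs").create_export_task(**params)  # live call
    print(params["taskName"])
```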
It's critical to consider your log retention period on a service-by-service basis. For example, the default logs for any of the CSPs typically come with a restricted retention period, sometimes as low as 30 days. Many incidents involve long dwell times, and a short retention period can greatly impact your ability to see the whole picture and get to the root cause of an incident. For that reason, it's ideal to increase log retention by rerouting logs to another service or storage location. However, as previously mentioned, the longer the retention period, the more data you generate, transmit, and store - and hence the more charges you will incur.
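Once logs land in long-term storage, retention can be enforced there to balance visibility against cost. A minimal sketch, assuming AWS S3 (the prefix and day counts are illustrative, not recommendations), of a lifecycle configuration that archives logs after 90 days and deletes them after two years:

```python
# Sketch: an S3 lifecycle configuration enforcing log retention.
# Assumes AWS; the prefix and day counts are illustrative choices that
# should come from your own retention requirements.

def build_log_lifecycle(prefix: str, archive_after: int = 90,
                        expire_after: int = 730) -> dict:
    """Lifecycle rules: move logs to Glacier, then delete them."""
    return {"Rules": [{
        "ID": f"retain-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": archive_after, "StorageClass": "GLACIER"},  # cold storage
        ],
        "Expiration": {"Days": expire_after},  # final deletion
    }]}

if __name__ == "__main__":
    config = build_log_lifecycle("cloudtrail/")
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="central-log-archive", LifecycleConfiguration=config)
    print(config["Rules"][0]["ID"])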
2. Creating accounts and resources for responders
If an incident occurs in the cloud, incident responders will likely need access to a variety of services and service data to investigate. You do not want to waste precious time during an incident getting your IR team access to the necessary resources. Proactively creating IR-specific accounts not only reduces the risk of over-provisioning an account's permissions during the stress of an incident - potentially leading to further compromise - but also allows needs to be scoped out well beforehand, in keeping with the principle of least privilege.
Typically, the permissions needed by DFIR teams fall somewhere between those of network admins and global admins. Depending on how your architecture is structured, the IR team will likely need visibility over your entire organization (all projects, subscriptions, management groups, OUs, etc.). The level of permissions required, however, will vary: read-only for some services (e.g., the ability to read logs), while write permissions may be required for others (such as the ability to create snapshots). This is best determined well ahead of time with a tabletop exercise, in which the goal is to identify the steps potentially involved in an IR engagement and ensure that responders will have all the permissions needed to take action.
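As a starting point for that scoping exercise, assuming AWS IAM (the action list below is an illustrative subset, not a vetted policy), a responder policy might pair broad read access with only the write actions needed to preserve evidence:

```python
# Sketch: a least-privilege IAM policy document for responders.
# Assumes AWS; the actions below are an illustrative subset - your
# tabletop exercise should produce the authoritative list.
import json

READ_ACTIONS = [
    "cloudtrail:LookupEvents",
    "logs:GetLogEvents",
    "logs:FilterLogEvents",
    "ec2:Describe*",
    "s3:GetObject",
    "s3:ListBucket",
]
WRITE_ACTIONS = [
    "ec2:CreateSnapshot",  # preserve disk evidence
    "ec2:CreateTags",      # label artifacts tied to a case
]

def build_responder_policy() -> dict:
    """IAM policy: read evidence broadly, write only to preserve it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "ReadEvidence", "Effect": "Allow",
             "Action": READ_ACTIONS, "Resource": "*"},
            {"Sid": "PreserveEvidence", "Effect": "Allow",
             "Action": WRITE_ACTIONS, "Resource": "*"},
        ],
    }

if __name__ == "__main__":
    print(json.dumps(build_responder_policy(), indent=2))
```

Keeping read and write actions in separate statements makes it easy to audit, and to tighten, the small set of mutating permissions responders hold.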
In another blog post on cloud DFIR, we talked about the capabilities that the cloud provides to incident responders. One of those was the ability to run forensic workstations in the cloud. Not only does this reduce egress costs when working with cloud data, but it also prevents responders from being limited by the hardware in their possession. To take advantage of this capability to its fullest, we recommend creating a forensic VM image ahead of time that has all the tools required to carry out investigations. Even more effective would be the use of infrastructure-as-code (IaC) templates to deploy all resources required for a DFIR lab, such as the VMs, networking requirements, permissions, and more. All three major CSPs provide this type of service:
- AWS CloudFormation
- Azure Resource Manager (ARM) templates
- Google Cloud Deployment Manager
First, determine your requirements for a forensic workstation - both compute power and installed software - as well as what connectivity and permissions are needed within the environment. Once you have this information, develop an infrastructure-as-code template based on those requirements, allowing any responder with the right permissions to spin up their own forensic workstation in minutes rather than hours.
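Such a template can start as small as the launch parameters for a single VM. A sketch, assuming AWS EC2 (the image ID, subnet, instance size, and disk size are hypothetical placeholders standing in for your requirements analysis):

```python
# Sketch: launch parameters for a forensic workstation built from a
# prebuilt image. Assumes AWS EC2; the AMI/subnet IDs and sizing are
# hypothetical and should match your own requirements analysis.

def build_forensic_workstation_params(image_id: str, subnet_id: str) -> dict:
    """Parameters for ec2.run_instances launching one analysis VM."""
    return {
        "ImageId": image_id,            # prebuilt image with DFIR tooling
        "InstanceType": "m5.2xlarge",   # sized for memory/disk analysis
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,          # isolated analysis subnet
        "BlockDeviceMappings": [{
            "DeviceName": "/dev/sda1",
            "Ebs": {"VolumeSize": 500, "VolumeType": "gp3", "Encrypted": True},
        }],
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "dfir-workstation"}],
        }],
    }

if __name__ == "__main__":
    params = build_forensic_workstation_params("ami-0123456789abcdef0",
                                               "subnet-0abc123")
    # boto3.client("ec2").run_instances(**params)  # live call
    print(params["InstanceType"])
```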
3. Understanding the environment
This concept may appear very broad, but its importance cannot be overstated: while fundamental similarities exist, DFIR in the cloud can be very different from DFIR elsewhere. One of the challenges of doing DFIR in cloud environments vs. on-premises is that responders need a thorough understanding of cloud concepts in addition to organization-specific details. Responders assigned to cloud incidents without a true understanding of how the cloud works may be unable to successfully perform an investigation or remediate threats. There are plenty of free resources online that responders can use to get up to speed, as well as high-quality paid training opportunities, such as FOR509: Enterprise Cloud Forensics and Incident Response.
After gaining an understanding of the cloud as a whole and any concepts specific to the CSP(s) you use, seek to grow your understanding of how your organization leverages the cloud. Going into an incident without ever having worked in your organization's cloud environment will make it very challenging to interpret the activity you see in logs and to know what risks your organization may be vulnerable to. Outside of engagements, the DFIR team should connect with your cloud administrators and seek to understand how the organization's cloud environment is structured, how permissions are assigned, what policies exist and how they are enforced, which cloud services are used, and any other information that will be needed during response. All of this should be documented in a place accessible to responders during incidents and for organizational reference.
Conclusion
In this blog post, we’ve provided specific steps that can be taken to significantly improve your organization's cloud incident response efficiency and efficacy. We focused on expanding cloud logging capabilities, providing access and resources to responders, and developing an understanding of your cloud environment. This list is by no means exhaustive, and is instead meant to provide a starting point for your cloud DFIR journey, and strengthen your organization’s overall security posture.