Importance Of Structured Incident Response Process

Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA

WRITTEN: 2004

Contents
Example
Introduction
SANS Six Step Incident Response Methodology
Incident Response Tools
Example Corporation – Worm Incident Revisited
Common Mistakes of Incident Response
Conclusion
Glossary

DISCLAIMER: Security is a rapidly changing field of human endeavor. Threats we face literally change every day; moreover, many security professionals consider the rate of change to be accelerating. On top of that, to be able to stay in touch with such ever-changing reality, one has to evolve with the space as well. Thus, even though I hope that this document will be useful to my readers, please keep in mind that it was possibly written years ago. Also, keep in mind that some of the URLs might have gone 404; please Google around.
Example

Right around lunchtime, a helpdesk operator at Example Corporation, a medium-sized manufacturing company, receives calls from several users all reporting computer failures and slow network response. Example Corporation’s security infrastructure includes firewalls, intrusion detection systems, anti-virus software and operating system logs, all technology investments from the “boom” years. The helpdesk operator opens a new trouble ticket in Remedy, describing the users’ problems and recording the machines’ hostnames. Other unrelated support issues continue to pile up and the operator’s attention is directed elsewhere.

Meanwhile, the worm that caused the users’ problems continues to spread throughout Example’s network. The malicious software made its way into Example on the laptop of a salesperson who often plugs it into untrusted networks outside the company, such as those of hotels and customers. With most of Example’s security monitoring capabilities deployed in the DMZ and on the network perimeter, the remainder of Example’s vulnerable corporate assets are largely unguarded and unwatched. Thus, as the worm wends its way around Example’s enterprise, the company security team is not even aware of a developing disaster.
Soon, network traffic generated by the worm has increased dramatically, as more machines become infected and start spewing copies of the same worm. When the infection reaches critical levels and starts to affect the performance of monitored servers, the security team is notified by a flood of pager alerts… chaos ensues. While some try installing anti-virus updates, others apply firewall blocks (preventing not only worm scanning, but also the download of updates), and yet others scan for vulnerable machines, which contributes to the network-level denial of service.

After hours of uncoordinated activities, most of the worm-carrying machines are discovered and the re-infection rate is brought under control. A management-requested investigation begins and computer forensic consultants are brought in. However, what remained of the initial infection evidence was either destroyed or extremely hard to find due to the “mitigation” activities that were implemented. No one remembered the original Remedy incident recorded by the helpdesk operator, since the helpdesk system was not deemed relevant for security information. The investigation was able to conclude only that the malicious software was brought in from outside the company; the specific initial infection vector was never determined.

The financial and technological damage is easy to see. The all-too-common security incident described above shows what happens when companies lack a central point from which to manage security incidents.
Introduction

Security professionals learn to constantly chant the mantra “prevention-detection-response.” Each of these three components is known to be of crucial importance to the organization’s security posture. However, unlike detection and prevention, response is impossible to avoid. While it is not uncommon for organizations to have weak prevention and nearly non-existent detection capabilities, response will have to be there, since the organization will often be forced into response mode by the attackers (be it the internal abuser, the omnipresent “script kiddie” or the elusive “uber-hacker”) or their evil creations (viruses, worms and spyware). The organization will likely be made to respond in some way after the incident has taken place. Even where ignoring the incident is the chosen option, the organization is implicitly following a response plan, even if one as ineffective as doing nothing.

In light of this, being prepared for incident response is likely to be one of the most cost-effective security measures the organization takes. Timely and effective incident response is directly related to decreasing the incident-induced loss to the organization. It can also help to prevent expensive and hard-to-repair reputation damage, which often follows a security incident. Several industry surveys have identified that a public company's stock price may plunge
several percent as a result of a publicly disclosed incident (http://www.securityfocus.com/news/11197). Incidents that are known to wreak catastrophic results upon organizations may involve malicious hacking, virus outbreaks, economic espionage, intellectual property theft, network access abuse, theft of IT resources and other policy violations.

Most of us in the security industry are already familiar with the traditional challenges we face every day… too much security data to sift through, too many false alarms to deal with, and not enough budget or resources to handle an ever-growing number of security incidents. One additional and often overlooked challenge involves the security management process itself. Largely ignored in many of today’s IT enterprises, a clearly defined, documented, and repeatable incident management process defined in an incident response plan is fundamental to ensuring fast and accurate handling of security incidents. Even if an explicit incident response plan is lacking, after the incident occurs questions such as these might be asked by the company management:

• What to do now?
• How to put it the way it was?
• How to prevent recurrence?
• How should we have prepared?
• Should we try to figure out who is responsible?
Answering these questions requires knowledge of your computing environment, company culture and internal procedures, and the technical security and policy countermeasures in place. Effective incident response fuses together technical and non-technical resources, bound by the incident response policy, procedures and plans. Such policy should be continuously refined and improved, based on the organization's incident history, just as the main security policy should be.

To build an initial incident management framework, one can use the SANS Six Step incident response methodology. This approach was originally developed for the US Department of Energy, adopted elsewhere in the US government and then popularized by the SANS Institute (http://www.sans.org/rr/whitepapers/incident/). The methodology includes the following six steps:

1. Preparation
2. Identification
3. Containment
4. Eradication
5. Recovery
6. Follow-Up
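The ordering of the six steps can be made explicit in a case-tracking record. The following is only a minimal sketch in Python, not part of the SANS methodology itself: the `IncidentCase` class, its fields, and the rule that a stage needs at least one note before the case can move on are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The six SANS steps, in the order listed above."""
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    FOLLOW_UP = 6


@dataclass
class IncidentCase:
    """Hypothetical case record; names and rules are illustrative assumptions."""
    title: str
    # Preparation happens before any incident, so a new case starts at Identification.
    stage: Stage = Stage.IDENTIFICATION
    notes: dict = field(default_factory=dict)

    def add_note(self, text: str) -> None:
        """Record an observation or action under the current stage."""
        self.notes.setdefault(self.stage, []).append(text)

    def advance(self) -> Stage:
        """Move to the next stage, refusing to skip undocumented stages."""
        if not self.notes.get(self.stage):
            raise ValueError(f"no notes recorded for {self.stage.name}")
        if self.stage is Stage.FOLLOW_UP:
            raise ValueError("case is already in its final stage")
        self.stage = Stage(self.stage + 1)
        return self.stage


case = IncidentCase("Worm outbreak")
case.add_note("Helpdesk reports slow network on several machines")
case.advance()  # the case is now in the Containment stage
```

Requiring a note before each transition is one simple way to encourage the record keeping that later stages depend on; a real case management system would likely apply richer rules.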
SANS Six Step Incident Response Methodology

Overall, the SANS methodology allows an organization to give structure to the otherwise chaotic incident response workflow. The steps of the SANS methodology are both clearly defined and easy to follow and, most importantly, work in the high-stress post-incident environments for which they were designed. Following the steps is as easy as selecting and appropriately customizing the procedures for each case at hand. Using the SANS pre-defined procedures helps ensure that the incident response workflow is relatively painless and that crucial steps are not missed. Additionally, such a system will facilitate both training and collaboration between various response team members, who can share the workload for increased efficiency.

Finally, integrating the SANS methodology into overall incident response planning gives today’s IT organizations a comprehensive approach to tackling security incidents. It also demonstrates compliance with industry “best practices”, which is sometimes associated with regulatory compliance. Having a repeatable incident management process is highlighted in several recent regulations, such as HIPAA.

Let’s spend just a moment reviewing a few key features of the SANS Six Step Incident Response methodology:

The Preparation stage covers everything one should do before handling the first incident. It involves both technology issues, such as preparing response and forensics tools, learning the environment, and configuring systems for optimal response and monitoring, and business issues, such as assigning responsibility, forming a team and establishing escalation procedures. Additionally, this stage covers the steps necessary to strengthen a company’s security posture and thus decrease the likelihood of, and damage from, future incidents.
Security audits, patch management, employee security awareness program and other security tasks all serve to prepare the organization for incident action. Building a culture of security and a secure computing environment also serves as incident preparation. Specifically, establishing a real-time system and network security event monitoring program will help to receive early warnings about the hostile activities as well as collect evidence after the incident. Providing a single view into your security infrastructure goes a long way towards being more prepared and equipped to deal with the incidents as they occur as well as cleaning up in the aftermath. Single evidence storage allows performing sophisticated data analysis, leading to better awareness of threats and vulnerabilities. Identification is what happens first when an incident is suspected or detected. Determining whether the observed event does in fact constitute an incident (as
defined above) is of crucial importance. Careful record keeping is very important, since such documentation will be heavily used at later stages of the response process. One should record everything that was observed in relation to the incident, whether online or in the physical environment. During this stage, it is important that the people responsible for incident handling maintain the proper chain of custody (explained at http://en.wikipedia.org/wiki/Chain_of_custody as a “document or paper trail showing the seizure, custody, control, transfer, analysis, and disposition of physical and electronic evidence.”). Contrary to popular opinion, this is important even when the case is never destined to end up in court. Following established and approved procedures will help an investigation that is internal to the company.

Various security technologies play a role in incident identification. For example, firewall, IDS, server and application logs reveal evidence of potentially hostile activities, coming from both outside and inside the protected perimeter. Logs are often paramount in finding the party responsible for those activities. Security event correlation is essential for high-quality incident identification, due to its ability to uncover patterns in the incoming security event flow. Collecting various audit logs and correlating them in near real-time goes a long way towards making the identification step of the response process less laborious. Additionally, incident identification is greatly helped by “qualifying” IDS and other alerts using environment context, such as system and application vulnerabilities, running applications and business value.

Containment is what keeps the incident from spreading and thus incurring higher financial or other loss. During this stage, the incident responders will intervene and attempt to limit the damage, for example by tightening network or host access controls, changing system passwords or disabling accounts.
While completing the above steps, one should make every effort to keep all the potential evidence intact, balancing the needs of system owners and incident investigators. Backing up the affected systems is also essential at this step, in order to preserve them for further investigation as well as remediation. The important decision on whether to continue operating the affected assets should be made by the appropriate authorities during this stage. Automated containment measures, such as firewall blocking, system reconfiguration or forced file integrity checks, and the use of intrusion prevention solutions (in inline mode), can also be used if driven by event correlation and more intelligent analytics. Automated containment is not yet widely accepted, but it will likely become so in the future.

Eradication is the stage at which the factors leading to the incident are eliminated or mitigated. Such factors often include system vulnerabilities, unsafe system configurations, out-of-date protection software or even imperfect physical access control. Non-technology controls, such as building access policies or key card privileges, might also be adjusted at this stage. In the case of a
hacker-related incident, the affected systems are likely to be restored from the last clean backup or rebuilt from the operating system vendor media with all applications reinstalled. Time is most critical during the eradication stage. The first response should satisfy several often conflicting criteria, such as accommodating the system owners' requests, preserving evidence and stopping the spread of damage, while complying with all the appropriate organization's policies.

Recovery is the stage where the organization's operations return to normal. Systems are restored, configured to prevent recurrence and returned to regular use. To ensure that the newly established controls are working, the organization might want to maintain increased monitoring of the affected assets for some period of time. Return to production is always a critical step. If done too early, there is a significant risk of recurrence; if done too late, it risks upsetting the business owners. Thus, the criteria for it should be clearly documented in the incident procedures during the preparation stage.

Follow-Up is an extremely important stage of the incident response process. Just like the preparation stage above, proper incident follow-up helps to ensure that lessons are learned from the incident and that the overall security posture improves as a result. It is also important for preventing the recurrence of similar incidents. In addition, a report on the incident is often submitted to senior management. It covers the actions taken, summarizes the lessons learned and also serves as a knowledge repository in case of similar incidents in the future. Follow-up steps often need to be distributed to a wider audience than the rest of the investigation process. An enterprise-wide security knowledge base helps to address this challenge. It helps IT resource owners be more prepared to combat future threats.
To optimize the distribution of incident information, one can use various forms and templates, prepared in advance for different types of incidents. Properly sanitized past incident cases should also be added to an organization-wide security knowledge base, in addition to industry security resources and vulnerability knowledge. Such materials can later be used for training new incident responders as well as a broader IT audience. A summary of suggested actions might also be sent to senior management.
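Returning to the Identification step, the idea of “qualifying” an alert with environment context can be sketched in a few lines of Python. This is an illustrative sketch only: the `Asset` and `Alert` records, the 1 to 5 scales and the scoring rule (multiply severity by business value when the target is known to be vulnerable) are assumptions, not a formula from the SANS methodology or from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Hypothetical asset inventory entry."""
    hostname: str
    vulnerabilities: frozenset  # identifiers of known unpatched issues
    business_value: int         # assumed scale: 1 (low) to 5 (critical)


@dataclass(frozen=True)
class Alert:
    """Hypothetical IDS alert."""
    target: str                   # hostname the alert refers to
    exploited_vulnerability: str  # identifier the signature relates to
    severity: int                 # assumed scale: 1 (low) to 5 (high)


def qualify(alert: Alert, assets: dict) -> int:
    """Return a priority score for the alert using asset context.

    Alerts against unknown or non-vulnerable targets keep their raw
    severity; alerts against vulnerable targets are weighted by the
    business value of the asset.
    """
    asset = assets.get(alert.target)
    if asset is None or alert.exploited_vulnerability not in asset.vulnerabilities:
        return alert.severity
    return alert.severity * asset.business_value


inventory = {
    "web-aux": Asset("web-aux", frozenset({"VULN-A"}), business_value=2),
    "erp-db": Asset("erp-db", frozenset({"VULN-A"}), business_value=5),
}
alerts = [Alert("web-aux", "VULN-A", 4), Alert("erp-db", "VULN-A", 4)]
ranked = sorted(alerts, key=lambda a: qualify(a, inventory), reverse=True)
```

The same raw alert ranks higher when it targets the more valuable asset, which is the kind of prioritization the qualification step is meant to provide. Alerts against non-vulnerable targets are kept rather than discarded, since failed attempts can still be worth investigating.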
Incident Response Tools

While people and processes are important, tools are what complete the security triangle. When an incident is suspected, the response team will need tools to verify its status, assess the damage that has been incurred as well as the damage that may yet occur, and then proceed to contain and recover from the incident. This involves a wide range of tools, from intrusion detection to forensics and vulnerability management. Backup tools should also not be overlooked. Tools helpful for incident management can be organized as follows:

Tools | Common uses during incident response
Evidence collection and storage | System and security logs, audit trails, disk images, email and other communication
Data analysis and forensics | Correlation, searching and reporting, forensics discovery activities
Collaboration | Incident team communication, workflow, team management
Backup | Evidence preservation, “known good” configuration retention, user data recovery
Documentation | Actions logged for audit and improvement, reporting, incident team performance measurement, lessons learned, future team training
Some tools are helpful in more than one of the above categories. For example, a Security Information Management (SIM) solution often holds most of the evidence from the scene of the information security incident. Incident handling is a natural SIM product functionality, aimed at gathering and organizing security event data around incidents and at enforcing proper response workflow in order to facilitate effective and prompt response to security incidents. Specifically, a SIM:

• Facilitates an effective handling process
• Integrates evidence storage and analysis
• Enforces proper access control to evidence
• Enables team collaboration
• Simplifies resolution monitoring and reporting
• Makes security measurable

In general, it establishes a single control point for the security response capabilities by combining the major potential evidence storage with the investigative platform.

Other tools that an incident team needs to be very familiar with include disk image forensics tools, covering the whole lifecycle from making a forensic copy of the suspect’s workstation to the final presentation of evidence to an internal authority or law enforcement. Those tools do require significant training, especially if used for cases where a court trial is likely.
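One small building block behind evidence storage and chain of custody is recording a cryptographic digest of each piece of evidence together with who handled it and when. The sketch below uses Python's standard `hashlib` and `datetime` modules; the record layout and the function names (`fingerprint`, `custody_entry`, `unchanged`) are hypothetical and are not taken from any particular SIM or forensics product.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the evidence as a hex string."""
    return hashlib.sha256(data).hexdigest()


def custody_entry(evidence_id: str, data: bytes, handler: str, action: str) -> dict:
    """Build one chain-of-custody log entry (illustrative layout)."""
    return {
        "evidence_id": evidence_id,
        "sha256": fingerprint(data),
        "handler": handler,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def unchanged(log: list, data: bytes) -> bool:
    """Check that the evidence still matches the digest recorded at collection."""
    return bool(log) and log[0]["sha256"] == fingerprint(data)


# Example: a log excerpt is collected, then handed over for analysis.
excerpt = b"Jan 12 12:03:11 fw01 DENY tcp 198.51.100.7 -> 192.0.2.10:80\n"
log = [custody_entry("EV-001", excerpt, "helpdesk operator", "collected")]
log.append(custody_entry("EV-001", excerpt, "security analyst", "received for analysis"))
```

For large disk images one would hash the file in chunks rather than load it into memory, and the log itself would need access control and tamper protection; both are omitted here for brevity.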
Example Corporation – Worm Incident Revisited
A network helpdesk operator receives calls from several users, all reporting computer failures and slow network response. With a newly established process, a trained team and the right tools in place, an incident case is opened according to the plan, and the user complaints from that department are summarized and presented to all relevant parties, including the security team contact. The affected machines, together with information on their owners, are also added to the corresponding case fields. The operator then assigns the case to the security event monitoring team, as mandated by his instructions, which are derived from the incident plan.

Upon receiving the assignment through the case management system, a monitoring team member runs several queries searching for suspicious events to and from the affected machines, all as part of the incident identification procedure defined by the company. He discovers that a network IDS has detected an email worm being transmitted from outside the environment. The monitoring team member shares the incident case with the security analyst team, which runs the intrusion detection systems, so they can verify the impact of the IDS events based on the business role and importance of the affected assets. Many events from the anti-virus systems running on some of the users' desktops were also reported for the affected IP addresses.

As a next step, an analyst selects a Containment procedure from the knowledge base, which involves quarantining the infected machines by applying a firewall rule to prevent the spread of the worm. The procedure is added to the incident case and then implemented. Next, it is necessary to clean the infected PCs. The Mitigation procedure involves installing a freshly updated copy of the anti-virus software and running a full scan. The security engineering team, together with the security analyst team, verifies compliance of the newly installed anti-virus system with the company's anti-virus policy.
The recommended Follow-up procedure includes a mandated company-wide desktop anti-virus deployment from a dedicated server. The procedure is then submitted for management approval and, once approved, the remediation team ensures that the anti-virus software is pushed out to all company desktop PCs, and the incident case is closed.

Here is another example of how a company with a well-tuned incident response process handles an attack against a web server. A security analyst on duty received an email notification when a correlated event indicating a successful attack was triggered by the SIM solution. The analyst discovered that a real-time correlation rule was matched by a series of events directed against the auxiliary web server. By logging into the SIM and running a report, the analyst found out that the triggered rule aims to detect high-severity attacks against the web server, which
are preceded by reconnaissance activity, such as a server version query. The web server was first probed for its type and version and later attacked with a known exploit detected by the network intrusion detection system. The company security monitoring procedure mandated that such events be investigated. Thus, the analyst clicked on the correlated event in the corresponding report and chose to add it to a new incident case. He then added a note saying that he had received an email notification and started the investigation in accordance with the security procedure.

After the case was registered by the system, the analyst proceeded to investigate the related events. He opened the report to view the raw security events that triggered the correlation. Such events included probes against multiple servers followed by an attack. He looked at the attack details and found out that the IDS signature for the exploit matched the server type and the operating system. He added all the related events to the incident case as well. Further, he ran a query to look for more traces of the attacker’s IP address (the source) in the event database. Multiple entries indicative of scanning, denied connections on the firewall and TCP port 80 attempts across the enterprise were discovered. The report results were also added to the incident case.

At that stage it was obvious that a consistent attack was in progress. A note was added to the Identification section of the case saying that the incident was confirmed and several servers might have been impacted. The analyst then searched all events involving the attacked web server. No suspicious activity had originated from it. However, since the server was not a business-critical asset, it was possible to take it offline for investigation. This decision was recorded in the Containment section of the incident case and the server was taken offline. The detailed server investigation that followed did not reveal any signs of a successful compromise.
However, the server logs contained evidence of multiple failed exploit attempts. The server was also found to be missing several critical patches. Their absence was apparently not detected by the attacker. It was decided to patch the server before the regular maintenance window and to return it online. It was also decided to increase the logging level on the server. A corresponding note was made in the Mitigation section of the incident case and the above steps were performed. After the server was returned to operation, the analyst assigned the case to the incident manager, who had the authority to review the performed steps and to close the case. The manager added several notes to the Follow-up section, which
suggested that servers in that subnet be scanned for vulnerabilities more often. The case was then closed.
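The correlation rule in the web server example, a high-severity attack preceded by reconnaissance from the same source, can be sketched as follows. This is an assumption-laden illustration in Python, not the rule language of any SIM product: the `Event` record, the two event kinds and the one-hour window are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized security event."""
    timestamp: float  # seconds since some reference point
    source_ip: str
    target: str
    kind: str         # "probe" (e.g. a server version query) or "exploit"


def correlate(events, window_seconds=3600.0):
    """Return exploit events preceded by a probe from the same source
    against the same target within the given time window."""
    last_probe = {}  # (source_ip, target) -> time of the most recent probe
    matches = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source_ip, ev.target)
        if ev.kind == "probe":
            last_probe[key] = ev.timestamp
        elif ev.kind == "exploit":
            probed_at = last_probe.get(key)
            if probed_at is not None and ev.timestamp - probed_at <= window_seconds:
                matches.append(ev)
    return matches


events = [
    Event(0.0, "198.51.100.7", "web-aux", "probe"),
    Event(100.0, "198.51.100.7", "web-aux", "exploit"),   # probe, then attack: matches
    Event(200.0, "203.0.113.9", "web-aux", "exploit"),    # no prior probe: no match
    Event(5000.0, "198.51.100.7", "web-aux", "exploit"),  # outside the window: no match
]
correlated = correlate(events)
```

In the scenario above, each correlated event would be what the analyst adds to a new incident case; the uncorrelated exploit attempts remain in the event database for later queries.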
Common Mistakes of Incident Response

While many organizations are on the path towards organizing their incident response, many pitfalls lie in wait for them on the way to incident management nirvana. This section summarizes several mistakes that companies make in their security incident response.

#1. Not having a plan

The first mistake is simply not creating an incident response plan before incidents start happening. Having a plan in place (even a plan that is not well thought out) makes a world of difference! Such a plan should cover all the stages of the incident response process, from preparing the infrastructure, to first response, all the way to learning the lessons of a successfully resolved incident. If you have a plan, then after the initial panic phase ('Oh, my, we are being hacked!!!') you can quickly move into a set of planned activities, including a chance to contain the damage and curb the incident losses. Having a checklist to follow and a roster of people to call is of paramount importance in a stressful post-incident environment. To jump-start the planning activity, one can use a ready-made methodology, such as the SANS Institute six-step incident response process covered above. With a plan and a methodology, your team will soon be battle-hardened and ready to respond to the next virus faster and more efficiently. As a result, you might manage to contain the damage to your organization.

#2. Failing to increase monitoring and surveillance

The second mistake is not deploying increased monitoring and surveillance after an incident has occurred. This is akin to shooting yourself in the foot during the incident response. Even though some companies cannot afford 24/7 security monitoring, there is no excuse for not increasing monitoring after an incident has occurred. At the very least, one of the first things to do after an incident is to crank up all the logging, auditing and monitoring capabilities in the affected network and systems.
This simple act has the potential to make or break the investigation by providing crucial evidence for identifying the cause of the incident and resolving it. It often happens that later in the response process, the investigators discover that some critical piece of log file was rotated away or an existing monitoring feature was forgotten in an 'off' state. Having plenty of data on what was going
on in your IT environment right after the incident will not just make the investigation easier, it will likely make it successful. A side benefit is that increased logging and monitoring will allow the investigators to confirm that they have indeed followed the established chain of custody.

#3. Being unprepared for a court battle

The third mistake is often talked about, but rarely avoided. Some experts have proclaimed that every security incident needs to be investigated as if it will end up in court. In other words, forensic quality and the established chain of custody need to be maintained during the investigation. Even if the case looks as if it will not go beyond the suspect's manager or the human resources department (in the case of an internal offense) or even the security team itself (in many external hacking and virus incidents), there is always a chance that it will end up in court. Cases have gone to court after new evidence was discovered during an investigation, and what was thought to be a simple issue of inappropriate Web access became a criminal child pornography case. Moreover, while you might not be expecting a legal challenge, the suspect might sue in retaliation for a disciplinary action against him or her. A seasoned incident investigator should always consider this possibility. In addition, following a high standard of investigative quality always helps, since the evidence will be that much more reliable and compelling if it can be backed up by a thorough and well-documented procedure.

#4. Putting it back the way it was

The fourth mistake is reducing your incident response to "putting it back the way it was". This often happens if the company is under a deadline to restore functionality. While this motive is understandable, there is a distinct possibility that failing to find out why the incident occurred will lead to repeat incidents, on the same or different systems.
For example, in the case of a hacking incident, if an unpatched machine that was compromised is rebuilt from the original OS media, but the exploited vulnerability is not removed, the hackers are very likely to come back and take it over again. Moreover, the same fate will likely befall other exposed systems. Thus, while returning to operation might be the primary goal, don’t lose sight of the secondary goal: figuring out what happened and how to prevent it from happening again. It feels bad to be on the receiving end of the successful attack,
but it feels much worse to be hit twice by the same threat and have your defenses fail in both cases. Incident response should not be viewed as a type of "firefighting", although you’d fight plenty of fires in the process. It can clearly help in case of a fire, but it can also help prevent fires in the future.

#5. Not learning from mistakes

The final mistake sounds simple, but it is all too common. It is simply not learning from mistakes! Creating a great plan for incident response and following it will take the organization a long way toward securing the company, but what is equally important is refining your plan after each incident, since the team and the tools might have changed over time. Another critical component is documenting the incident as it is occurring, not just after the fact. This ensures that the "good, the bad and the ugly" of the handling process will be captured and studied, and that lessons will be drawn from it. The results of such evaluations should be communicated to all the involved parties, including IT resource owners and system administrators. Ideally, the organization should build an incident-related knowledge base, so that procedures are consistent and can be repeated in practice. The latter is very important for regulatory compliance as well and will help satisfy some of the Sarbanes-Oxley requirements for auditing the controls over information.
Conclusion

While the above cases are simplistic in nature, they readily show the need for any security management system to have not only an incident response plan but also an integrated incident handling system to ensure that the plan is completely and effectively put into practice. Having a highly efficient plan helps organizations save money by limiting the impact of security incidents on core business and increasing the efficiency of existing security infrastructure investments.

Overall, the SANS process allows one to give structure to the otherwise chaotic incident response workflow. It defines the steps that will then be followed with high precision under incident-induced stress. In fact, many of the above steps may be built from pre-defined procedures. Following the steps will then be as easy as selecting and sometimes customizing the procedures for each case at hand. The incident handling workflow will become more streamlined, and the crucial steps will not be missed and will be documented properly. Using pre-defined procedures also helps train the incident response staff on proper actions for each process step. An automated system may be built to keep track of the response workflow, to suggest proper procedures for various steps and to securely handle incident evidence. Additionally, such a
Page 12
system will facilitate collaboration between various response team members, who can share the workload for increased operational efficiency. What is even more important, monitoring incident resolution activities allows the organization to implement effective security metrics. It is one thing to count number of alerts or events flowing from various sensors, but to take security assessment to the next level one needs to measure the performance of the whole security process, involving both people (such as security team members working on the incident cases) and technologies. ABOUT THE AUTHOR: This is an updated author bio, added to the paper at the time of reposting in 2009. Dr. Anton Chuvakin (http://www.chuvakin.org) is a recognized security expert in the field of log management and PCI DSS compliance. He is an author of books "Security Warrior" and "PCI Compliance" and a contributor to "Know Your Enemy II", "Information Security Management Handbook" and others. Anton has published dozens of papers on log management, correlation, data analysis, PCI DSS, security management (see list www.info-secure.org) . His blog http://www.securitywarrior.org is one of the most popular in the industry. In addition, Anton teaches classes and presents at many security conferences across the world; he recently addressed audiences in United States, UK, Singapore, Spain, Russia and other countries. He works on emerging security standards and serves on the advisory boards of several security start-ups. Currently, Anton is developing his security consulting practice, focusing on logging and PCI DSS compliance for security vendors and Fortune 500 organizations. Dr. Anton Chuvakin was formerly a Director of PCI Compliance Solutions at Qualys. Previously, Anton worked at LogLogic as a Chief Logging Evangelist, tasked with educating the world about the importance of logging for security, compliance and operations. 
Before LogLogic, Anton was employed by a security vendor in a strategic product management role. Anton earned his Ph.D. degree from Stony Brook University.
Glossary
Security event is a single observable occurrence as reported by a security device or application or noticed by the appropriate personnel. Thus, both an IDS alert and a security-related helpdesk call qualify as security events.
Security incident is an occurrence of one or several security events that have the potential to cause undesired functioning of IT resources or other related problems. This limits our discussion to information security incidents, which cover computer and network security, intellectual property theft and many other issues.
Incident response (or IR) is a process of identification, containment, eradication and recovery from computer incidents performed by a responsible security team. It is worthwhile to note that the security team might consist of just one person, who might only be a part-time incident responder. However, whoever takes part in dealing with the incident consequences implicitly becomes part of the incident response team, even if no such team formally exists within the organization.
Incident case is a collection of evidence and associated workflow related to a security incident. Thus, the case is a history of what happened and what was done, with evidence supporting both. It might include various documents such as reports, security event data, results of audio interviews, image files and other materials.
Incident report is a document prepared as a result of an incident case investigation. An incident report might be cryptographically signed or have other assurances of its integrity. Most incident investigations will result in a report submitted to appropriate authorities (either internal or outside the company), which might contain some or even all data associated with the case.
It is worthwhile to note that the term evidence, as used throughout the chapter, indicates any data discovered in the process of incident response.
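The relationship between these glossary terms can be sketched in Python as follows. This is only an illustration under stated assumptions: the names SecurityEvent, IncidentCase, build_report and verify_report are hypothetical, and a plain SHA-256 digest stands in for the "other assurances of integrity" mentioned above. A digest detects accidental or naive modification but is not a cryptographic signature, since anyone who can alter the report can also recompute the digest.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SecurityEvent:
    """A single observable occurrence, e.g. an IDS alert or a helpdesk call."""
    source: str
    description: str


@dataclass
class IncidentCase:
    """Evidence (events) plus workflow (actions) for one security incident."""
    case_id: str
    events: List[SecurityEvent] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

    def build_report(self) -> dict:
        # Serialize the case deterministically, then attach a SHA-256 digest
        # so that later changes to the report body can be detected.
        body = json.dumps(asdict(self), sort_keys=True)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return {"body": body, "sha256": digest}


def verify_report(report: dict) -> bool:
    """Return True if the report body still matches its recorded digest."""
    actual = hashlib.sha256(report["body"].encode("utf-8")).hexdigest()
    return actual == report["sha256"]


if __name__ == "__main__":
    case = IncidentCase("CASE-1")
    case.events.append(SecurityEvent("helpdesk", "Users report slow network"))
    case.events.append(SecurityEvent("IDS", "Worm propagation alert"))
    case.actions.append("Isolated infected laptop")
    report = case.build_report()
    print(verify_report(report))
```

In this sketch one or more events make up a case, and the report is derived from the case, mirroring the definitions above; an organization that needs non-repudiation would replace the digest with a real signature scheme.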