CSIRT, I have a project for you. We have a big network and we’re definitely getting hacked constantly. Your group needs to develop and implement security monitoring to get our malware and hacking problem under control.

If you’ve been a security engineer for more than a few years, no doubt you’ve received a directive similar to this. If you’re anything like me, your mind probably races a mile a minute thinking of all of the cool detection techniques you’re going to develop and all of the awesome things you’re going to find.

I know, I’ll take the set of all hosts in our web proxy logs doing periodic POSTs and intersect that with…

STOP!

You shouldn’t leap before you look into a project like this.

You can put any competent security engineer in front of a bunch of network and host logs and they’ll be able to find dozens of infections in the first day. Presumably your organization is big enough to need more than one security investigator/analyst. How are you going to organize and maintain your monitoring over the long term? If you think you can just install a bunch of IDS boxes and dump the data into a SIEM to extract actionable data out of your network events, your monitoring will be ineffective. You need a way to maintain and update your monitoring over the long term. You need a way of integrating security intelligence / “Indicators of Compromise” into your monitoring. You need to document your monitoring and how you will act on hits. In short, you need a network security monitoring and incident response playbook. At Cisco, our CSIRT group has one. Let me tell you about it.

It’s no secret that security is inherently complicated, with a large number of disparate data sources and types of security logs and events. Speaking as an engineer facing so much complexity, my tendency is to build a monitoring system so hacked together only MacGyver could appreciate and maintain it. If your company is anything like Cisco, you have a huge amount of network complexity: overlapping RFC 1918 addresses, offices in dozens of countries, business units doing their own thing, and IPsec tunnels, among other things. At the same time, surely you’re collecting IDS events, AV logs, NetFlow, client HTTP requests, server syslog, authentication logs, and many other valuable data sources. Beyond your data sources, you also have intelligence sources from the broader security community as well as in-house security knowledge and other indicators of hacking and compromise. With such a broad landscape of security data sources and knowledge, the natural tendency is towards complex monitoring systems. Of course, complexity is the enemy of reliability and maintainability, so something must be done to combat the inexorable drift.

Enter the Playbook

Our Playbook is our answer to this complexity. At its heart, it’s a collection of “plays” that each generate a report from some set of data sources. What makes plays so useful is that they aren’t just some complex query or code to find bad stuff; each play also carries its own documentation.

By building the documentation into the play we’ve directly coupled the motivation for the play, how it gets analyzed, the specific query for it, and any additional information needed to both run the play and act upon the report results. To be clear, the Playbook is for organizing and documenting security monitoring. It isn’t an incident response handbook or a policy document or any other type of security document or handbook. The Playbook may reference things like the Incident Response Handbook or Acceptable Use Policy, but it isn’t a replacement for these.

At the heart of it, every play contains a set of sections:

Report ID

Report Type with Name

Objective Statement

Result Analysis

Data Query/Code

Analyst Comments/Notes

I’ll discuss each of these.
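As a rough illustration of how the sections fit together, a play can be pictured as a single record. This is a minimal sketch, not how our Playbook is actually stored: the class name `Play`, its field names, and the `missing_sections` helper are all hypothetical and exist only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Play:
    """Hypothetical record mirroring the sections of a play listed above."""

    report_id: str        # e.g. "600002"
    report_name: str      # report type, event source, category, description
    objective: str        # plain-language "what" and "why"
    result_analysis: str  # how to interpret and act on the results
    query: str            # data query or code, specific to the data store
    comments: list[str] = field(default_factory=list)  # analyst notes

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        required = ("report_id", "report_name", "objective",
                    "result_analysis", "query")
        return [name for name in required if not getattr(self, name).strip()]


# Usage: a draft play whose objective has not been written yet.
draft = Play(
    report_id="600002",
    report_name="INV-HIPS-MALWARE: Detect surreptitious / malicious use "
                "of machines for Bitcoin mining",
    objective="",
    result_analysis="<analysis text>",
    query="<query text>",
)
print(draft.missing_sections())
```

Keeping the comments as a list that only grows matches the idea that notes accumulate alongside the play rather than overwrite it.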

Report ID and Report Type with Name

Our report IDs use a Dewey Decimal-like numbering system where the leading digit indicates the data source. 1 is for IDS events, 3 is for the transparent web proxy logs, 6 is for our HIPS logs, and so forth. The digits immediately after the leading digit are padded with 0s to leave room for expansion and for subcategories of future data sources and feeds. The remaining portion of the report ID is a unique, mostly incrementing, report number.

The remaining portion of the report name contains the Type of report (currently “investigative” or “high fidelity”), the Event Source (which matches the leading digit in the ID), the report Category (for example Malware or APT or Policy), and a sentence fragment Description.

For example: 600002-INV-HIPS-MALWARE: Detect surreptitious / malicious use of machines for Bitcoin mining
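To make the naming convention concrete, here is a minimal sketch that splits a report name like the one above into its parts. The regular expression, the `ReportName` type, and the `parse_report_name` function are assumptions made for illustration; only the three leading-digit meanings (1, 3, 6) come from the description above, and the sketch does not try to guess where the padding digits end and the report number begins.

```python
import re
from typing import NamedTuple

# Leading-digit meanings named in the text; other digits are not listed there.
DATA_SOURCES = {
    "1": "IDS events",
    "3": "transparent web proxy logs",
    "6": "HIPS logs",
}

# Assumed layout: <id>-<type>-<event source>-<category>: <description>
NAME_PATTERN = re.compile(
    r"^(?P<report_id>\d+)-(?P<type>[A-Z]+)-(?P<source>[A-Z0-9]+)"
    r"-(?P<category>[A-Z0-9]+):\s*(?P<description>.+)$"
)


class ReportName(NamedTuple):
    report_id: str
    data_source: str   # looked up from the leading digit of the ID
    report_type: str
    event_source: str
    category: str
    description: str


def parse_report_name(name: str) -> ReportName:
    """Split a report name into its fields, or raise ValueError."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognizable report name: {name!r}")
    report_id = match["report_id"]
    return ReportName(
        report_id=report_id,
        data_source=DATA_SOURCES.get(report_id[0], "unknown"),
        report_type=match["type"],
        event_source=match["source"],
        category=match["category"],
        description=match["description"],
    )


parsed = parse_report_name(
    "600002-INV-HIPS-MALWARE: Detect surreptitious / malicious use "
    "of machines for Bitcoin mining"
)
print(parsed.data_source, parsed.report_type, parsed.category)
```

A parser like this is mostly useful as a consistency check: a name that fails to parse, or whose leading digit disagrees with its event source, is worth a second look.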

Objective Statement

The objective statement is an English-language description of the “what” and “why” of a play. The target audience for objective statements is not security or network professionals. The objective statements are intended to provide background information and good reasoning for why the play exists. Ultimately the goal of the objective statement is to describe to a layperson what a play is looking for on the network and leave them with a basic understanding of why the play is worthwhile to run. The objective shouldn’t be too detailed with specifics and shouldn’t contain information or malicious indicators like IP addresses, malware URLs, binary names, file hashes, or any other indicator not needed to understand the high-level details of a play.

Here is an example objective:

Today malware is a business. Infecting machines is usually just a means to financial ends. Some malware sends spam, some steals credit card information, some just displays advertisements. Ultimately the malware authors need a way of making money by compromising systems.

With the advent of Bitcoin, there is now an easy way for malware authors to directly and anonymously make use of the computing power of infected machines for profit.

This report looks for processes that appear to be participating in the Bitcoin network that don’t obviously announce that they are Bitcoin miners.
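The guideline that objectives stay free of specific indicators can be checked mechanically, at least in part. The following is a minimal sketch under stated assumptions: the function `find_indicators` and its patterns are hypothetical, the patterns are deliberately loose, and they cover only IPv4 addresses, URLs, and hex strings the length of common file hashes. Things like binary names still need a human reviewer.

```python
from __future__ import annotations

import re

# Loose, illustrative patterns; they will miss some indicators and may
# flag harmless text (for example, a version string that looks like an IP).
INDICATOR_PATTERNS = {
    "IPv4 address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "URL": re.compile(r"https?://\S+", re.IGNORECASE),
    # Hex strings of 32, 40, or 64 characters (MD5, SHA-1, SHA-256 lengths).
    "file hash": re.compile(
        r"\b(?:[0-9a-fA-F]{64}|[0-9a-fA-F]{40}|[0-9a-fA-F]{32})\b"
    ),
}


def find_indicators(objective: str) -> list[tuple[str, str]]:
    """Return (kind, matched text) pairs for indicator-like strings."""
    findings = []
    for kind, pattern in INDICATOR_PATTERNS.items():
        for match in pattern.finditer(objective):
            findings.append((kind, match.group(0)))
    return findings


# Usage: the last paragraph of the example objective contains no indicators.
objective = (
    "This report looks for processes that appear to be participating in "
    "the Bitcoin network that don't obviously announce that they are "
    "Bitcoin miners."
)
print(find_indicators(objective))
```

An empty result does not prove an objective is well written; it only shows that none of these three indicator shapes slipped in.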

Result Analysis

The result analysis section is written for a junior-level security engineer and provides the bulk of the documentation and training material needed to understand how the data query works, why it’s written the way it is, and how to interpret and act upon the results of the query. This section discusses the fidelity of the query, what expected true positive results look like, the likely sources of false positives, and how to prioritize the analysis and tune out or skip over the false positives. The analysis section can vary a lot from play to play because it’s very specific to the data source, how the query works, and what the report is looking for.

One of the main goals of the analysis section is to help the security engineer running the play and looking at report results act on the data. To facilitate smooth handling of escalations when actionable results are found, the analysis section must be as prescriptive as possible. It must describe what to do, all of the related/interested parties involved in an escalation, and any other special handling procedure.

For high fidelity plays, every result is guaranteed to be a true positive, so the analysis section focuses more on what to do with the results rather than the analysis of them.

Data Query / Code

The query portion of the play is not designed to be stand-alone or portable. The query is what implements the objective and produces the report results, but the specifics of how it does that just don’t matter. All of the details of the query needed to understand the results are documented in the analysis section. Any remaining under-the-hood details are inconsequential to the play and the analyst processing the report results. Queries can sometimes be rather complex due in part to being specific to whatever system the data lives in. For us that’s primarily Splunk.

Analyst Comments / Notes

We manage our Playbook using Bugzilla. Using a bug/ticket tracking system like Bugzilla allows us to track changes and document the motivation for those changes. Any additional useful details of a play that don’t belong in the aforementioned sections end up in the comments section. For a given objective, there are often a number of ways to tackle the idea in the form of a data query. The comments allow for discussion among the security engineers about various query options and the best way to approach the play objective. The comments also provide a place for clarifications and remarks about issues with the query or various gotchas.

Most plays need occasional maintenance and tuning to better handle edge cases and tune out noise or false positives. The comments allow the analysts processing reports to discuss tweaks and describe what is and isn’t working about a report. By keeping all of the notes about a play as addendums, it’s possible to read the evolution of the play. This enables us to keep the Playbook relevant long term.

The Playbook in Practice

One of the biggest benefits of our Playbook is that it’s very flexible. Even though information security is a constantly changing field, the Playbook approach enables us to keep up. Instead of being a rigid framework that stifles creativity, the open-ended nature of play objectives allows our security engineers to document ideas and explore ways of achieving the objective. We’re comfortable with innovative pie-in-the-sky objectives because the notes allow us to iteratively improve the query and analysis to zero in on the objective. Worst case, we have to reject or retire a play because we can’t find a way to reasonably achieve the objective with our data sources. Plays tend to be created by one person but improved democratically by anyone on the team with valuable input. In the cases where we have competing ideas and can’t reach a consensus, we tend to fork the play and do both (provided the approaches aren’t completely redundant). The iterative, democratic approach to plays ensures that the Playbook is a living document, always up to the task of handling tomorrow’s security challenges.

Hi Houndekindo, it's not a mitigation tool but a response tool that enables the team to quickly identify and contain a cyber incident. Time is of the essence when responding to cyber incidents and threats. A playbook provides the steps to take when a particular incident has occurred and keeps response procedures consistent. In other words, it's a critical component of cybersecurity preparedness for the team and the company.
On incident tracking, you need a method for categorizing incidents so you can identify trends, capture lessons learned, and use that information to drive risk management practices.
