This chapter from Security Operations Center: Building, Operating, and Maintaining your SOC focuses on the technology and services associated with most modern SOC environments, including an overview of best practices for data collection, how data is processed so that it can be used for security analysis, vulnerability management, and some operational recommendations.


“If all you have is a hammer, everything looks like a nail.”—Abraham Maslow

Chapter 1, “Introduction to Security Operations and the SOC,” provided a general overview of security operations center (SOC) concepts and referenced a number of technologies that offer SOC services such as vulnerability management, threat intelligence, digital investigation, and data collection and analysis. This chapter covers the details of these technologies using a generic and product-agnostic approach. This gives you a fundamental understanding of how these technologies function so that the concepts can be related to the products covered later in this book. This chapter also covers data collection and analysis, such as how a security information and event management (SIEM) system collects and processes log data.

In this chapter, we continue to reference open source code deployments and industry-recognized architectures whenever applicable to illustrate how to deploy and customize SOC technologies. These technologies are used to develop a conceptual architecture that integrates the different SOC technologies and services. After reading this chapter, you should understand the technology and service expectations for most modern SOC environments.

Let’s start by introducing the most fundamental responsibility for a SOC: collecting and analyzing data.

Data Collection and Analysis

You first need to acquire relevant data before performing any sort of useful analysis. In the context of security monitoring, data from various sources and in different formats can be collected for the purposes of security analysis, auditing, and compliance. Data of special interest includes event logs, network packets, and network flows. You might also sometimes want to actively probe systems or collect content, such as the static content of a web page or the hash value of a file.

Generating and capturing event logs is crucial to security operations. Events can directly or indirectly contribute to the detection of security incidents. For example, a high-priority event generated by a properly tuned intrusion detection system would indicate an attack attempt. However, a high-link-utilization event generated by a router might not be a security event by definition but could indicate a security incident when correlated with other events, such as a denial-of-service (DoS) attack or a compromised host scanning your network.

Every organization operates its own unique and continuously evolving list of services and systems. However, the basic questions related to data collection and analysis remain similar across all organizations:

Which elements should you monitor?

What data should you collect and in what format?

What level of logging should you enable on each element?

What protocols should you use to collect data from the various elements?

Do you need to store the data you collect, and if so, for how long?

Which data should you parse and analyze?

How much system and network overhead does data collection introduce?

How do you associate data-collection requirements with capacity management?

How do you evaluate and optimize your data-collection capability?

Chapter 7, “Vulnerability Management,” and Chapter 9, “The Technology,” address many of these questions. The fundamental idea is not to monitor everything for the sake of monitoring, but rather to design your data-collection capability so that your technical and compliance objectives are met within the boundaries you are responsible for monitoring. For example, depending on your network topology and the elements you want to collect data from, you might want to distribute your data collectors and centralize your monitoring dashboard so that you have multiple visibility points feeding into one place to monitor all events. Another example is comparing the cost and return of investing in collecting and storing packets versus leveraging NetFlow in existing network assets for security forensic requirements.

Regardless of the technology and design selected, the key is that the final product provide neither too much nor too little data. We find that many failures experienced by a SOC result from poor data-collection practices. These can be caused by many factors, from blind spots created by how data is collected, to not correlating the right data, such as flagging vulnerable systems that do not even exist on the network. Sometimes the proper tools are enabled but their clocks are not properly synchronized, causing confusion when troubleshooting. We address these and other best practices for collecting data later in this chapter under logging recommendations.

In principle, the type of data to acquire and what the data originator supports determine the collection mechanism to deploy. For example, most enterprise-level network devices natively support the syslog protocol for the purpose of remote event logging, whereas other systems require the installation of an agent to perform a similar function.

Understanding your exact environment and identifying the elements from which you can acquire useful data are the initial steps in the process of building a data-collection capability. The conceptual steps shown in Figure 2-1 represent this process. Data can be stored in a flat file, a relational database, or a distributed file system such as the Hadoop Distributed File System (HDFS). The analyze step can use various techniques, such as statistical anomaly detection, event correlation rules, or machine learning applied to the data. Starting from the SOC design phase, you should formalize and document all processes and procedures, including your choices of technology.
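The conceptual collect, store, parse, and analyze steps can be sketched as a toy pipeline. This is a minimal Python illustration, not a real implementation: the key=value events are hypothetical, and the trivial "analysis" step stands in for the correlation rules or machine learning mentioned above.

```python
def collect(source):
    """Pull raw events from a source; here, a hypothetical in-memory list."""
    return list(source)

def store(raw_events, repository):
    """Keep the original, unparsed events for forensics and compliance."""
    repository.extend(raw_events)
    return raw_events

def parse(raw_events):
    """Extract fields from each raw string; real parsers use per-source schemas."""
    return [dict(token.split("=", 1) for token in e.split()) for e in raw_events]

def analyze(events):
    """A trivial 'analysis' that flags denied connections. Real deployments
    apply correlation rules, anomaly detection, or machine learning here."""
    return [e for e in events if e.get("action") == "deny"]

repository = []
raw = collect(["action=deny src=10.0.0.5", "action=permit src=10.0.0.9"])
alerts = analyze(parse(store(raw, repository)))
```

Note that the original events survive in `repository` even after parsing, matching the practice of keeping raw data alongside its structured form.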

After data has been collected, you can decide whether to store it, parse it, or both. Although storing data in its original format can be beneficial for the purposes of digital investigations, out-of-band security analytics, and meeting compliance requirements, it is important to note that data at this point is regarded as being unstructured, meaning the exact structure is still unknown or has not been validated. To understand the structure of the data, parsing is required to extract the different fields of an event. For the data to have any use to the organization, be aware that when storing original data, regardless of the form, you must have a repository that can accept it and tools that can later query, retrieve, and analyze it. Many factors can determine what type and how much data the SOC should store, such as legal and regulatory factors, cost to manage the stored data, and so on. Let’s look at the different types of data sources.

Data Sources

Logging messages are considered the most useful data type to acquire. Logging messages summarize an action or an activity that took place on a system, containing information related to an associated event. Depending on your environment, you might want to consider collecting logging messages from various forms of security, network, and application products. Examples of physical and virtual devices that could provide valuable logging messages include the following:

Systems used in process and control networks, such as supervisory control and data acquisition (SCADA) and distributed control system (DCS)

In addition to logging messages, you might want to collect, store, and possibly analyze other forms of data. Examples include collecting network packets, NetFlow, and the content of files such as configuration files, hash values, and HTML files. Each of these data sources provides unique value, but each has its own associated costs to consider before investing in methods to collect and analyze the data. For example, storing network packets typically has a higher cost for collecting and storage but can provide more granular detail on events than NetFlow. Some industry regulations require storage of packet-level data, making capturing packets a must-have feature. For customers looking for similar forensic data at a lower price, collecting NetFlow can be a less-expensive alternative, depending on factors such as existing hardware, network design, and so on.

To better understand the cost and value of collecting data, let’s look deeper at how data can be collected.

Data Collection

After you have an idea about the data you want to collect, you must figure out how to collect it. This section reviews the different protocols and mechanisms that you can use to collect data from various sources. Depending on what the data source supports, data can be pulled from the source to the collector or pushed by the source to the collector.

It is important to emphasize the need for time synchronization when collecting data. Capturing logs without proper time stamping could cause confusion when evaluating events and could corrupt results. The most common way a SOC enforces time synchronization across the network is by leveraging a central timing server using the Network Time Protocol (NTP). Best practice is to have all services and systems, including those that generate, collect, and analyze data, synchronize their clocks with a trusted central time server. Chapter 6, “Security Event Generation and Collection,” discusses how to best design your NTP implementation for your SOC.

The Syslog Protocol

The syslog protocol, as defined in IETF RFC 5424, provides a message format that enables vendor-specific extensions to be provided in a structured way, in addition to conveying event notification messages from a syslog client (originator) to a syslog destination (relay or collector). The syslog protocol supports three roles:

Originator: Generates syslog content to be carried in a message

Collector: Gathers syslog messages

Relay: Forwards messages, accepting messages from originators or other relays and sending them to collectors or other relays

Figure 2-2 shows the different communication paths between the three syslog roles, noting that a syslog client can be configured with multiple syslog relays and collectors.

Implementations of syslog commonly use User Datagram Protocol (UDP) port 514 to forward events. It is also possible to run the protocol over a reliable transport, such as Transmission Control Protocol (TCP), as per IETF RFC 3195. Syslog does not natively provide security protection in terms of confidentiality, integrity, and authenticity. However, these security features can be delivered by running syslog over a secure network protocol, such as Transport Layer Security (TLS) or Datagram Transport Layer Security (DTLS), as described in RFCs 5425 and 6012, respectively. These approaches are more secure, but typically at the cost of additional overhead, and some systems might not support these protocols. It is recommended to review the product’s configuration guide to verify possible performance impacts and capabilities before implementing such features.
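To make the message format and the default UDP transport concrete, here is a hedged Python sketch that builds a minimal RFC 5424-style message and could forward it to a collector. The hostname, application name, and collector address are hypothetical, and real deployments would normally rely on the platform's syslog daemon rather than hand-rolled code.

```python
import socket
from datetime import datetime, timezone

def build_rfc5424_message(facility, severity, hostname, app_name, msg):
    """Build a minimal RFC 5424 syslog message.

    PRI is facility * 8 + severity; VERSION is 1; the unused PROCID,
    MSGID, and STRUCTURED-DATA fields are set to the nil value '-'.
    """
    pri = facility * 8 + severity
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"<{pri}>1 {timestamp} {hostname} {app_name} - - - {msg}"

def send_syslog_udp(message, collector_ip, port=514):
    """Forward the message to a collector over UDP, the default transport."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (collector_ip, port))

# Facility 4 (security/auth) at severity 4 (warning) yields PRI 36.
event = build_rfc5424_message(4, 4, "web01", "sshd",
                              "Failed password for invalid user admin")
```

A call such as `send_syslog_udp(event, "192.0.2.1")` would push the event to a hypothetical collector at that address.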

Syslog is generally supported by network and security solutions for the purpose of event logging. UNIX and UNIX-like operating systems support syslog through daemons such as rsyslog and syslog-ng. Similarly, Microsoft Windows platforms require the installation of an agent to forward events in syslog format.

Regardless of the syslog client, you need to configure at least the following parameters:

Logging destinations: The collector, relay IP addresses, or hostnames. Depending on the implementation, the originator can forward syslog messages to one or more destinations.

Protocol and port: Typically these are set to UDP and port 514 by default. The option of changing this setting is implementation dependent.

Logging severity level: Can be a value ranging from 0 to 7, as shown in Table 2-1.

Table 2-1 Logging Severity Levels

Level

Severity

0

Emergency: System is unusable.

1

Alert: Action must be taken immediately.

2

Critical: Critical conditions.

3

Error: Error conditions.

4

Warning: Warning conditions.

5

Notice: Normal but significant condition.

6

Informational: Informational messages.

7

Debug: Debug-level messages.

Logging facility: A value between 0 and 23 that could be used to indicate the program or system that generated the message. The default value assigned to syslog messages is implementation specific. For example, you can assign logging facility values to categorize your events. Table 2-2 shows an example of assigning facility values to asset categories. Other approaches could be designed based on your environment and requirements. The severity and logging facility values could be combined to calculate a priority value of an event, influencing the post-event actions to take.
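The priority calculation mentioned above combines the two values as PRI = facility × 8 + severity, which also means a PRI value can be decoded back into its parts. A small Python sketch:

```python
def priority(facility, severity):
    """Syslog priority value: PRI = facility * 8 + severity."""
    return facility * 8 + severity

def decode_priority(pri):
    """Recover (facility, severity) from a PRI value."""
    return divmod(pri, 8)

# local4 (facility 20) at severity 3 (error) gives PRI 163;
# the maximum value is facility 23 at severity 7, PRI 191.
```

This is why a collector that receives, say, `<163>` in a message header can immediately categorize the event by both originating facility and severity.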

Depending on your setup and requirements, configuring other parameters beyond this list might be required. For example, the SOC may want more granular data by selecting which operating system or application events to log and forward.

Let’s look at a few examples that demonstrate how to configure a syslog client. Example 2-1 shows how to configure a Cisco IOS-based router to forward events by specifying the logging destinations, level, and facility. Note that there are many other parameters available for syslog beyond what we used for these examples. You can find many comprehensive sources available on the Internet that provide a list of available parameters, such as http://www.iana.org/assignments/syslog-parameters/syslog-parameters.xhtml.

Example 2-1 Configuring a Cisco IOS Router for Syslog

With the configuration in Example 2-1, the router would generate sample messages similar to what is shown in Example 2-2. The log messages in Example 2-2 are for CPU and link status updates. Some administrators would consider these messages easy to read at an individual level. Now imagine receiving thousands or even millions of such messages per day from various network device types, each with a unique message structure and content. A firewall is a good example of a network security device that would typically generate a large number of logging messages, overwhelming a security administrator who operates a basic log collection tool.

Next, let’s look at remote logging of messages on Linux distributions. Remote logging of these events can be achieved by running a syslog daemon. Examples include the classic syslogd and commercial and open source logging daemons such as rsyslog and syslog-ng. In the case of Linux, most operating system log files, such as the ones shown in Example 2-3, are located in the /var/log/ directory. For CentOS (Community ENTerprise Operating System) using rsyslog, the syslog configuration is maintained in /etc/rsyslog.conf, shown in Example 2-4. Once again, these logs might be easy to interpret individually, but sorting through a large number of such log events would prove cumbersome for most administrators.

Example 2-4 rsyslog.conf Sample Configuration

# Forward all messages, generated by rsyslog using any facility and
# priority values, to a remote syslog server using UDP.
# By adding this line and keeping the default configuration, the logs
# will be stored on the client machine and forwarded to the log
# server. To limit the log messages sent by rsyslog, you can specify
# facility and priority values.
# Remote host as name/ip:port, e.g. 192.168.0.1:514, port optional
*.* @log_server
# You can use @@ for TCP remote logging instead of UDP
# *.* @@log_server

Example 2-5 Sample Linux Syslog Messages for SSH Access

Pay attention to the log rotation settings for syslog files that are maintained locally on your system. In the case of CentOS, for example, the log rotation settings are maintained in the /etc/logrotate.d directory.

A syslog relay or collector must be ready to receive and optionally process (for example, parse, redirect, and/or enrich) logging messages as required. Your choice of logging server is driven by a number of factors, such as your technical requirements, skill set, scalability of the platform, vendor support, and of course, cost of acquisition and operation. In addition to commercial log management tools such as Splunk and HP ArcSight ESM, a growing number of open source implementations are available, such as Graylog2 and Logstash (part of the Elasticsearch ELK stack).

Although some SIEM products manage security events, they might not be designed for long-term event storage and retrieval. The reason is that some SIEMs’ performance and scalability are limited compared to dedicated log management platforms such as Splunk or Logstash, especially as the amount of data they store and process increases. This is due to how legacy SIEM tools store and query events, which in most cases means a relational database infrastructure. Note that some SIEM vendors are evolving their approach to managing events and deploying big data platforms as their data repository. SIEM vendors that have not made this move are sometimes referred to as legacy.

Logging Recommendations

Enabling logging features on a product can prove useful but also carries an associated cost in performance and functionality. Some settings should be in place before enabling logging, such as time synchronization and local logging as a backup repository for when the centralized logging solution fails. When designing and configuring your syslog implementation, consider the following best practices:

In the context of security operation, log events that are of business, technical, or compliance value.

Configure your clients and servers for NTP, and confirm that clocks are continuously being synchronized.

Time stamp your log messages and include the time zone in each message.

Categorize your events by assigning logging facility values. This will add further context to event analysis.

Limit the number of collectors for which a client is configured to the minimum required. Use syslog relays when you require the same message to be forwarded to multiple collectors. Syslog relays can be configured to replicate and forward the same syslog message to multiple destinations. This scenario is common when you have multiple monitoring platforms performing different tasks such as security, problem management, and system and network health monitoring.

Baseline and monitor the CPU, memory, and network usage overhead introduced by the syslog service.

Have a limited local logging facility, in file or memory, so that logs are not completely lost if the syslog collector is unavailable, such as in the case of network failure.

On a regular basis, test that logging is functioning properly.

Protect your syslog implementation based on evaluating the risk associated with syslog not providing confidentiality, integrity, or authenticity services.

Ensure that log rotation and retention policies are properly set.

Protect files where logs are stored: Restrict access to the system, assign proper files access permissions, and enable file encryption if needed. Read access to log files must be granted only to authorized users and processes. Write access to log files must be granted only to the syslog service. Standard system hardening procedures could be applied to operating systems hosting your logging server.

Logging Infrastructure

There are other elements to consider when designing a logging infrastructure. These include the type of data being received, expected storage, security requirements, and so on. Here are some factors that will influence how you should design your logging infrastructure:

The logging level for which your systems are configured. Remember, configuring a higher logging level results in more logging messages being generated. For example, configuring a firewall for severity level 6 (informational) would result in the firewall generating multiple events per permitted connection: connection establishment, termination, and possibly network address translation.

The amount of system resources available to the syslog client and server in comparison to the number of logging messages being generated and collected. An environment that generates a large amount of logging data might require multiple logging servers to handle the volume of events.

The per-device and aggregate events per second (EPS) rates. This is closely related to the device type, available resources, logging level, security conditions, and the placement of the device in the network. You must consider the expected EPS rate in normal and peak conditions usually seen during attacks. Chapter 6 provides best practices for calculating EPS rates.

The average size (in bytes) of logging messages.

The amount of usable network bandwidth available between the logging client and the logging server.

Protecting syslog messages using secure network protocols such as TLS and DTLS introduces additional load that must be accounted for.

The scalability requirements of the logging infrastructure, which ideally should be linked to capacity planning.

Consider collecting logging messages using an out-of-band physical or logical network. Separating your management plane—for example, by using a separate management virtual LAN (VLAN) or a Multiprotocol Label Switching (MPLS) virtual private network (VPN)—is a good network and system management practice that applies to most devices and systems. You might, however, encounter cases in which a system does not support having a separate physical or logical management interface, forcing you to forward logging messages in-band.
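The factors above, particularly EPS rates and average message size, can feed a rough capacity estimate. The following Python sketch is illustrative only: the 1.5× overhead factor for indexing and metadata is a hypothetical placeholder, and real figures should come from your log management vendor and your own baselining.

```python
def daily_log_volume_bytes(avg_eps, avg_msg_bytes):
    """Approximate raw log volume per day from average EPS and message size."""
    return avg_eps * avg_msg_bytes * 86_400  # seconds per day

def retention_storage_gib(avg_eps, avg_msg_bytes, retention_days,
                          overhead_factor=1.5):
    """Storage estimate (GiB) over a retention window.

    overhead_factor is a hypothetical allowance for indexing and
    metadata; replace it with vendor-supplied figures.
    """
    total = (daily_log_volume_bytes(avg_eps, avg_msg_bytes)
             * retention_days * overhead_factor)
    return total / 1024**3

# Example: 2,000 EPS aggregate at 300 bytes per message is roughly
# 48 GiB of raw logs per day, and on the order of 2 TiB over 30 days
# once the assumed overhead is included.
```

Remember from the list above that peak EPS during an attack can far exceed the average, so sizing only for the average is a common mistake.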

Telemetry Data: Network Flows

Every network connection attempt is transported by one or more physical or virtual network devices, presenting you with an opportunity to gain vital visibility and awareness of traffic and usage patterns. All that is needed is a way to harvest this information from existing network devices such as routers, switches, virtual networking components, and access points. This essentially enables additional visibility from common network equipment, depending on how the network traffic is collected and processed. An example is looking at traffic on switches to identify malware behavior, such as a system passing a file that attempts to spread across multiple devices using uncommon ports, giving the administrator additional awareness of security threats in the environment.

In many cases, capturing and transferring network packets is not required, desired, or even feasible. Reasons could include the cost for storage of the data being captured, skill sets required to use the data, or hardware costs for tools that can capture the data. This can be the case especially when multiple remote locations are connected by a wide-area network (WAN). The alternative to capturing packets is to collect contextual information about network connections in the form of network flow.

The IP Flow Information eXport (IPFIX) protocol, specified in a number of RFCs, including RFC 7011, is a standard that defines the export of unidirectional IP flow information from routers, probes, and other devices. Note that IPFIX was based on Cisco NetFlow Version 9. The standard ports that an IPFIX service listens on, as defined by IANA, are udp/4739 and tcp/4739.

A flow, according to the IPFIX standard, consists of network packets that share the same arbitrary number of packet fields within a timeframe (for example, sharing the same source IP address, destination IP address, protocol, source port, and destination port). IPFIX enables you to define your own list of packet fields to match. In IPFIX, network flow information is exported (pushed) using two types of records: flow data and template. The template record is sent infrequently and is used to describe the structure of the data records.
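The flow definition above (packets sharing the same field values within a timeframe) can be illustrated with a small Python sketch that aggregates hypothetical packet records on the classic 5-tuple. This is a simplification: a real exporter reads packets from an interface and also handles flow timeouts and template records.

```python
from collections import defaultdict

# Hypothetical packet records, as a real exporter might observe them.
packets = [
    {"src": "10.0.0.5", "dst": "192.0.2.10", "proto": "tcp",
     "sport": 52100, "dport": 443, "bytes": 1500},
    {"src": "10.0.0.5", "dst": "192.0.2.10", "proto": "tcp",
     "sport": 52100, "dport": 443, "bytes": 400},
    {"src": "10.0.0.9", "dst": "198.51.100.7", "proto": "udp",
     "sport": 5353, "dport": 53, "bytes": 80},
]

def aggregate_flows(packets):
    """Group packets into unidirectional flows keyed on the 5-tuple:
    source IP, destination IP, protocol, source port, destination port."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["dst"], p["proto"], p["sport"], p["dport"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["bytes"]
    return dict(flows)

flows = aggregate_flows(packets)
# The first two packets collapse into one flow record of 2 packets, 1900 bytes.
```

IPFIX generalizes this idea by letting you choose which packet fields form the key, rather than fixing the 5-tuple.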

Routers and high-end network switches are the most common devices that can capture, maintain, and update flow records using their cache. These devices can export a record when they believe that a flow has completed or based on fixed time intervals. Keep in mind that capturing, maintaining, and exporting network flow information could impact the system’s overall performance depending on the platform being used. Best practice is working through a capacity-planning exercise and consulting with your network vendor on the impact of enabling the feature. Network device vendors generally maintain testing results per platform and are happy to share test results with their customers.

In addition to routers and switches, there is the option of using dedicated hardware appliances that can convert information collected from captured packets into network flow records that can then be exported. Similar to syslog, you can implement a distributed solution with relays that accept, replicate, and forward network flow information to various destinations such as SIEM tools and network flow analyzers. Some vendors, such as Lancope, offer sensor appliances that can add additional attributes while converting raw packets to NetFlow, such as application layer data that typically would not be included in a flow record.

Depending on your platform, a router (or any other flow-collection device) can support sampled or unsampled flow collection, as shown in Figure 2-3 and Figure 2-4, respectively. In the case of sampled flow collection, the router updates its flow records by looking at every nth packet (for example, 1 in every 128) rather than at every packet that traverses it. This behavior makes security threat detection probabilistic, meaning that some flows might be missed. In addition, relying on sampled flows would result in unreliable digital investigations, assuming network flows are part of your investigation artifacts. For these and other reasons, it is recommended to use sampled flow collection only if no other options are available. An analogy comparing sampled and unsampled flow collection is knowing that somebody entered your house within the past few hours versus knowing that a user entered your house a few minutes ago and is currently sitting in your living room. Unsampled details are much more valuable, and best practice is to use unsampled collection whenever possible.
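To see why sampled collection makes detection probabilistic, consider the chance that a short flow is never selected at all. The sketch below assumes each packet is sampled independently with probability 1/n, a simplifying model of 1-in-n sampling, not a property of any particular vendor implementation.

```python
def miss_probability(n, flow_packets):
    """Probability that a flow is never sampled, assuming each packet is
    independently selected with probability 1/n. Deterministic 1-in-n
    samplers behave similarly for flows much shorter than n."""
    return (1 - 1 / n) ** flow_packets

# With 1-in-128 sampling, a short 10-packet flow (a quick scan, for
# example) goes completely unseen roughly 92 percent of the time.
```

This is the quantitative version of the house analogy above: short-lived activity, which is exactly what reconnaissance often looks like, is the most likely to slip through sampled collection.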

One major benefit of using flow-based detection for security is having “the canary in the coal mine” approach for identifying network breaches, meaning detecting unusual behavior that is not linked to an attack signature. An example is a trusted user performing network reconnaissance followed by connecting to sensitive systems that the user has never accessed before. Most common security products, such as firewalls and intrusion prevention system (IPS) technology, would probably ignore this behavior. A flow-based security product, however, could identify the user as being authorized to perform these actions but still flag the unusual behavior as an indication of compromise.

Another benefit of flow-based security is enabling the entire network as a sensor versus limiting visibility to security products. Typically, this also reduces investment cost in new products by leveraging capabilities within existing equipment. Security products may have limitations as to what they can see, because of traffic being encrypted or where they are placed on the network, thus causing security blind spots. It also might not be feasible to deploy security products at multiple remote locations. These and other scenarios are great use cases for using network flow for security analytics.

Let’s look at a few examples of how to enable NetFlow on devices. Example 2-6 shows the steps to configure NetFlow v9 on a Cisco IOS-based router.

Example 2-6 Configuring NetFlow v9 on a Cisco IOS-Based Router

! Configure the NetFlow collector IP address and port
Router(config)# ip flow-export destination {ip_address | hostname} udp_port
! Configure the router to use NetFlow version 9
Router(config)# ip flow-export version 9
! Specify the interface on which to enable NetFlow
Router(config)# interface type number
! Enables NetFlow on the interface:
! ingress: Captures traffic that is being received by the interface
! egress: Captures traffic that is being transmitted by the interface
Router(config-if)# ip flow {ingress | egress}

With IPFIX and NetFlow v9, you can do much more than what is shown in Example 2-6. On a Cisco IOS-based router, you can customize your flow records and define what to match and what data to export. Example 2-7 shows an example of this configuration.

Example 2-7 Configuring NetFlow v9 on an IOS-Based Router with a Customized Record

Chapter 6 delves deeper into NetFlow-based technologies. Now let’s look at a different way to monitor the network using packet-capture technology.

Telemetry Data: Packet Capture

There are cases in which you need to go beyond collecting logging messages and network flow information. An example is the need for deep forensic capabilities to meet strict regulatory requirements for capturing raw network packets. Network traffic can be captured and forwarded to an intrusion detection system (IDS), a deep packet inspection (DPI) engine, or simply to a repository where captured packets are stored for future use. Your choice of packet-capture technology is influenced by the network and media type you need to monitor.

In the case of Ethernet, you can consider two techniques to capture network packets, each with its pros and cons:

Port mirroring: This approach uses network switches to mirror traffic seen on ports or VLANs to other local or remote ports. This is a basic feature supported by most of today’s enterprise-level network switches. The local Switched Port Analyzer (SPAN) configuration for Cisco switches can be used to mirror traffic locally, meaning within the same switch. The remote SPAN (RSPAN and ERSPAN) configuration for Cisco switches extends this feature by allowing remote mirroring of traffic across multiple switches if they are all configured for RSPAN. Note that based on the number of captured packets and the state of your network, copying packets to a remote switch can have implications on the overall performance of the network. In addition, it is important to consider how much oversubscription you would allow when copying traffic. For example, you might not want to mirror traffic from multiple 10-Gbps interfaces on a switch to a single 1-Gbps interface. Best practice is carefully selecting the sources and destinations for port mirroring.

Network taps: Another approach is connecting out-of-band devices in the form of network taps to monitor and capture packets from point-to-point links. Network taps capture and copy network packets without involving the active network components, making them suitable for most environments. Network taps, however, cannot capture some traffic, such as packets that are exchanged locally within a switch. It is also financially infeasible to connect taps to all network links. You would generally connect them to the most important locations in your network, such as your Internet gateways and data centers. Network taps are also ideal for on-demand troubleshooting.

NOTE

Whether continuous or on demand, capturing packets is an expensive operation in terms of the amount of data to collect, transfer, analyze, and eventually store. The cost associated with capturing packets can be determined by the amount of data to acquire; the location in your network; and the network, system, and storage resources available for this purpose.

Capturing syslog messages, network flows, and packets is not very useful if an administrator must manually sift through thousands of events. Even the most trained professionals could miss an important alert or fail to associate events that look trivial as individual alerts but, pieced together, map out to a larger threat. This is where centralized collection solutions show the most value: by parsing and normalizing data so that it can later be used for security analysis that helps administrators identify the most important events to focus on.

Parsing and Normalization

Data that requires further processing and analysis must first be parsed and normalized. Parsing refers to the process of taking raw input in string format and traversing its different fields based on a predefined schema; normalization refers to the process of allowing similar events extracted from multiple sources to be uniformly stored or consumed by subsequent processing steps.

Let’s look at an example of parsing a message generated by iptables (the Linux host-based firewall) for dropped packets on a CentOS Linux host. Example 2-8 shows the original message saved to the local /var/log/kernel.log file and the same message represented in JavaScript Object Notation (JSON) format. The JSON form was created after the message was forwarded to and parsed by the log management platform Logstash. Notice that in this example the received syslog message is parsed, but parsing is not extended to extract the content of the iptables drop message itself. This means that we did not retrieve data such as the action, source and destination IP addresses, TCP ports, TCP header fields, the interface where the packet was dropped, and so on. This can be achieved by creating a parser that follows the iptables logging message schema.
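For illustration, a minimal parser along these lines could be sketched in Python. The log prefix and sample values below are invented, and a production parser would implement the full iptables logging schema:

```python
import re

# A typical iptables LOG message; exact fields vary with the rule's --log-prefix.
line = ("kernel: IPTABLES-DROP: IN=eth0 OUT= MAC=00:0c:29:aa:bb:cc "
        "SRC=203.0.113.5 DST=192.168.1.10 PROTO=TCP SPT=51522 DPT=22")

# Extract the key=value pairs that the message schema defines.
pattern = re.compile(r"(\w+)=(\S+)")
event = dict(pattern.findall(line))

print(event["SRC"], event["DST"], event["PROTO"], event["DPT"])
# → 203.0.113.5 192.168.1.10 TCP 22
```

The resulting dictionary is the kind of structured record that normalization and later correlation steps can consume.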

Parsing of messages such as event logs can make use of regular expressions, also known as regex. Regex are patterns that you can use to extract information from text input. A pattern is expressed as a combination of alphanumeric characters and operators in a syntax that is understood by regex processors. An example is matching the string root in a case-insensitive manner. This can be expressed using either of the regex patterns shown in Example 2-9. Both statements match all possible lowercase and uppercase combinations of the string root (for example, in rooting, -Root!-, or simply RooT).

Regex is commonly used for creating intrusion detection/prevention signatures, where you can quickly create custom regex-based signatures that match patterns of your choice. This allows you to alert and protect against attacks that try to exploit unpatched systems or alert and protect systems that could not be easily patched. An example is protecting legacy applications or devices used in process control networks.
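For instance, a custom Snort rule using the pcre option to flag the string root on a Telnet session might look like the following sketch (the message text and sid are placeholders):

```
alert tcp any any -> any 23 (msg:"Possible root account activity over Telnet"; pcre:"/root/i"; sid:1000001; rev:1;)
```

The `/i` modifier makes the pattern case insensitive, equivalent to the character-class patterns in Example 2-9.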

Example 2-9 Regex Pattern to Match the Non-Case-Sensitive String root

[rR][oO][oO][tT]
OR
[rR][oO]{2}[tT]
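As a quick sanity check, both patterns can be exercised in Python:

```python
import re

# The two equivalent patterns from Example 2-9.
patterns = [r"[rR][oO][oO][tT]", r"[rR][oO]{2}[tT]"]

for p in patterns:
    # Each pattern matches any case combination of "root" inside the text.
    for text in ["rooting", "-Root!-", "RooT"]:
        assert re.search(p, text) is not None

# Strings that do not contain "root" do not match.
assert re.search(patterns[0], "admin") is None
print("all matches verified")
```

In practice, most regex engines also offer a case-insensitivity flag (for example, re.IGNORECASE in Python), which avoids spelling out the character classes.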

Similarly, SIEM tools make use of regex. The schema, or exact structure, of a message must be known beforehand, so SIEM tools must maintain a current schema library for all the different events they can process. In addition, the tools should allow creating custom parsers as required. Failing to properly parse and normalize a message could result in being unable to analyze the data.
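To make normalization concrete, the following sketch maps login-failure events from two different sources onto one common schema; the common field names are invented for illustration, not a standard:

```python
# Two parsed events describing the same kind of activity with vendor-specific fields.
windows_event = {"EventID": "4625", "IpAddress": "10.0.0.5", "TargetUserName": "alice"}
linux_event = {"program": "sshd", "rhost": "10.0.0.9", "user": "bob"}

def normalize_windows(e):
    # Map Windows security log fields onto the common schema.
    return {"event_type": "auth_failure", "src_ip": e["IpAddress"], "user": e["TargetUserName"]}

def normalize_linux(e):
    # Map Linux sshd fields onto the same schema.
    return {"event_type": "auth_failure", "src_ip": e["rhost"], "user": e["user"]}

normalized = [normalize_windows(windows_event), normalize_linux(linux_event)]
# Downstream correlation can now treat both events uniformly.
print([e["src_ip"] for e in normalized])  # → ['10.0.0.5', '10.0.0.9']
```

Once both events share the same field names, a single correlation rule such as "excessive failed logins" can operate on either source.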

Security Analysis

Security analysis refers to the process of researching data for the purpose of uncovering potential known and unknown threats. The complexity of the task varies from performing basic incident mapping to advanced mathematical modeling used to discover unknown threats. Revealing relationships between events within a context is achieved using machine learning-based techniques or knowledge-based techniques, such as rule-based matching and statistical anomaly detection.

Event correlation is the best-known and most widely used form of data analysis. Security event correlation refers to the task of creating a context within which relationships between disparate events, received from various sources, are revealed for the purposes of identifying and reporting on threats. A context can be bound by time, heuristics, and asset value.

Correlation rules come packaged in SIEM tools, and vendors usually offer regular rule set updates as part of a paid support service. These rules can be tuned, or you can create your own; however, it is important to first know the use cases you are looking to address. Most correlation rules offered by SIEM vendors are based on experience gained from their install bases and internal research teams, meaning that they have most likely developed rules for your business requirements. Examples of out-of-the-box correlation rules include flagging excessive failed logins, malware infections, unauthorized outbound connections, and DoS attempts. It is good practice to have the SIEM vendor run through your business scenarios during a proof of concept to validate its correlation and reporting capabilities.

It is common practice to tune the out-of-the-box rules or create your own rules that meet your business requirements. Table 2-3 shows some of the use cases shipped with the Splunk SIEM application, referred to as the Splunk Enterprise Security Application. Note the default thresholds, which you can adjust for each use case.

Table 2-3 Splunk Enterprise Security Correlation Rules

Correlation Search | Description | Default
Endpoint - Active Unremediated Malware Infection | Number of days that the device was unable to clean the infection | 3
Endpoint - Anomalous New Services | Number of new services | 9
Endpoint - Anomalous New Processes | Number of new processes | 9
Endpoint - Anomalous User Account Creation | Number of new user accounts in a 24-hour period | 3
Access - Brute-Force Access Behavior Detected | Number of failures | 6
Access - Excessive Failed Logins | Number of authentication attempts | 6
Endpoint - High Number of Infected Hosts | Number of infected hosts | 100
Endpoint - Host with Excessive Number of Listening Ports | Number of listening ports | 20
Endpoint - Host with Excessive Number of Processes | Number of running processes | 200
Endpoint - Host with Excessive Number of Services | Number of running services | 100
Endpoint - Host with Multiple Infections | Total number of infections per host | > 1
Endpoint - Old Malware Infection | Number of days host had infection | 30 days
Endpoint - Recurring Malware Infection | Number of days that the device was re-infected | 3 days
Network - Substantial Increase in an Event | Number of events (self-baselines based on average) | 3 std dev
Network - Substantial Increase in Port Activity (by Destination) | Number of targets (self-baselines based on average) | 3 std dev
Network - Vulnerability Scanner Detection (by Event) | Number of unique events | 25
Network - Vulnerability Scanner Detection (by Targets) | Number of unique targets | 25

Correlation rules are meant to detect and report on threat scenarios, also referred to as use cases. Before you formalize a use case, you want to answer the following questions:

What methodology should you use to come up with a use case?

For a use case, what logging messages should you collect and from which devices?

Can you achieve the requirements of a use case using existing security controls (for example, by using an existing intrusion detection/prevention system or a firewall)?

How complex is the task of creating or tuning correlation rules?

How do you associate use cases with your risk-assessment program?

How complicated is the use case, and what impact will it have on the performance of your SIEM tool?

Will the rule created for a use case result in an increase in false positives?

The exact use case and your choice of tools determine the complexity of creating or customizing correlation rules. For example, creating a rule that alerts on the use of a clear-text management protocol such as Telnet is straightforward compared to more complex rules that involve multiple sources, messages, and time periods. It is also important to consider the performance impact on the SIEM as your rules grow in number and complexity, along with the overhead of managing customized rules.

Let’s look at an example use case: create a correlation rule that triggers an alert when the same account is used to log in to more than ten data center servers, followed by one or more of these servers establishing one or more outbound TCP connections to external IP addresses within 5 minutes of the login events. This example demonstrates how complex creating correlation rules can be, even for use cases that sound simple. You can express this use case as a nested statement made of a combination of events (content) and operators such as AND, OR, NOT, and FOLLOWED BY (stateful context). In this use case, a context is nothing but an arbitrary set of parameters that describe a particular event or sequence of events. The nested statement for this use case is shown in Example 2-10.

Example 2-10 High-Level Correlation Rule Statement

[
(More than ten successful login events)
AND
(Events are for the same user ID)
AND
(Events generated by servers tagged as data center)
AND
(Events received within a one-minute sliding window)
]
FOLLOWED BY
[
(TCP connection event)
AND
(Source IP address belongs to the data center IP address range)
AND
(Destination IP address does NOT belong to the internal IP
address range)
AND
(Protocol is TCP)
AND
(Events received within five minutes)
]
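To illustrate how stateful this logic is, the following Python sketch evaluates the statement over a time-sorted list of normalized event dictionaries. The event schema, the datacenter tag, and the address ranges are assumptions for illustration, not the syntax of any particular SIEM:

```python
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network

# Illustrative address ranges; real values come from your environment.
INTERNAL = ip_network("10.0.0.0/8")
DC_RANGE = ip_network("10.1.0.0/16")

def matches_use_case(events):
    """Return True if the use case fires. `events` is a time-sorted list of
    normalized event dictionaries."""
    logins = [e for e in events
              if e["type"] == "login_success" and e.get("tag") == "datacenter"]
    for first in logins:
        # More than ten successful logins for the same user ID,
        # received within a one-minute sliding window.
        window = [e for e in logins
                  if e["user"] == first["user"]
                  and timedelta(0) <= e["time"] - first["time"] <= timedelta(minutes=1)]
        if len(window) <= 10:
            continue
        last_login = max(e["time"] for e in window)
        # FOLLOWED BY an outbound TCP connection from the data center
        # range to an external address within five minutes.
        for e in events:
            if (e["type"] == "tcp_connection"
                    and ip_address(e["src_ip"]) in DC_RANGE
                    and ip_address(e["dst_ip"]) not in INTERNAL
                    and timedelta(0) < e["time"] - last_login <= timedelta(minutes=5)):
                return True
    return False
```

A real SIEM evaluates such rules incrementally over streaming events rather than over a complete list, but the nesting of conditions and the FOLLOWED BY time window map directly onto the statement in Example 2-10.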

After a custom statement has been created, the next step is to convert the statement to a rule following the syntax used by your SIEM tool of choice. Commercial SIEM tools provide a graphical interface for you to complete this task. An alternative is outsourcing rule creation to a third-party consultant or to the SIEM vendor’s professional services. We recommend first verifying with the SIEM vendor that there is not an existing rule or rules that meet your needs before investing time and money into creating customized correlation rules.

Despite the fact that a use case might look simple, converting it to a rule might not be easy. Even if you can convert the previous example into a correlation rule, what about more complicated ones? In addition, how far can you grow your rule base, and what performance impact would that have on your tool? Let’s look at some alternatives to rule-based correlation.

Alternatives to Rule-Based Correlation

Anomaly-based correlation is another approach that can be combined with rule-based correlation. Detecting anomalies relies on first statistically profiling your environment to establish a baseline. After you have a baseline, the SIEM can identify activity patterns that deviate from it, alerting on potential security incidents. Profiling an environment typically generates multiple baselines, such as the following:

Traffic rate baseline such as average EPS and peak EPS per day of the week

Network baseline looking at protocol and port usage per day of the week

System baseline monitoring average and peak CPU and memory usage, average number of running services, and user login attempts per day of the week
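As a sketch of how a simple statistical baseline can flag deviations, the following compares an observation against the mean and standard deviation of historical samples. The sample EPS values are invented, and the three-standard-deviation threshold mirrors the self-baselining rules in Table 2-3:

```python
from statistics import mean, stdev

# Hypothetical events-per-second samples collected over previous days.
baseline_eps = [950, 1020, 980, 1010, 990, 1000, 970]

def is_anomalous(observed, history, threshold=3.0):
    # Flag observations more than `threshold` standard deviations above the mean.
    mu = mean(history)
    sigma = stdev(history)
    return observed > mu + threshold * sigma

print(is_anomalous(5000, baseline_eps))  # a sudden spike → True
print(is_anomalous(1005, baseline_eps))  # within normal variation → False
```

A production system would maintain separate baselines per hour and day of the week, since "normal" traffic at 3 a.m. differs from normal traffic at noon.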

When it comes to profiling peaks, it is important to record not only the highest values reached but also the durations over which noticeable increases in usage were observed, thus adding statefulness to your profiling process. Figure 2-5 is a histogram showing the distribution of syslog messages sent from Linux hosts in the past 24 hours. The distribution shows a spike in the number of events lasting around 30 minutes. This type of event would generally trigger the interest of a security analyst. Figure 2-6 zooms in to this data to identify two different periods of high syslog activity corresponding to what was shown in Figure 2-5. In this specific example, the first period is short and corresponds to the installation of system patches on a number of hosts, and the second (and longer-lasting) period corresponds to a wider remote system compliance check. These might not have been malicious events; however, anomaly detection helps administrators stay aware of changes in their environment. This proves useful, for example, when users complain that the network is running slowly during the spike periods.

Another approach that could also be combined with rule-based correlation is risk-based correlation, also referred to as algorithmic correlation. The basic idea is to calculate a risk score for an event based on the content and context of the event. Risk scores can be based on asset value, source IP address reputation, geolocation, reported user role (for example, a Lightweight Directory Access Protocol [LDAP] group), and so on. This approach is useful when you do not have much visibility into the use cases you require or when configuring correlation rules is complex. The challenge with this approach is the work required to design the risk formula and assign values to the input types being considered.
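One possible shape for such a risk formula is sketched below; the weights, asset types, and reputation values are purely illustrative assumptions, not a recommended scoring model:

```python
# Illustrative weights; a real deployment would derive these from its risk assessment.
ASSET_VALUE = {"dc-server": 10, "workstation": 3}
REPUTATION_PENALTY = {"bad": 10, "suspicious": 5, "good": 0}

def risk_score(event):
    # Combine content (source reputation) and context (asset value, user role).
    score = ASSET_VALUE.get(event["asset_type"], 1)
    score += REPUTATION_PENALTY.get(event["src_reputation"], 0)
    if event.get("user_role") == "admin":
        score += 5  # privileged accounts raise the stakes
    return score

event = {"asset_type": "dc-server", "src_reputation": "bad", "user_role": "admin"}
print(risk_score(event))  # → 25
```

Events whose scores exceed a chosen threshold would then be escalated for analyst review.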

NOTE

Risk scores do not include probability values. You learn how to calculate risk in Chapter 7, “Vulnerability Management.”

There are other methods to improve network awareness beyond correlating events. Let’s look at additional ways to improve data through data-enrichment sources.

Data Enrichment

Data enrichment refers to the practice of adding additional context to the data that you receive. Common examples of enrichment sources include the following:

WHOIS information, allowing you to tap into further contextual information on IP addresses

Reputation information on domain names, IP addresses and e-mail senders, file hash values, and so on

Domain age information

This overlay of knowledge helps you make more informed decisions, increasing the accuracy of your threat-detection processes and tools.

Typically, enrichment is applied to post-parsed messages just before the data is stored, or processed in real time or offline. This can sometimes help security products save processing power by blocking known attacks, such as sources with a negative reputation, at a preprocessing stage. Figure 2-7 shows a sample enrichment process. The figure also shows that enrichment information can be acquired in real time or from an existing cache.
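The following sketch shows enrichment from a local cache applied to an already-parsed event; the cache contents and field names are invented for illustration, standing in for real-time reputation and WHOIS lookups:

```python
# A local cache standing in for real-time reputation and WHOIS services.
REPUTATION_CACHE = {"203.0.113.5": "bad"}
WHOIS_CACHE = {"203.0.113.5": {"org": "ExampleNet", "country": "NL"}}

def enrich(event):
    # Add contextual fields to an already-parsed event.
    src = event["src_ip"]
    event["src_reputation"] = REPUTATION_CACHE.get(src, "unknown")
    event["src_whois"] = WHOIS_CACHE.get(src, {})
    return event

parsed = {"src_ip": "203.0.113.5", "dst_ip": "192.168.1.10"}
enriched = enrich(parsed)
print(enriched["src_reputation"])  # → bad
```

Because the enriched fields are attached before storage, later correlation rules and analyst searches can filter directly on reputation or WHOIS attributes.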

Big Data Platforms for Security

Using relational databases to store and query data does not scale well, and this is becoming a serious problem for organizations as information requirements continue to grow. The solution is to use big data platforms that can accept, store, and process large amounts of data. In the context of security, big data platforms should not only scale in terms of storing and retrieving large amounts of data but also support the services offered by traditional log management and SIEM tools. This hybrid of capabilities and storage is critical for storing, processing, and analyzing big data in real time or on demand.

Most of today’s big data platforms are based on Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. Three components form the core of Apache Hadoop: the Hadoop Distributed File System (HDFS), the distributed storage system at the heart of the platform; YARN, a framework for job scheduling and cluster resource management; and MapReduce, a YARN-based system for parallel processing of large data sets. In addition, many Hadoop-related projects deliver services on top of Hadoop’s core services.
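The MapReduce model itself can be illustrated without a Hadoop cluster. The following toy sketch simulates the map, shuffle, and reduce phases in plain Python to count events per source IP; it is a stand-in for the programming model, not the Hadoop API:

```python
from collections import defaultdict

# Invented sample records; in Hadoop these would be blocks of log data in HDFS.
log_lines = [
    "203.0.113.5 GET /login",
    "203.0.113.5 GET /admin",
    "198.51.100.7 GET /",
]

# Map phase: emit (key, 1) pairs, one per record.
mapped = [(line.split()[0], 1) for line in log_lines]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # → {'203.0.113.5': 2, '198.51.100.7': 1}
```

In a real cluster, the map and reduce functions run in parallel across many nodes, with YARN scheduling the work and HDFS holding the input and output.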

Open source log management and processing tools are starting to present themselves as viable replacements for legacy SIEM tools. This is the case not only for storage and offline processing of data but also for real-time processing using, for example, Apache Storm. Figure 2-8 shows the architecture of the Cisco OpenSOC platform, which is based largely on a number of Apache projects, allowing data of various formats to be collected, stored, processed (online and offline), and reported on.