Anonymize data

You might need to anonymize, or mask, sensitive personal information from the data that you index into Splunk Enterprise, such as credit card or Social Security numbers. You can anonymize parts of confidential fields in events to protect privacy while providing enough remaining data for use in event tracking. You can configure Splunk Enterprise indexers or heavy forwarders to anonymize data as it arrives and before the software indexes it.

There are two ways to anonymize data with Splunk Enterprise:

With a sed script. This method is easier to do, takes less time to configure, and is slightly faster, but it limits how many times you can invoke it and what it can do. For instructions on this method, see "Anonymize data with a sed script" in this topic.

With a regular expression (regex) transform. This method takes longer to configure, but is easier to modify after the initial configuration and can be assigned to multiple data inputs more easily. For instructions on this method, see "Anonymize data with a regular expression transform" in this topic.

To anonymize data with Splunk Cloud, you must configure a Splunk Enterprise instance as a heavy forwarder and anonymize the incoming data with that instance before sending it to Splunk Cloud. You can follow the instructions in this topic on the heavy forwarder.

Key points to anonymizing data

Before you can anonymize data, you must select a set of events to anonymize.

You use props.conf to select the events to anonymize.

You then use props.conf to anonymize the events with a sed script.

Or, you use props.conf and transforms.conf to anonymize the events with a regular expression transform.

Select events to anonymize

You can anonymize event data based on whether the data comes from a specific source or host, or is tagged with a specific source type. You must specify which method to use in props.conf. The stanza name that you specify in props.conf determines how Splunk Enterprise selects and processes events for anonymization.

[host::<host>] matches events that contain the specified host.

[source::<source>] matches events with the specified source.

[<sourcetype>] matches events with the specified source type. You must specify the source type in inputs.conf for this stanza type to work. This option is a Splunk best practice.
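
As a rough illustration of the three stanza types, the following props.conf fragment shows one stanza header of each kind. The host name and file path are hypothetical placeholders, and the comments mark where the anonymization settings would go.

```ini
# props.conf (sketch; the host name and path are hypothetical)

# Selects events by host
[host::appserver01]
# anonymization settings for this host go here

# Selects events by source
[source::/opt/appserver/logs/accounts.log]
# anonymization settings for this source go here

# Selects events by a source type that is assigned in inputs.conf
[SSN-CC-Anon]
# anonymization settings for this source type go here
```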

Replace strings in events with SEDCMD

You can use the SEDCMD method to replace strings or substitute characters.
The syntax for a sed replace is:

SEDCMD-<class> = s/<regex>/<replacement>/flags

The SEDCMD command has the following components:

regex is a Perl language regular expression.

replacement is a string to replace the regular expression match.

flags can be either the letter g to replace all matches or a number to replace a specified match.
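
A minimal sketch of a replace script, assuming events that contain a string in the form password=<value> with no spaces in the value. The source type name my_app_log and the class name mask_password are hypothetical.

```ini
# props.conf (sketch; the stanza and class names are hypothetical)
[my_app_log]
# Replace every password value in the event with asterisks.
# The g flag applies the replacement to all matches, not only the first.
SEDCMD-mask_password = s/password=\S+/password=********/g
```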

Substitute characters in events with SEDCMD

The syntax for a sed character substitution is:

SEDCMD-<class> = y/<string1>/<string2>/

This substitutes each occurrence of the characters in string1 with the characters in string2.

Use a regular expression transform with transforms.conf to anonymize events

Each stanza in transforms.conf defines a transform class that you can reference from props.conf for a given source type, source, or host.

Transforms have several settings and variables that let you specify what changes and where, but the following are the most important:

The REGEX setting specifies the regular expression that points to the string in the event that you want to anonymize.

The FORMAT setting specifies the masked values.

The $1 variable represents the text that the first capture group in REGEX matches. In an anonymizing transform, this is typically the text of the event before the string that you want to mask.

The $2 variable represents the text that the second capture group in REGEX matches, typically the text of the event after the masked string.

DEST_KEY = _raw says to write the value from FORMAT to the raw value in the log. This anonymizes the event.

The regular expression processor does not handle multiline events. In cases where events span multiple lines, specify that the event is multiline by placing (?m) before the regular expression in transforms.conf.
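
The settings described in this section fit together roughly as in the following sketch. It assumes events that contain a SessionId=<value> string followed by an ampersand or a double quotation mark. The stanza name session-anonymizer and the class name anonymize are illustrative, and MyAppServer-Anon is the source type from the regular expression transform example in this topic.

```ini
# transforms.conf (sketch; the stanza name is illustrative)
[session-anonymizer]
# Group 1 captures the text before the session ID.
# Group 2 captures the last four characters of the ID and the rest of the event.
REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
# Rebuild the event with the leading part of the session ID masked.
FORMAT = $1SessionId=########$2
# Write the rebuilt text back to the raw event.
DEST_KEY = _raw

# props.conf (sketch)
[MyAppServer-Anon]
# Apply the transform defined above to events with this source type.
TRANSFORMS-anonymize = session-anonymizer
```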

Anonymize data with a sed script

You can anonymize data by using a sed script to replace or substitute strings in events.

Sed is a *nix utility that reads a file and modifies the input based on commands that you use within or arguments that you supply to the utility. Many *nix users use the utility for fast transformation of incoming data because the utility is so versatile. Splunk Enterprise lets you use a sed-like syntax in props.conf to script the masking of your data.

You can use inputs.conf and props.conf to change the data that comes in from accounts.log as Splunk Enterprise accesses it. These configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory.

Configure inputs.conf to use a sed script

In this example, you create the source type SSN-CC-Anon, and assign it to the data input for accounts.log. The transform that you create uses this source type to know what data to transform. While there are other options available for using SEDCMD to transform incoming data from a log file, best practice is to create a source type, then assign the transform to that source type in props.conf.

On the machine that runs Splunk Enterprise, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.

Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.

Add the following stanza to reference accounts.log and assign a source type to the accounts.log data.

[monitor:///opt/appserver/logs/accounts.log]
sourcetype = SSN-CC-Anon

Save the file and close it.

Define the sed script in props.conf

In this example, props.conf uses the SEDCMD setting to perform the transformation directly.

The "-Anon" clause after the "SEDCMD" stem can be any string that helps you identify what the transformation script does. The clause must exist because it and the SEDCMD stem form the class name for the script. The text after the equal sign (=) is the sed script that performs the transformation.

On the machine that runs Splunk Enterprise, create a props.conf in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.

Open $SPLUNK_HOME/etc/system/local/props.conf with a text editor.

Add the following stanza to apply the sed script to events that have the SSN-CC-Anon source type.
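
A sketch of what such a stanza could look like, assuming that events in accounts.log carry values in the form ssn=<nine digits> and cc=<four groups of four digits separated by hyphens>. The patterns are illustrative assumptions about the data, not a definitive rule set, so adapt them to your events.

```ini
# props.conf (sketch; the patterns assume the event formats described above)
[SSN-CC-Anon]
# Two sed expressions separated by a space.
# The first keeps only the last four digits of the Social Security number.
# The second keeps only the last group of credit card digits.
SEDCMD-Anon = s/ssn=\d{5}(\d{4})/ssn=xxxxx\1/g s/cc=(\d{4}-){3}(\d{4})/cc=xxxx-xxxx-xxxx-\2/g
```

Save the file and restart Splunk Enterprise for the change to take effect on newly indexed data.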

Anonymize data with a regular expression transform

You can mask data by creating a transform. Transforms take incoming data and change it based on configurations you supply. In this case, the transformation is the replacement of portions of the data with characters that obscure the real, sensitive data, while retaining the original data format.

Prerequisites for anonymizing data with a regular expression transform

To mask sensitive data, you need the following items:

Data that you want to anonymize

An understanding of how regular expressions work

An inputs.conf file, with a configuration that tells Splunk Enterprise where this data is located

A transforms.conf file that does the data masking

A props.conf file that references the transforms.conf file for the data that you want to mask

For example, if you have an application server log file called MyAppServer.log that contains events like the following:

Use the inputs.conf, props.conf, and transforms.conf files to change the data that comes in from MyAppServer.log as Splunk Enterprise accesses it. All of these configuration files reside in the $SPLUNK_HOME/etc/system/local/ directory.

Configure inputs.conf

In this example, you create the MyAppServer-Anon source type. The transform you create uses this source type to know what data to transform. Other options for selecting the data to transform are described in "Select events to anonymize" in this topic.

On the machine that runs Splunk Enterprise, create an inputs.conf file in the $SPLUNK_HOME/etc/system/local directory. If the file already exists, proceed to the next step.

Open $SPLUNK_HOME/etc/system/local/inputs.conf with a text editor.

Add the following stanza to reference MyAppServer.log and assign a source type to the MyAppServer.log data.
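
A sketch of such a stanza, assuming the log file is located under /opt/MyAppServer/logs. That path is hypothetical, so substitute the actual location of MyAppServer.log.

```ini
# inputs.conf (sketch; the monitored path is hypothetical)
[monitor:///opt/MyAppServer/logs/MyAppServer.log]
sourcetype = MyAppServer-Anon
```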

Example

You have a file you want to index, abc.log, and you want to substitute the capital letters "A", "B", and "C" for every lowercase "a", "b", or "c" in your events. Add the following stanza and settings to your props.conf:

[source::.../abc.log]
SEDCMD-abc = y/abc/ABC/

Splunk Enterprise substitutes "A" for each "a", "B" for each "b", and "C" for each "c". When you search for source="*/abc.log", the lowercase letters "a", "b", and "c" do not appear in your data.

Caveats for anonymizing data

Restrictions for using the sed script to anonymize data

If you use the SEDCMD method to anonymize the data, the following restrictions apply:

The SEDCMD script applies only to the _raw field at index time. With the regular expression transform, you can apply changes to other fields.

You cannot use more than one SEDCMD type transformation for the same host, source, or source type in a single props.conf file.

Restrictions for using the regular expression transform to anonymize data

If you use the regular expression transform to anonymize data, the following restriction applies: include the LOOKAHEAD setting when you define the transform, and set it to a number that is larger than the largest expected event. Otherwise, anonymization could fail.
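
As an illustration, the following sketch adds LOOKAHEAD to a transform. The stanza name, the regular expression, and the value 16384 are assumptions. Choose a value, in characters, that is larger than your largest expected event.

```ini
# transforms.conf (sketch; names and values are illustrative)
[session-anonymizer]
REGEX = (?m)^(.*)SessionId=\w+(\w{4}[&"].*)$
FORMAT = $1SessionId=########$2
DEST_KEY = _raw
# Number of characters into the event that the regular expression searches
LOOKAHEAD = 16384
```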

Splunk indexers do not parse structured data

When you forward structured data to an indexer, the indexer does not parse it, even if you configured props.conf on that indexer with the INDEXED_EXTRACTIONS setting. Forwarded data skips the following queues on the indexer, which precludes data parsing:

parsing

aggregation

typing

The forwarded data must arrive at the indexer already parsed. To achieve this, you must set up props.conf on the forwarder that sends the data. This includes configuring the INDEXED_EXTRACTIONS setting and any other parsing, filtering, anonymizing, and routing rules.
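
For instance, a props.conf file on the forwarder might contain a stanza like the following sketch. The source type name and the choice of CSV are hypothetical.

```ini
# props.conf on the forwarder (sketch; the source type name is hypothetical)
[my_structured_data]
# Parse the structured file on the forwarder, because the indexer does not parse it
INDEXED_EXTRACTIONS = csv
```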

Comments

Under the heading "Use a regular expression transform with transforms.conf to anonymize events", $1 = text of the event before the regular expression and $2 = text of the event after the regular expression.

However, in transforms.conf doc page, for the FORMAT attr, under index-time extractions, special identifier $0 represents what was in the DEST_KEY before the
REGEX was performed, and $1, $2,... are each regex matching group, in order.

A clarification and possible correction is needed on the $0 and "text of the event before the regular expression"
