Archive indexed data

You can configure the indexer to archive your data automatically as it ages; specifically, at the point when it rolls to "frozen". To do this, you configure indexes.conf.

Caution: By default, the indexer deletes all frozen data. It removes the data from the index at the moment it becomes frozen. If you need to keep the data around, you must configure the indexer to archive the data before removing it. You do this by either setting the coldToFrozenDir attribute or specifying a valid coldToFrozenScript in indexes.conf.

How the indexer archives data

The indexer rotates old data out of the index based on your data retirement policy, as described in Set a retirement and archiving policy. Data moves through several stages, which correspond to file directory locations. Data starts out in the hot database, located as subdirectories ("buckets") under $SPLUNK_HOME/var/lib/splunk/defaultdb/db/. It then moves to the warm database, also located as subdirectories under $SPLUNK_HOME/var/lib/splunk/defaultdb/db/. Eventually, data ages into the cold database, located under $SPLUNK_HOME/var/lib/splunk/defaultdb/colddb/.

Finally, data reaches the frozen state. This can happen for a number of reasons, as described in Set a retirement and archiving policy. At this point, the indexer erases the data from the index. If you want the indexer to archive the frozen data before erasing it from the index, you must specify that behavior.

The archiving behavior depends on which of these indexes.conf attributes you set:

coldToFrozenDir. This attribute specifies a location where the indexer automatically archives frozen data.

coldToFrozenScript. This attribute specifies a user-supplied script that the indexer runs when the data is frozen. Typically, this is a script that archives the frozen data, but the script can serve some other purpose altogether. The indexer ships with one example archiving script that you can edit and use ($SPLUNK_HOME/bin/coldToFrozenExample.py), but you can specify any script you want the indexer to run.

Note: Set only one of these attributes. If both are set, the coldToFrozenDir attribute takes precedence over coldToFrozenScript.

If you don't specify either of these attributes, the indexer runs a default script that simply writes the name of the bucket being erased to the log file $SPLUNK_HOME/var/log/splunk/splunkd_stdout.log. It then erases the bucket.

Let the indexer archive the data for you

If you set the coldToFrozenDir attribute in indexes.conf, the indexer will automatically copy frozen buckets to the specified location before erasing the data from the index.

Add this stanza to $SPLUNK_HOME/etc/system/local/indexes.conf:

[<index>]
coldToFrozenDir = <path to frozen archive>

Note the following:

<index> specifies which index contains the data to archive.

<path to frozen archive> specifies the directory where the indexer will put the archived buckets.

Note: When you use Splunk Web to create a new index, you can also specify a frozen archive path for that index. See Create custom indexes for details.

How the indexer archives the frozen data depends on whether the data was originally indexed in a pre-4.2 release:

For buckets created in version 4.2 and later, the indexer removes all files except for the rawdata file.

For pre-4.2 buckets, the indexer simply gzips all the .tsidx and .data files in the bucket.

This difference is due to a change in the format of rawdata. Starting with 4.2, the rawdata file contains all the information the indexer needs to reconstitute an index bucket.

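Specify an archiving script

If you set the coldToFrozenScript attribute in indexes.conf, the indexer runs the script you specify when the data is frozen, before it erases the data from the index. You must supply the script yourself. The indexer ships with an example archiving script that you can edit, $SPLUNK_HOME/bin/coldToFrozenExample.py.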

Note: If you use the example script, edit it to specify the archive location for your installation. Also, rename the script or move it to another location so that your changes are not overwritten when you upgrade the indexer. This is an example script. Do not apply it to a production instance without editing it to suit your environment and testing it extensively.

The example script archives the frozen data differently, depending on whether the data was originally indexed in a pre-4.2 release:

For buckets created in version 4.2 and later, the script removes all files except for the rawdata file.

For pre-4.2 buckets, the script simply gzips all the .tsidx and .data files.

This difference is due to a change in the format of rawdata. Starting with 4.2, the rawdata file contains all the information the indexer needs to reconstitute an index bucket.
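The two behaviors described above can be sketched in Python as follows. This is a minimal illustration, not the contents of coldToFrozenExample.py. It rests on several assumptions that do not come from this topic: that the indexer passes the bucket directory to the script as an argument, that a 4.2+ bucket can be recognized by a rawdata/journal.gz file, and that the index name is the directory two levels above the bucket. The function names and the archive layout are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def is_new_style_bucket(bucket: Path) -> bool:
    # Assumption: 4.2+ buckets keep their raw data in rawdata/journal.gz.
    return (bucket / "rawdata" / "journal.gz").is_file()


def prepare_bucket(bucket: Path) -> None:
    """Shrink the bucket in place before it is copied to the archive.

    The bucket is modified directly, on the assumption that the indexer
    erases it from the index once the script finishes.
    """
    if is_new_style_bucket(bucket):
        # 4.2+ bucket: keep only the rawdata directory.
        for entry in list(bucket.iterdir()):
            if entry.name == "rawdata":
                continue
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    else:
        # Pre-4.2 bucket: gzip every .tsidx and .data file.
        for entry in list(bucket.iterdir()):
            if entry.is_file() and entry.suffix in (".tsidx", ".data"):
                gz_path = entry.with_name(entry.name + ".gz")
                with entry.open("rb") as src, gzip.open(gz_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                entry.unlink()


def archive_bucket(bucket_dir: str, archive_root: str) -> Path:
    """Prepare the bucket, then copy it to <archive_root>/<index>/<bucket>."""
    bucket = Path(bucket_dir)
    if not bucket.is_dir():
        raise ValueError(f"not a bucket directory: {bucket_dir}")
    prepare_bucket(bucket)
    # Assumption: the path looks like .../<index>/colddb/<bucket>.
    index_name = bucket.parent.parent.name
    dest = Path(archive_root) / index_name / bucket.name
    shutil.copytree(bucket, dest)  # fails if dest already exists
    return dest
```

A wrapper script would typically call archive_bucket with the bucket path it receives and a site-specific archive root, and exit with a nonzero status if the copy fails.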

As a best practice, make sure the script you create completes as quickly as possible, so that the indexer doesn't end up waiting for the return indicator. For example, if you want to archive to a slow volume, set the script to copy the buckets to a temporary location on the same (fast) volume as the index. Then use a separate script, outside the indexer, to move the buckets from the temporary location to their destination on the slow volume.
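The two-step approach described above might look like the following sketch. The function names, the ".partial" suffix, and the directory layout are all hypothetical. The first function is the fast step that the archiving script would perform. The second is the slow step, meant to run outside the indexer, for example from a scheduler.

```python
import shutil
from pathlib import Path


def stage_bucket(bucket_dir: str, staging_root: str) -> Path:
    """Fast step: copy the bucket to a staging area on the index's volume."""
    bucket = Path(bucket_dir)
    staging = Path(staging_root)
    staging.mkdir(parents=True, exist_ok=True)
    partial = staging / (bucket.name + ".partial")
    final = staging / bucket.name
    shutil.copytree(bucket, partial)
    # Rename only after the copy completes, so the mover below never
    # picks up a half-copied bucket.
    partial.rename(final)
    return final


def drain_staging(staging_root: str, slow_archive_root: str) -> list:
    """Slow step: move every fully staged bucket to the slow volume."""
    archive = Path(slow_archive_root)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(Path(staging_root).iterdir()):
        if not entry.is_dir() or entry.name.endswith(".partial"):
            continue  # skip stray files and copies still in progress
        dest = archive / entry.name
        shutil.move(str(entry), str(dest))
        moved.append(dest)
    return moved
```

Keeping the staging area on the same volume as the index keeps the first step short, which is the point of the best practice. The cost is that the fast volume needs enough free space to hold staged buckets until the mover runs.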

Data archiving and indexer clusters

In an indexer cluster, each individual peer node rolls its buckets to frozen in the same way that a non-clustered indexer does; that is, based on its own set of configurations. Because all peers in a cluster should be configured identically, all copies of a bucket should roll to frozen at approximately the same time.

However, there can be some variance in the timing, because the same index can grow at different rates on different peers. The cluster performs processing to ensure that buckets freeze smoothly across all peers in the cluster. Specifically, it performs processing so that, if a bucket is frozen on one peer but not on another, the cluster does not initiate fix-up activities for that bucket. See How the cluster handles frozen buckets.

The problem of archiving multiple copies

Because indexer clusters contain multiple copies of each bucket, if you archive the data using the techniques described earlier in this topic, you archive multiple copies of the data.

For example, if you have a cluster with a replication factor of 3, the cluster stores three copies of all its data across its set of peer nodes. If you set up each peer node to archive its own data when it rolls to frozen, you end up with three archived copies of the data. You cannot solve this problem by archiving just the data on a single node, since there's no certainty that a single node contains all the data in the cluster.
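The storage cost of this duplication is simple arithmetic. The sketch below assumes that every archived copy of a bucket is about the same size, which is an approximation. The function name and the 500 GB figure are hypothetical.

```python
def archived_volume_gb(unique_data_gb: float, replication_factor: int) -> float:
    """Approximate archive size when every peer archives its own copies."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return unique_data_gb * replication_factor


# Hypothetical cluster: 500 GB of unique frozen data, replication factor 3.
total = archived_volume_gb(500, 3)
redundant = total - 500
print(total, redundant)
```

In this hypothetical case, two thirds of the archive holds redundant copies.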

The solution to this would be to archive just one copy of each bucket on the cluster and discard the rest. However, in practice, it is quite a complex matter to do that. If you want guidance in archiving single copies of clustered data, contact Splunk Professional Services. They can help design a solution customized to the needs of your environment.

Specifying the archive destination

If you choose to take the easy approach and archive multiple copies of the clustered data, you must guard against name collisions. You cannot route the data from all peer nodes into a single archive directory, because multiple, identically named copies of each bucket exist across the cluster (for deployments where the replication factor is 2 or greater), and the entries in a directory must have unique names. Instead, you need to ensure that the buckets from each peer node go to a separate archive directory.

This is difficult to manage if you specify a destination directory in shared storage by means of the coldToFrozenDir attribute in indexes.conf, because the indexes.conf file must be the same across all peer nodes, as discussed in Configure the peer indexes in an indexer cluster. One alternative is to create a script that directs each peer's archived buckets to a separate location on the shared storage, and then use the coldToFrozenScript attribute to specify that script.
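One way such a script could choose a per-peer location is to build the destination path from the peer's host name. This is a sketch under assumptions: the function name and directory layout are hypothetical, and it presumes that each peer has a distinct host name and that the bucket path has the form .../<index>/colddb/<bucket>.

```python
import re
import socket
from pathlib import Path


def peer_archive_dir(shared_root: str, bucket_dir: str, peer_name: str = "") -> Path:
    """Return <shared_root>/<peer>/<index>/<bucket> for this peer's bucket."""
    # Default to the local host name, so the same script can be deployed
    # unchanged to every peer.
    peer = peer_name or socket.gethostname()
    # Replace characters that are awkward in directory names.
    safe_peer = re.sub(r"[^A-Za-z0-9._-]", "_", peer)
    bucket = Path(bucket_dir)
    index_name = bucket.parent.parent.name
    return Path(shared_root) / safe_peer / index_name / bucket.name
```

Because the peer name is part of the path, identically named bucket copies from different peers land in different directories, while the script and indexes.conf stay the same on every peer.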

Comments

Is this correct?

"If you don't specify either of these attributes, the indexer runs a default script that simply writes the name of the bucket being erased to the log file $SPLUNK_HOME/var/log/splunk/splunkd_stdout.log. It then erases the bucket."

I've found that it just writes to the normal splunkd.log not splunkd_stdout.log
