How Splunk stores indexes

As Splunk indexes your data, it creates a number of files. These files contain two types of data:

The raw data in compressed form ("rawdata")

Indexes that point to the raw data ("index files")

Together, these files constitute the Splunk index. The files reside in sets of directories organized by age. Some directories contain newly indexed data; others contain previously indexed data. The number of such directories can grow quite large, depending on how much data you're indexing.

Why you might care

You might not care, actually. Splunk handles indexed data by default in a way that gracefully ages the data through several stages. After a long period of time, typically several years, Splunk removes old data from your system. You might well be fine with the default scheme it uses.

However, if you're indexing large amounts of data, have specific data retention requirements, or otherwise need to carefully plan your aging policy, you should read this topic. Also, to back up your data, it helps to know where to find it. So, read on.

How Splunk ages data

Each of the index directories is known as a bucket. To summarize so far:

A Splunk "index" contains compressed raw data and associated indexes.

A Splunk index resides across many age-designated index directories.

An index directory is a bucket.

A bucket moves through several stages as it ages:

hot

warm

cold

frozen

As buckets age, they "roll" from one stage to the next. Newly indexed data goes into a hot bucket, which is a bucket that's both searchable and actively being written to. After the hot bucket reaches a certain size, it becomes a warm bucket, and a new hot bucket is created. Warm buckets are searchable, but are not actively written to. There are many warm buckets.

Once Splunk has created some maximum number of warm buckets, it begins to roll the warm buckets to cold based on their age: the oldest warm bucket always rolls first. Buckets continue to roll to cold in this manner as they age. After a set period of time, cold buckets roll to frozen, at which point they are either archived or deleted. By editing attributes in indexes.conf, you can specify the bucket aging policy, which determines when a bucket moves from one stage to the next.
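As a rough sketch of what such an aging policy can look like, the fragment below sets a few of the indexes.conf attributes that govern rolling. The index name "myindex" and all values are illustrative assumptions, not recommendations, and the path attributes are omitted; check the indexes.conf reference for the defaults and exact semantics in your version.

```ini
# Hypothetical index; the name and values are examples only.
[myindex]
# Roll a hot bucket to warm once it reaches this size, in MB.
maxDataSize = 750
# Once this many warm buckets exist, the oldest one rolls to cold.
maxWarmDBCount = 300
# Roll a bucket to frozen once its data is older than this many seconds
# (188697600 seconds is roughly six years).
frozenTimePeriodInSecs = 188697600
```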

Here are the stages that buckets age through:

Bucket stage | Description | Searchable?
Hot | Contains newly indexed data. Open for writing. One or more hot buckets for each index. | Yes
Warm | Data rolled from hot. There are many warm buckets. | Yes
Cold | Data rolled from warm. There are many cold buckets. | Yes
Frozen | Data rolled from cold. Splunk deletes frozen data by default, but you can also archive it. | No
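To illustrate the archiving alternative for frozen data, this fragment uses the coldToFrozenDir attribute of indexes.conf. The index name and archive path are hypothetical; without a setting like this, frozen data is deleted.

```ini
# Hypothetical index and archive location.
[myindex]
# Archive buckets to this directory when they roll to frozen,
# instead of simply deleting them.
coldToFrozenDir = /mnt/archive/myindex/frozen
```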

The collection of buckets in a particular stage is sometimes referred to as a database or "db": the "hot db", the "warm db", the "cold db", etc.

Note: Hot buckets always roll whenever splunkd gets restarted.

What the index directories look like

Each bucket occupies its own subdirectory within a larger database directory. Splunk organizes the directories to distinguish between hot/warm/cold buckets. In addition, the bucket directory names are based on the age of the data.

Here's the directory structure for the default index (defaultdb):

Hot

Default location: $SPLUNK_HOME/var/lib/splunk/defaultdb/db/*

There can be multiple hot subdirectories. Each hot bucket occupies its own subdirectory, which uses this naming convention:

hot_v1_<ID>

Warm

Default location: $SPLUNK_HOME/var/lib/splunk/defaultdb/db/*

There are multiple warm subdirectories. Each warm bucket occupies its own subdirectory, which uses this naming convention:

db_<newest_time>_<oldest_time>_<ID>

where <newest_time> and <oldest_time> are timestamps indicating the age of the data within.

The timestamps are expressed in UTC epoch time (in seconds). For example, db_1223658000_1223654401_2835 is a warm bucket containing one hour of data from October 10, 2008: 16:00:01 through 17:00:00 UTC, which corresponds to 9am-10am Pacific Daylight Time.

Cold

Default location: $SPLUNK_HOME/var/lib/splunk/defaultdb/colddb/*

There are multiple cold subdirectories. When warm buckets roll to cold, they get moved into this directory, but are not renamed.

Configure your indexes

You configure indexes in indexes.conf. You can edit a copy of indexes.conf in $SPLUNK_HOME/etc/system/local/ or in a custom app directory in $SPLUNK_HOME/etc/apps/. Do not edit the copy in $SPLUNK_HOME/etc/system/default. For information on configuration files and directory locations, see "About configuration files".

Table of indexes.conf attributes

This table lists the key indexes.conf attributes affecting buckets and what they configure. It also provides links to other topics (or sections in this topic) that show how to use these attributes. For the most detailed information on these attributes, as well as others, always refer to "indexes.conf".

Note: You can also configure the path to your indexes in Splunk Manager. Go to Splunk Manager > System settings > General settings. Under the section Index settings, set the field Path to indexes.

Use multiple partitions for index data

Splunk can use multiple disks and partitions for its index data. It's possible to configure Splunk to use many disks/partitions/filesystems on the basis of multiple indexes and bucket types, so long as you mount them correctly and point to them properly from indexes.conf. However, we recommend that you use a single high performance file system to hold your Splunk index data for the best experience.

If you do use multiple partitions, the most common way to arrange Splunk's index data is to keep the hot/warm buckets on the local machine, and to put the cold buckets on a separate array or set of disks (for longer term storage). You'll want to run your hot/warm buckets on a machine with fast read/write partitions, since most searching will happen there. Cold buckets should be located on a reliable array of disks.

Configure multiple partitions

1. Set up partitions just as you'd normally set them up in any operating system.

2. Mount the disks/partitions.

3. Edit indexes.conf to point to the correct paths for the partitions. You set paths on a per-index basis, so you can also set separate partitions for different indexes. Each index has its own [<index>] stanza, where <index> is the name of the index. These are the settable path attributes:

homePath = <path on server>

This is the path that contains the hot and warm databases for the index.

Caution: The path must be writable.

coldPath = <path on server>

This is the path that contains the cold databases for the index.

Caution: The path must be writable.

thawedPath = <path on server>

This is the path that contains any thawed databases for the index.
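As a minimal sketch of the three path attributes described above, the stanza below places hot/warm buckets on one mount point and cold and thawed buckets on another. The index name and the mount points are assumptions made up for illustration.

```ini
# Hypothetical index and mount points; adjust to your own layout.
[myindex]
# Hot and warm buckets on fast local storage.
homePath = /mnt/fast_disk/myindex/db
# Cold buckets on a separate, larger array.
coldPath = /mnt/big_disk/myindex/colddb
# Thawed buckets (restored from an archive).
thawedPath = /mnt/big_disk/myindex/thaweddb
```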

How to configure index storage size

You can configure index storage size in a number of ways:

On a per-index basis

For hot/warm and cold buckets separately

Across indexes, with volumes

Caution: While processing indexes, Splunk might occasionally exceed the configured maximums for short periods of time. When setting limits, be sure to factor in some buffer space. Also, note that certain systems, such as most Unix systems, maintain a configurable reserve space on their partitions. You must take that reserve space, if any, into account when determining how large your indexes can grow.

Configure index size for each index

To set the maximum index size on a per-index basis, use the maxTotalDataSizeMB attribute. When this limit is reached, buckets begin rolling to frozen.
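For instance, a per-index cap might look like the following fragment. The index name and the size are illustrative assumptions only.

```ini
# Hypothetical index; the size is an example value.
[myindex]
# Cap the whole index (hot, warm, and cold buckets) at 250,000 MB.
maxTotalDataSizeMB = 250000
```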

Configure index size according to bucket type

To set the maximum size for homePath (hot/warm bucket storage) or coldPath (cold bucket storage), use the homePath.maxDataSizeMB and coldPath.maxDataSizeMB attributes, respectively.

The maxDataSizeMB attributes can be set globally or for each index. An index-level setting will override a global setting. maxVolumeDataSizeMB can be used with volumes, described below, to control bucket storage across groups of indexes.
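A short sketch of the per-bucket-type limits, set here at the index level. The index name and sizes are assumptions chosen for illustration.

```ini
# Hypothetical index; sizes are example values.
[myindex]
# Limit hot/warm bucket storage to 10,000 MB.
homePath.maxDataSizeMB = 10000
# Limit cold bucket storage to 5,000 MB.
coldPath.maxDataSizeMB = 5000
```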

Configure index size with volumes

You can manage disk usage across multiple indexes by creating volumes and specifying maximum data size for them. A volume represents a directory on the file system where indexed data resides.

Volumes can store data from multiple indexes. You would typically use separate volumes for hot/warm and cold buckets. For instance, you can set up one volume to contain the hot/warm buckets for all your indexes, and another volume to contain the cold buckets.

Note: Volumes are only for homePath and coldPath. This feature does not work for thawedPath.

Setting up a volume

To set up a volume, use this syntax:

[volume:<volume_name>]
path = <pathname_for_volume>

You can also optionally include a maxVolumeDataSizeMB attribute, which specifies the maximum size for the volume.

For example:

[volume:hot1]
path = /mnt/fast_disk
maxVolumeDataSizeMB = 100000

The example defines a volume called "hot1", located at /mnt/fast_disk, with a maximum size of 100,000MB.

Similarly, this stanza defines a volume called "cold1" that uses a maximum of 150,000MB:

[volume:cold1]
path = /mnt/big_disk
maxVolumeDataSizeMB = 150000

Using a volume

You can now define an index's homePath and coldPath in terms of volumes, such as the "hot1" and "cold1" volumes defined above.
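As a sketch, two indexes that place their hot/warm buckets on the "hot1" volume and their cold buckets on the "cold1" volume might be defined as follows. The index names idx1 and idx2 and the subdirectory names are hypothetical; thawedPath uses an ordinary path because it cannot refer to a volume.

```ini
# Hypothetical indexes built on the hot1 and cold1 volumes.
[idx1]
homePath = volume:hot1/idx1
coldPath = volume:cold1/idx1
thawedPath = $SPLUNK_DB/idx1/thaweddb

[idx2]
homePath = volume:hot1/idx2
coldPath = volume:cold1/idx2
thawedPath = $SPLUNK_DB/idx2/thaweddb
```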

You can use volumes to manage index storage space in any way that makes sense to you. Usually, however, volumes correlate to hot/warm and cold buckets, because of the different storage requirements typical when dealing with different bucket types. So, you will probably use some volumes exclusively for designating homePath (hot/warm buckets) and others for coldPath (cold buckets).

When a volume containing warm buckets reaches its maxVolumeDataSizeMB, it starts rolling buckets to cold. When a volume containing cold buckets reaches its maxVolumeDataSizeMB, it starts rolling buckets to frozen. If a volume contains both warm and cold buckets (which will happen if an index's homePath and coldPath are both set to the same volume), the oldest bucket will be rolled to frozen.

Putting it all together

You can use the per-index homePath.maxDataSizeMB and coldPath.maxDataSizeMB attributes in combination with volumes to maintain fine-grained control over index storage. In particular, this prevents bursts of data into one index from triggering massive bucket moves from other indexes: the per-index settings ensure that no index ever occupies more than a specified size.
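One way such a combined configuration could look is sketched below. All volume names, index names, paths, and sizes are assumptions chosen for illustration, not recommended values.

```ini
# Global setting, inherited by every index unless overridden:
# no index's hot/warm storage may exceed 50,000 MB.
homePath.maxDataSizeMB = 50000

# Volumes (names and paths are hypothetical).
[volume:fast]
path = /mnt/fast_disk
maxVolumeDataSizeMB = 100000

[volume:big]
path = /mnt/big_disk
maxVolumeDataSizeMB = 1000000

# A steady index: inherits the global hot/warm limit.
[steady_index]
homePath = volume:fast/steady_index
coldPath = volume:big/steady_index
thawedPath = $SPLUNK_DB/steady_index/thaweddb

# A bursty index: tighter per-index caps, so a sudden burst of data
# rolls this index's own buckets rather than filling the shared volumes.
[bursty_index]
homePath = volume:fast/bursty_index
homePath.maxDataSizeMB = 10000
coldPath = volume:big/bursty_index
coldPath.maxDataSizeMB = 100000
thawedPath = $SPLUNK_DB/bursty_index/thaweddb
```

With limits like these, the bursty index reaches its own caps well before either volume fills, so the steady index's buckets are not forced to roll early.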

Troubleshoot your buckets

This section tells you how to deal with an assortment of bucket problems.

Recover after a crash

Splunk usually handles crash recovery without your intervention. If an indexer goes down unexpectedly, some recently received data might not be searchable. When you restart Splunk, it will automatically run the fsck command in the background. This command diagnoses the health of your buckets and rebuilds search data as necessary.

It is highly unlikely that you will need to run fsck manually. This is a good thing, because to run it manually, you must stop Splunk, and the command can take several hours to complete if your indexes are large. During that time your data will be inaccessible. However, if Splunk Support directs you to run it, the rest of this section tells you how to do so. (Also, you will need to run fsck manually to perform recovery for any 4.2.x indexers. Only Splunk indexers at version 4.3 or above run it automatically.)

To run fsck manually, you'll need to first stop Splunk. Then run fsck against any affected buckets. To run fsck against buckets in all indexes, use this command:

splunk fsck --repair --all

This will rebuild all types of buckets (hot/warm/cold/thawed) in all indexes.

Note: The fsck command only rebuilds buckets created by version 4.2 or later of Splunk.

To learn more about the fsck command, including a list of all options available, enter:

splunk fsck

Warning: The fsck --repair command can take as long as several hours to run, depending on the size of your indexes. That's why you want to let Splunk run it in the background automatically, if possible. Also, if you can determine that you only need to rebuild a few buckets, you can run the rebuild command on just those buckets, as described in the next section, "Rebuild a bucket."

If you just want to diagnose the state of your indexes (without taking any immediate remedial action), run fsck without the --repair flag:

splunk fsck --all

Rebuild a bucket

If the index and metadata files in a bucket (version 4.2 and later) somehow get corrupted, you can rebuild the bucket from the raw data file alone. Use this command:

splunk rebuild <bucket directory>

Splunk automatically deletes the old index and metadata files and rebuilds them. You don't need to delete any files yourself.

Important: You must stop Splunk before running the rebuild command.

A few notes:

Rebuilding a bucket does not count against your license.

The time required to rebuild a bucket is slightly less than the time required to index the same data initially.

Recover invalid pre-4.2 hot buckets

A hot bucket becomes an invalid hot (invalid_hot_<ID>) bucket when Splunk detects that the metadata files (Sources.data, Hosts.data, SourceTypes.data) are corrupt or incorrect. Incorrect data usually signifies incorrect time ranges; it can also mean that event counts are incorrect.

Splunk ignores invalid hot buckets. Data does not get added to such buckets, and they cannot be searched. Invalid buckets also do not count when determining bucket limit values such as maxTotalDataSizeMB. This means that invalid buckets do not negatively affect the flow of data through the system, but it also means that they can result in disk storage that exceeds the configured maximum value.

Rebuild index-level bucket manifests

You will rarely need to rebuild index-level manifests, but if you do, Splunk provides a few commands for that purpose.

Caution: You should only use these commands if Splunk Support directs you to. Do not rebuild the manifests on your own.

The two index-level manifest files are .bucketManifest and .metaManifest. The .bucketManifest file contains a list of all buckets in the index. You might need to rebuild this if, for example, you manually copy a bucket into an index. The .metaManifest file contains a list of buckets that have contributed to the index-level metadata file.

The following command rebuilds the .bucketManifest and .metaManifest files and all *.data files in the homePath for the main index only. It does not rebuild metadata for individual buckets.
