Tuning Bloom filters

DataStax Enterprise uses Bloom filters to determine whether an SSTable has data for a
particular partition. Bloom filters are unused for range scans, but are used for index scans.
Bloom filters are probabilistic sets that let you trade memory for accuracy: higher values of the
bloom_filter_fp_chance table attribute use less memory, but result in more disk I/O if the
SSTables are highly fragmented. Bloom filter settings range from 0 to 1.0, where 1.0 disables the
filter. The default value of bloom_filter_fp_chance depends on the compaction strategy.
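
The memory-for-accuracy trade-off can be illustrated with a toy Bloom filter. The following is a minimal sketch, not the DataStax Enterprise implementation: the BloomFilter class, the SHA-256 based hashing, and the sizes used are all assumptions chosen for illustration. It shows that a filter never misses a key that was added, and that giving the filter fewer bits raises its false positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only, not the DSE implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def false_positive_rate(num_bits, num_hashes=3, num_keys=1000, num_probes=1000):
    """Insert num_keys keys, then probe with keys that were never added."""
    bf = BloomFilter(num_bits, num_hashes)
    for i in range(num_keys):
        bf.add(f"key-{i}")
    # A Bloom filter never reports a false negative.
    assert all(bf.might_contain(f"key-{i}") for i in range(num_keys))
    hits = sum(bf.might_contain(f"absent-{i}") for i in range(num_probes))
    return hits / num_probes


small = false_positive_rate(num_bits=2_000)    # little memory, many false positives
large = false_positive_rate(num_bits=20_000)   # more memory, few false positives
print(f"2,000 bits:  observed false positive rate {small:.3f}")
print(f"20,000 bits: observed false positive rate {large:.3f}")
```

In this sketch the filter size is chosen directly. In DataStax Enterprise you instead set the target false positive chance with bloom_filter_fp_chance, and the filter is sized accordingly.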

The LeveledCompactionStrategy (LCS) uses a higher default value (0.1) than the
SizeTieredCompactionStrategy (STCS), which has a default of 0.01. Memory savings are nonlinear;
going from 0.01 to 0.1 saves about one third of the memory. SSTables using LCS contain a
relatively smaller range of keys than those using STCS, which facilitates efficient exclusion of
SSTables even without a Bloom filter; however, adding a small Bloom filter helps when there are
many levels in LCS.

The settings you choose depend on the type of workload. For example, to run an analytics
application that heavily scans a particular table, you would want to inhibit the Bloom filter on
the table by setting bloom_filter_fp_chance to a high value.

To view the observed Bloom filter false positive rate and the number of SSTables consulted per
read, use the tablestats command in the nodetool utility.
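
As a rough sketch of how the Bloom filter figures could be pulled out of that report, the helper below keeps only the lines whose label starts with "Bloom filter". The function name bloom_filter_stats is hypothetical, the label text and "label: value" layout are assumptions about the tablestats output, and the sample text contains made-up values for illustration only. Check the output of your own version before relying on it.

```python
def bloom_filter_stats(tablestats_output):
    """Return the 'Bloom filter ...' entries of a tablestats report as a dict.

    Assumes each statistic is printed on its own line as 'label: value'.
    """
    stats = {}
    for line in tablestats_output.splitlines():
        label, sep, value = line.strip().partition(":")
        if sep and label.startswith("Bloom filter"):
            stats[label] = value.strip()
    return stats


# Made-up sample text for illustration; these are not real measurements.
sample = """\
        Table: example_table
        SSTable count: 4
        Bloom filter false positives: 12
        Bloom filter false ratio: 0.00310
"""

for label, value in bloom_filter_stats(sample).items():
    print(f"{label} = {value}")
```

To use it against a live node, capture the text printed by nodetool tablestats (for example with subprocess.run and capture_output=True) and pass it to the function in place of the sample.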

Bloom filters are stored off-heap, so you don't need to include them when determining the -Xmx
setting (the maximum memory size that the heap can reach for the JVM).

Tip: If the SSTables are already on the current version, the nodetool upgradesstables command
returns immediately and no action is taken. You must use the -a command argument to force the
SSTable upgrade.

You do not have to restart DataStax Enterprise after regenerating SSTables.