Tests to ensure that your cluster's memory, CPU, disks, number of nodes, and network settings meet your business needs.

Testing your cluster before production

Tests to ensure that your cluster's memory, CPU, disks, number of nodes, and network
settings meet your business needs.

To ensure that your selections of memory, CPU, disks, number of nodes, and network settings
meet your business needs, test your cluster prior to production. By simulating your production
workload, you can avoid unpleasant surprises and ensure successful production deployment.

Your testing workload should match the production workload as closely as possible. Variances
from production in testing must fully account for any differences between workloads.

Considerations if you're coming from a relational database world

Cassandra-based applications and clusters are much different than relational
databases and use a data model based on the types of queries, not on modeling entities and
relationships. Besides learning a new database technology and API, writing an application
that scales on a distributed database instead of a single machine requires a new way of
thinking.

A common strategy new users employ for making the transition from single machine to
distributed node is to make heavy use of materialized views, secondary indexes, and UDF
(user defined functions). It’s important to realize that in most applications at large scale
these techniques are not heavily used, and if they are, allowances must be made for the
performance penalty in node sizing and SLAs.

Service level agreements targets

Before going into production, decide what constitutes fast and slow Service level agreements
(SLA), and how often slow SLAs are allowed to occur.

Table 1. Basic rules for determining request percentiles

Percentile

Use case

100

Not recommended: The worst measurement is often irrelevant to testing and based on
factors external to the database.

99.9

Aggressive: Use only for latency sensitive use cases. Be aware that hardware expenses
will be high if you need to keep this close to average.

99

Sweet spot and the most common.

95

Loose: Use when where latency is not easily noticed this is a good metric to
measure.

The following metrics are based on clusters with more data on disk than RAM. Measurement is in
the 99th percentile using nodetool tablehistograms for DSE 6.7 or DDAC.

Table 2. Latency metrics

Hardware

Latency

Top-of-the-line high CPU nodes with NVM flash storage

< 10ms

Common SATA based SSDs

< 150ms

Factors, such as testing, controller, and contention, can shift this
measurement significantly. This can result in some hardware approaching NVM
specifications.

Common SATA based-systems

500ms is common.

This metric varies significantly due to controller, contention,
data model, and so on. Tuning can shift this figure an order of magnitude.

Establish TPS targets

Determine how many transactions per second (TPS) you expect your cluster to handle. Ideally,
match this number with the same node count in your test cluster. Alternately, test with
ratio of TPS to nodes to evaluate how the cluster behaves.

If the cluster does not meet your SLA requirements, look for the root cause. Helpful questions
include:

Monitor the important metrics

Note: This functionality is not available for the DataStax Distribution of Apache Cassandra™.

For DSE and DDAC, you can also use JMX MBeans or Linux iostat or
dstat commands.

Monitor the following metrics:

System hints

Generated system hints should not be continually high. If a node is down or a datacenter
link is down, expect this metric to accumulate (by design).

Pending compactions

The number of acceptable pending compactions depends on the type of compaction
strategy.

Compaction strategy

Pending compactions

SizedTieredCompactionStrategy (STCS)

Should not exceed 100 for more than short periods.

LeveledCompactionStrategy (LCS)

Not an accurate measure for pending compactions. Generally, if it stays
over 100 is indication of stress. Over 1000 is pathological.

TimeWindowCompactionsStrategy (TWCS)

Pending writes

Should not go over 10,000 writes for more than short periods. Otherwise, a tuning or
sizing problem exists.

heap usage

If huge spikes with long pauses exist, adjust the heap.

I/O wait

Metrics greater than 5 indicates that the CPU is waiting too long on I/O and is a sign
of being I/O bound.

CPU Usage

For CPUs using hyper-threading, it is helpful to remember that for each processor core
that is physically present, the operating system addresses two virtual (logical) cores
and shares the workload between them. This means that if half your threads are at 100%
CPU usage, the CPU capacity is at maximum even if the tool reports 50% usage. To see the
actual thread usage, use the ttop command.

Simulate real outages

Simulating outages is key to ensuring that your cluster functions well in production. To
simulate outages:

Take one or two nodes offline (more is preferable, depending on the cluster size).

Do your queries start failing?

Do the nodes start working without any change when they come back online?

Take a datacenter offline while taking load.

How long can it be offline before
system.hints build up to unmanageable levels?

Run an expensive task that eats up CPU on a couple of nodes, such as find "a /

This
scenario helps you prepare for the effect on your nodes in case this happens.

Do expensive range queries in CQLSH while taking load to see how the cluster behaves. For
example:

PAGING OFF;
SELECT * FROM FOO LIMIT 100000000;

This test
ensures your cluster is ready for naive or unintended actions by users.

Simulate production operations

Do the following during load testing and repeat under outages as described
above:

Run repair to ensure that you have enough overhead to complete the repair.

Bootstrap a new node into the cluster.

Add a new datacenter. This exercise gives you an idea of the process when you need to do it
in production.

Change the schema by adding new tables, removing tables, and altering existing tables.

Replace a failed node. Note how long it takes to bootstrap a new node.

Run a real replication factor on at least 5 nodes

Ideally run the same number of nodes that you expect to run in production. Moreover, if you’re
planning on running only 3 nodes in production, test with 5. You may find that the queries
that were fast with 3 nodes are much slower with 5 nodes. In this scenario, secondary
indexes as a common query path are exposed as being slower at scale.

Be sure to use the same replication factor as in production.

Simulate the production datacenter setup

Each datacenter adds additional write costs. If your production cluster has more datacenters
than your test cluster, your writes might be greater than you anticipate. For example, if you
use RF 3 and test your cluster with 2 datacenters, each write goes to 4 nodes in 2 data centers.
If your production cluster uses the same RF and has 5 datacenters, each write goes to 7 nodes
with 5 data centers. This means writes are now almost twice as expensive on the coordinator
node because in each remote datacenter, the forwarding coordinator sends the data to the other
replicas in it’s datacenter.

Simulate the network topology

Simulate your network topology with the same latency and pipe. Test with the expected TPS. Be
aware that remote links can be a limiting factor for many use cases and expected bandwidth.

Once you’ve determined your target per node density, add data to the test cluster until the
target capacity is reached. Test for the following:

What happens to your SLAs as the target density is approached?

What happens if you go over your the density?

How long does bootstrap take?

How long does with repair take?

Use the cassandra-stress tool for testing

The cassandra-stress tool for DSE and DDAC is a Java-based stress testing utility for cluster load
testing and basic benchmarking. Use this tool to test hardware choices and discover issues
with data modeling. The cassandra-stress tool also supports a YAML-based
profile for testing reads, writes, and mixed workloads, plus the ability to specify schemas
with various compaction strategies, cache settings, and types. Be sure to test common
administrative operations, such as bootstrap, repair, and failure.

DSE Search stress demo

Note: This functionality is not available for the DataStax Distribution of Apache Cassandra™.

The solr_stress demo is a benchmark tool that simulates the number of requests for reading
and writing data on a DSE Search cluster using a specified sequential or random set of
iterations. See MBeans search demo.

The default location of the demos
directory depends on the type of installation:

Package installations: /usr/share/dse/demos

Tarball installations: installation_location/demos

CAUTION: The hardware used for load testing should mimic production environment,
by parameters and number of nodes. Performing load testing on a single node, or in
environments where the number of nodes is equal to the replication factor, will result in
incorrect measurements.