The Repair Service runs as a background process. The Repair Service cyclically repairs a DataStax Enterprise cluster
within the specified completion time. This overview describes the Repair Service behavior and its response to changes
in cluster topology or schemas.

Repair Service overview


The Repair Service repairs small chunks of a cluster in the background, cycling through
the cluster within the specified time to completion. Any anticipated overshoot of the
targeted completion time is communicated with a revised estimate.

Repair Service Summary

The Repair Service automates the repair process for DSE clusters. There are two types of
repairs handled by the service: subrange and incremental.

The term repair is a bit of a misnomer. Repairs run by the Repair Service mainly
synchronize the most current data across nodes and their replicas, which includes
repairing any corrupted data encountered at the filesystem level. The Repair Service can
run both subrange and incremental repairs. By default, the Repair Service runs
subrange repairs for most tables and can be configured to run incremental repairs on certain
tables.

Subrange repairs repair a portion of the data that a node is responsible for.
Subrange repairs in the Repair Service are analogous to specifying the -st
and -et options on the nodetool repair command, except that the
Repair Service determines and optimizes the start and end tokens of each subrange for you. The
main benefit of subrange repair is more precise targeting of repairs while avoiding
overstreaming.
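The idea behind subrange repair can be illustrated with a short sketch. This is not
OpsCenter code; the helper name and token values are hypothetical, and it simply shows
how a node's token range might be divided into chunks, each equivalent to one
nodetool repair -st/-et invocation.

```python
# Illustrative sketch only -- not OpsCenter code. Splits a node's token
# range into equal subranges, conceptually similar to what the Repair
# Service does before repairing each chunk with the equivalent of -st/-et.

def split_token_range(start_token, end_token, num_subranges):
    """Return (st, et) pairs covering [start_token, end_token]."""
    width = (end_token - start_token) // num_subranges
    subranges = []
    st = start_token
    for i in range(num_subranges):
        # The last subrange absorbs any rounding remainder.
        et = end_token if i == num_subranges - 1 else st + width
        subranges.append((st, et))
        st = et
    return subranges

# Example: split a small illustrative range into 4 chunks; each pair
# maps to a "nodetool repair -st <st> -et <et>" invocation.
for st, et in split_token_range(0, 1000, 4):
    print(f"nodetool repair -st {st} -et {et}")
```

Repairing many small subranges like this keeps each repair narrowly targeted, which is
how the Repair Service avoids the overstreaming that full-range repairs can cause.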

Incremental repairs only repair data that has not been previously repaired on tables
reserved and configured for incremental repair.

Subrange repairs operate on an exclusion (opt-out) basis: certain keyspaces and tables
can be excluded. Tables ignored by subrange repairs consist of those reserved by
OpsCenter and those configured by admins. Incremental repairs operate on an inclusion (opt-in)
basis: only those keyspaces and tables designated for incremental repairs are processed
during an incremental repair. Tables flagged for incremental repair include those built in
by OpsCenter and those configured by admins.

If data is relatively static, configure incremental repair for those tables or datacenters.
If data is dynamic and constantly changing, use subrange repairs, excluding keyspaces and
tables as appropriate for your environment.

There is no crossover between subrange and incremental repairs: each keyspace and table is
repaired by either a subrange or an incremental repair, never both. The two repair types are
mutually exclusive at the table level. The Repair Service runs both repair types
simultaneously. Each repair type has its own timeline, tracked in its own subrange or
incremental progress bar in the Repair Status summary.

Parameters

The time_to_completion parameter sets the maximum amount of time allowed
to repair the entire cluster once.

Note: Typically, set the Time to Completion to a value lower than the lowest
gc_grace_seconds (the grace period before garbage collection) setting on
your tables. The default for gc_grace_seconds is 10 days
(864000 seconds). OpsCenter provides an estimate by checking
gc_grace_seconds across all tables and calculating 90%
of the lowest value. Based on the typical grace seconds default, the default
estimate for the time to completion is 9 days. For more information about
configuring grace seconds, see gc_grace_seconds in the CQL
documentation.
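The estimate described above is simple arithmetic. A minimal sketch, assuming a list of
gc_grace_seconds values collected from your tables (the function name is illustrative,
not an OpsCenter API):

```python
# Sketch of OpsCenter's time-to-completion estimate: 90% of the lowest
# gc_grace_seconds across all tables. Values here are illustrative.

def estimate_time_to_completion(gc_grace_values):
    """Return 90% of the lowest gc_grace_seconds, in seconds."""
    return int(min(gc_grace_values) * 0.9)

# With the gc_grace_seconds default of 10 days (864000 s) on every table:
seconds = estimate_time_to_completion([864000, 864000])
print(seconds)            # 777600 seconds
print(seconds / 86400)    # 9.0 days
```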

The Repair Service might run multiple subrange repairs in parallel, but runs as few as
needed to complete within the amount of time specified. The Repair Service always avoids
running more than one repair within a single replica set; there is no overlap in repairs
between replica sets.

Estimating remaining repair time

If the Repair Service anticipates it cannot complete a repair cycle within the allotted
time to completion due to throughput, it displays a warning message and a newly estimated
time remaining to complete the repair cycle. The Repair Service does not adjust the
configured time to completion; it reports the revised estimate for completion without
stopping the repair in progress.

When the Repair Service estimates that it will not finish a repair cycle within the
configured time_to_completion, it triggers an ALERT in the OpsCenter Event
Log. The alert is also visible in opscenterd.log and in the
Event Log in the Activities section of the OpsCenter UI. If email alerts or post-url alert notifications are configured, the
alert notifications are emailed or posted.

The error_logging_window configuration property controls both how often to log the
message and how often to fire the alert if the Repair Service continues to estimate that it
will not finish a repair in time.

Parallel vs. sequential validation compaction processing

The Repair Service runs validation compaction in parallel by default rather than
sequentially because sequential processing takes considerably more time. The
snapshot_override setting controls whether validation compactions for
both subrange and incremental repairs are processed in parallel or sequentially. See Running validation compaction sequentially.

Restart frequency

The Repair Service pauses when it detects a topology change or schema change and then
restarts after a period of time. The restart period is controlled by the
restart_period configuration option, which defaults to 300 seconds (5
minutes). While paused, the Repair Service checks the state of the cluster at this
interval until it can reactivate.
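The pause-and-poll behavior can be sketched roughly as follows. The helper names are
hypothetical; the real loop is internal to opscenterd.

```python
# Rough sketch of the pause/poll loop after a topology or schema change.
# All names here are hypothetical, not OpsCenter internals.
import time

RESTART_PERIOD = 300  # seconds; the restart_period default

def wait_until_stable(cluster_is_stable, sleep=time.sleep):
    """Poll the cluster at restart_period intervals until it is stable."""
    checks = 0
    while not cluster_is_stable():
        checks += 1
        sleep(RESTART_PERIOD)
    return checks  # number of paused intervals before reactivating

# Example with a fake cluster that stabilizes after two checks:
state = iter([False, False, True])
print(wait_until_stable(lambda: next(state), sleep=lambda s: None))  # 2
```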

Conditions under which the Repair Service does not run

A cluster with a single node is not eligible for repairs. Repairs make node replicas
consistent; therefore, there must be at least two nodes to exchange Merkle trees during the
repair process.

Repair Service behavior during environment changes

The following sections provide details on how the Repair Service behaves when there are
changes in the environment such as topology changes, down nodes, and OpsCenter restarts.

Cluster topology changes

The Repair Service becomes aware of any topology changes to a cluster almost immediately.
When a change in cluster topology occurs, the Repair Service stops its current repair cycle
and waits for the ring to stabilize before starting a new cycle. Before resuming repairs, the
Repair Service checks the cluster state every 30 seconds by default. After the cluster
has stabilized, the stabilization checks cease until the next time
opscenterd is restarted. Configure the interval for the stable cluster
check with the cluster_stabilization_period option.

Topology changes include:

Nodes moving within a cluster

Nodes joining a cluster

Nodes leaving a cluster

Schema changes

When a schema change happens, the Repair Service pauses for five minutes by default, then
starts back up and immediately begins repairing new keyspaces or tables. Schema changes
include adding, changing, or removing keyspaces or tables.

Down nodes or replicas

A repair cannot run if any of the nodes in the replica set for that range are down. In the
case where an entire rack or data center goes down, it is likely that no repair operations
can be successfully run on the cluster. When one or more nodes are down, the Repair Service
continues to run repairs for ranges and keyspaces unaffected by the down nodes.

When there are no runnable repair operations remaining, the Repair Service waits for 10
seconds and checks again. The Repair Service repeats this for up to the value configured for
the max_down_node_retry option, which defaults to three hours based on the
max_hint_window_in_ms property in cassandra.yaml,
and then starts a new cycle. After the max_hint_window_in_ms is exceeded
for a down node, the recovery process for that node is to rebuild rather than rely on hint
replay. Therefore the Repair Service starts a new cycle to ensure that any available ranges
continue to be repaired and are not blocked by down nodes.
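The wait-and-recheck behavior described above can be sketched as follows. The function
and helper names are hypothetical, not OpsCenter code; only the 10-second recheck and the
three-hour cap come from the documentation.

```python
# Sketch of the down-node wait loop (hypothetical names, not OpsCenter code):
# when no runnable repairs remain, wait 10 seconds and recheck, starting a
# new cycle once the max_down_node_retry window is exceeded.

RECHECK_INTERVAL = 10            # seconds between checks
MAX_DOWN_NODE_RETRY = 3 * 3600   # default: three hours (max_hint_window_in_ms)

def wait_for_runnable_repairs(has_runnable, sleep):
    """Recheck until a repair becomes runnable or the retry window expires."""
    waited = 0
    while waited < MAX_DOWN_NODE_RETRY:
        if has_runnable():
            return "resume"      # a replica set became available again
        sleep(RECHECK_INTERVAL)
        waited += RECHECK_INTERVAL
    return "new_cycle"           # do not stay blocked on down nodes
```

Starting a new cycle after the window expires mirrors the reasoning above: past
max_hint_window_in_ms, a down node must be rebuilt anyway, so available ranges should
not wait on it.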

Note: To mitigate the performance implications of scanning the entire list of remaining repair
tasks, the scan for available ranges only scans the first
prioritization_page_size tasks (default: 512). The order of these tasks
is random, so if no available ranges are found in the first
prioritization_page_size, it is unlikely there are any available
ranges.
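The bounded scan in the note can be sketched as follows (hypothetical names; only the
512-task default and the random ordering come from the documentation):

```python
# Sketch of the bounded availability scan: examine at most
# prioritization_page_size randomly ordered tasks. Hypothetical names.
import random

PRIORITIZATION_PAGE_SIZE = 512  # default

def find_available_range(tasks, is_available):
    """Scan at most PRIORITIZATION_PAGE_SIZE tasks, in random order."""
    page = tasks[:PRIORITIZATION_PAGE_SIZE]
    random.shuffle(page)  # random order, so a miss implies few remain anywhere
    for task in page:
        if is_available(task):
            return task
    return None  # likely no available ranges at all
```

Because the task order is random, failing to find an available range within the first
page is strong evidence that no range in the full list is available, which is what makes
the bounded scan a safe performance optimization.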

Persisted repair state when restarting opscenterd

At the end of each persist period (one hour by default), the current state of the
Repair Service is persisted locally on the opscenterd server in the persist
directory location. The persist period frequency can be configured with the
persist_period option. The persist directory location can be configured
with the persist_directory option. When opscenterd is
restarted, the Repair Service resumes where it left off based on the persisted state
information.
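The configuration options mentioned throughout this overview live in the OpsCenter
per-cluster configuration. The fragment below is an illustrative sketch of a
[repair_service] section using the defaults cited in this overview; the exact file name,
option placement, and available options vary by OpsCenter version, so consult the
configuration reference for your release before copying values.

```ini
[repair_service]
# Pause length after a topology or schema change (seconds).
restart_period = 300
# Interval for re-checking cluster state while waiting for stabilization.
cluster_stabilization_period = 30
# How long to keep retrying when down nodes block all remaining ranges
# (seconds; default of three hours tracks max_hint_window_in_ms).
max_down_node_retry = 10800
# Number of tasks scanned when looking for an available range.
prioritization_page_size = 512
# How often the repair state is persisted (seconds).
persist_period = 3600
# Where the repair state is persisted; path is environment-specific.
# persist_directory = <path>
# How often to re-log and re-alert on a projected overrun (seconds).
# error_logging_window = <seconds>
```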


