This section describes different storage options for MapR Monitoring. The MapR Control System (MCS) relies on MapR Monitoring components to display MCS metrics, but it can function without the monitoring components. Using MapR Monitoring to store logs is optional.

Priority 1 - Maximize Fault Tolerance

Follow these best practices to ensure that your MapR cluster can tolerate failures:

Ensure an odd number of ZooKeeper services. ZooKeeper fault tolerance depends on a
quorum (a majority) of ZooKeeper services being available. At least three ZooKeeper
services are recommended. For a higher level of fault tolerance, use five ZooKeeper
services. With five ZooKeepers, two can be lost and quorum is still maintained.

For other services, it makes sense for them to be at least as reliable as ZooKeeper.
Generally, this means at least two instances of the service for three ZooKeepers and
three instances for five ZooKeepers.

Include enough CLDBs to be as reliable as ZooKeeper. Because CLDBs use a master-slave configuration, a MapR cluster can function with
an odd or even number of CLDBs. The recommended minimum number of active CLDBs is two.
To tolerate failures, more CLDBs are needed:

If you have three ZooKeepers, configure at least three CLDBs.

If you have five ZooKeepers, configure at least four CLDBs. With four CLDBs, the
cluster can tolerate two CLDB failures and still provide optimal performance. Adding
a fifth CLDB does not increase failure tolerance in this configuration.

Include enough Resource Manager processes to be as reliable as ZooKeeper. Only one
Resource Manager is active at a time:

If you have three ZooKeepers, you need at least two Resource Managers.

If you have five ZooKeepers, you need at least three Resource Managers. Three
Resource Managers can survive the loss of two ZooKeepers.
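
The quorum arithmetic behind the ZooKeeper guidance above can be sketched in a few lines. This is a minimal illustration, assuming the usual majority rule for a ZooKeeper ensemble; the function names are hypothetical and are not part of any MapR tool.

```python
def zookeeper_quorum(ensemble_size: int) -> int:
    """Smallest number of ZooKeeper services that forms a majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """ZooKeeper services that can be lost while a majority remains."""
    return ensemble_size - zookeeper_quorum(ensemble_size)


# Print ensemble size, quorum size, and tolerated failures.
for size in (3, 4, 5):
    print(size, zookeeper_quorum(size), tolerated_failures(size))
```

Under this rule a fourth ZooKeeper tolerates no more failures than three do, which is one reason an odd number is preferred.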

For most MapR clusters, the recommended configuration is:

Three ZooKeepers

Three CLDBs

Two or three Resource Managers

For larger clusters, increase the number of CLDBs or ZooKeepers for better
performance or higher reliability. Table 1 shows the number of failures tolerated by
various combinations of ZooKeeper, CLDB, and Resource Manager services.

¹ For optimal failure handling, the minimum number of CLDBs is three; hence,
three or more CLDBs are recommended. With two CLDBs, the failure of one does not result in
an outage, but recovery can take longer than with three.

² Using five CLDBs does not improve fault tolerance significantly when compared
with four CLDBs. However, it can be convenient to have the same number of CLDBs as
ZooKeepers.
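
As a rough illustration, the minimum counts stated in this section for three and five ZooKeepers can be encoded as a small planning check. The sketch below is hypothetical (the function, dictionary, and message wording are assumptions, not a MapR utility), and it covers only the two ZooKeeper counts that the text gives minimums for.

```python
# Minimum service counts from the guidance above, keyed by ZooKeeper count.
MINIMUMS_BY_ZOOKEEPERS = {
    3: {"CLDB": 3, "Resource Manager": 2},
    5: {"CLDB": 4, "Resource Manager": 3},
}


def check_service_counts(zookeepers, cldbs, resource_managers):
    """Return a list of warnings; an empty list means the counts meet the guidance."""
    warnings = []
    if zookeepers % 2 == 0:
        warnings.append("use an odd number of ZooKeepers")
    minimums = MINIMUMS_BY_ZOOKEEPERS.get(zookeepers)
    if minimums is None:
        # This sketch only knows the minimums listed for 3 and 5 ZooKeepers.
        warnings.append("no minimums listed for this ZooKeeper count")
        return warnings
    actual = {"CLDB": cldbs, "Resource Manager": resource_managers}
    for service, minimum in minimums.items():
        if actual[service] < minimum:
            warnings.append(
                f"{service}: have {actual[service]}, need at least {minimum}"
            )
    return warnings


print(check_service_counts(3, 3, 2))
print(check_service_counts(5, 3, 2))
```

The first call returns an empty list because three ZooKeepers, three CLDBs, and two Resource Managers match the recommended configuration. The second call reports both the CLDB and Resource Manager shortfalls for a five-ZooKeeper cluster.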

Priority 2 - Minimize Resource Contention

Every service on a node represents a tax on the resources provided by that node. Spreading
services evenly across nodes maximizes performance and helps to keep failures isolated to
failure domains. Because of power and networking considerations, a rack is usually the most
common failure domain.

Follow these best practices to avoid performance bottlenecks:

Spread like services across racks as much as possible. While not necessary, it is also
convenient to put them in the same position within each rack, if possible.

To maximize availability, use three or more racks even for small clusters. Using two
racks is not recommended. If a cluster has three ZooKeepers, using two racks means one
of the racks will host two ZooKeepers. In this scenario, losing the rack that hosts two
ZooKeepers leaves a single ZooKeeper, which cannot form a quorum, so the cluster is
jeopardized.

For services that are replicated, make sure the replicas are in different
racks.

Put the Resource Manager and CLDB services on separate nodes, if possible.

Put the ZooKeeper and CLDB services on separate nodes, if possible.

Some administrators find it convenient to put web-oriented services together on nodes
with lower IP addresses in a rack. This is not required.

Avoid putting multiple resource-heavy services on the same node.

Spread the following resources across all data nodes:

Clients

Drill

NFS
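
The rack guidance above can be checked mechanically for ZooKeeper placement. The following is a minimal sketch, assuming a majority quorum and treating each rack as a single failure domain; the function name and rack labels are hypothetical.

```python
from collections import Counter


def racks_that_break_quorum(zookeeper_racks):
    """Given the rack of each ZooKeeper, return racks whose loss removes quorum."""
    total = len(zookeeper_racks)
    quorum = total // 2 + 1
    per_rack = Counter(zookeeper_racks)
    # A rack is a problem if the ZooKeepers left after losing it are below quorum.
    return sorted(
        rack for rack, count in per_rack.items() if total - count < quorum
    )


# Three ZooKeepers on two racks: one rack necessarily hosts two of them.
print(racks_that_break_quorum(["rack1", "rack1", "rack2"]))  # ['rack1']

# Three ZooKeepers on three racks: no single rack loss removes quorum.
print(racks_that_break_quorum(["rack1", "rack2", "rack3"]))  # []
```

The same approach could be applied to other replicated services by substituting that service's own availability requirement for the majority rule.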

Priority 3 - Promote High Availability

Whenever possible, configure high availability (HA) for all services, not just for services
that provide HA by default. The CLDB, ZooKeeper, Resource Manager, and Drill provide HA by
default. Some services are inherently stateless. If possible, configure multiple instances
of these services:

Large clusters increase CLDB and Resource Manager workloads significantly. In clusters of
50 or more nodes:

Use dedicated nodes for CLDB, ZooKeeper, and Resource Manager.

Note: Dedicated nodes have the benefit of supporting fast fail-over for file-server operations.

If fast fail-over is not critical and you need to minimize hardware costs, you may
combine the CLDB and ZooKeeper nodes. For example, a large cluster might include 3 to 9
such combined nodes.

If necessary, review and adjust the hardware composition of CLDB, ZooKeeper, and
Resource Manager nodes. Once you have chosen to use dedicated nodes for these services,
you might determine that they do not need to be identical to other cluster nodes. For
example, dedicated CLDB and ZooKeeper nodes probably do not need as much storage as
other cluster nodes.

Avoid configuring Drill on CLDB or ZooKeeper nodes.
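
The large-cluster guidance above can be turned into a simple review of a planned layout. This is a hedged sketch only: the service labels, the `review_layout` function, and the interpretation of "dedicated" (a node that runs nothing other than CLDB, ZooKeeper, or Resource Manager) are assumptions made for illustration, not MapR definitions.

```python
# Hypothetical labels for the services that large clusters should isolate.
CONTROL_SERVICES = {"cldb", "zookeeper", "resourcemanager"}


def review_layout(layout, large_cluster_threshold=50):
    """layout maps node name -> collection of service labels.

    Returns a list of findings based on the guidance for large clusters.
    """
    findings = []
    is_large = len(layout) >= large_cluster_threshold
    for node in sorted(layout):
        services = set(layout[node])
        control = services & CONTROL_SERVICES
        # Drill should not share a node with CLDB or ZooKeeper.
        if "drill" in services and control & {"cldb", "zookeeper"}:
            findings.append(f"{node}: Drill shares a node with CLDB or ZooKeeper")
        # In large clusters, control services should sit on dedicated nodes.
        if is_large and control and services - CONTROL_SERVICES:
            findings.append(f"{node}: control services are not on a dedicated node")
    return findings


plan = {
    "node01": {"cldb", "zookeeper"},  # combined CLDB and ZooKeeper node
    "node02": {"cldb", "drill"},
    "node03": {"drill"},
}
# The threshold is lowered here only so the small sample counts as "large".
for finding in review_layout(plan, large_cluster_threshold=3):
    print(finding)
```

A combined CLDB and ZooKeeper node passes this check, which matches the option of combining those nodes when fast fail-over is not critical.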

Example Clusters

The following examples are reasonable implementations of the design priorities introduced
earlier in this section. Other designs are possible and may satisfy your unique environment
and workloads.

Example 1: 6-Node Cluster

Example 1 shows a 6-node cluster contained in a single rack. When only a single rack is
available, this example can work for small clusters. However, the recommended best practice
for all clusters, regardless of size, is to use three or more racks, if possible.

Example 1a. Core and Hadoop for 6-Node Cluster

Example 1b. Ecosystem Components for 6-Node Cluster

³ Total cells show the total number of Core, Hadoop, and Ecosystem
components installed on each host node for the example cluster.

*Denotes a service that is lightweight and stateless. For greater performance, consider
running these services on all nodes and adding a load balancer to distribute network
traffic.

Example 2: 12-Node Cluster

Example 2 shows a 12-node cluster contained in three racks:

Example 2a. Core and Hadoop for 12-Node Cluster

Example 2b. Ecosystem Components for 12-Node Cluster

Example 3: 50-Node Cluster

Example 3 shows a 50-node cluster contained in five racks:

Example 3a. Core and Hadoop for 50-Node Cluster (Racks 1-3)

Example 3b. Core and Hadoop for 50-Node Cluster (Racks 4-5)

Example 3c. Ecosystem Components for 50-Node Cluster (Racks 1-3)

Example 3d. Ecosystem Components for 50-Node Cluster (Racks 4-5)
