One of the common challenges of deploying a search engine is keeping the search indexes synchronized with the source data. In some cases, a batch process using custom code to periodically index new documents is satisfactory, but in many enterprise environments today, real-time (or near real-time) synchronization is required.

In the 5.0 release of MapR, you can create external search indexes in Elasticsearch on columns, column families, or entire MapR-DB tables, and keep those indexes updated in near real time. That is, when a MapR-DB table is updated, the new data is replicated to Elasticsearch almost immediately. As shown in this post, this capability requires only a few configuration steps to set up.

Uncomment the line and create your own cluster with the ES_NAME (in my example, “AbizerElasticCluster”).

Since our nodes are spun up in AWS and we would like to restrict cluster communication, we need to disable multicast discovery and configure an initial Elasticsearch master node to perform discovery when other Elasticsearch nodes (master or data) are added to the cluster.

To do so, you will need to modify the settings below in the elasticsearch.yml config file.

# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
# to perform discovery when new nodes (master or data) are started:
#
discovery.zen.ping.unicast.hosts: ["ec2-54-151-49-244.us-west-1.compute.amazonaws.com"]

Note: Unicast discovery allows you to explicitly control which nodes are used to discover the cluster. It is mainly used when the MapR cluster is in a different subnet than the Elasticsearch cluster, or when multicast is disabled. You only need to give the details of one node, which acts as a transport node. MapR gateways pass replicated updates from the source MapR cluster to the transport nodes, and these nodes are responsible for distributing the updates to the correct nodes in the Elasticsearch cluster.
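
Before starting Elasticsearch, it can help to confirm that both discovery settings are active and not still commented out. The function below is a minimal sketch of such a check. The name check_discovery is hypothetical, and the config path in the usage comment assumes the install directory used in this post.

```shell
# check_discovery FILE
# Succeeds only if FILE disables multicast discovery and sets a
# non-empty unicast host list. Commented-out lines are ignored.
check_discovery() {
    conf=$1
    # Multicast must be explicitly set to false on an uncommented line.
    grep -Eq '^[[:space:]]*discovery\.zen\.ping\.multicast\.enabled:[[:space:]]*false[[:space:]]*$' "$conf" || {
        echo "multicast discovery is not disabled in $conf" >&2
        return 1
    }
    # The unicast host list must contain at least one quoted host.
    grep -Eq '^[[:space:]]*discovery\.zen\.ping\.unicast\.hosts:[[:space:]]*\[[[:space:]]*"[^"]+"' "$conf" || {
        echo "no unicast hosts configured in $conf" >&2
        return 1
    }
    echo "discovery settings look OK in $conf"
}

# Example (assumed path, adjust to your install):
# check_discovery /opt/elasticsearch-1.4.4/config/elasticsearch.yml
```

The check is deliberately strict about comment markers, because a setting left behind a leading # silently keeps the default.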

Run the command below to start Elasticsearch in the background (preferably under a screen session):
/opt/elasticsearch-1.4.4/bin/elasticsearch -d &
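
Because the process starts in the background, it is worth confirming that the node is answering before moving on. The sketch below polls the Elasticsearch cluster health endpoint with curl. It assumes the default HTTP port 9200 and that curl is installed; the function name wait_for_es and the retry values are illustrative, not part of Elasticsearch or MapR.

```shell
# wait_for_es HOST [TRIES]
# Polls the cluster health endpoint until the status is green or
# yellow, or until TRIES attempts (default 30) have been used up.
wait_for_es() {
    host=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # -s hides progress output; a refused connection gives empty output.
        health=$(curl -s "http://$host:9200/_cluster/health" 2>/dev/null)
        case $health in
            *'"status":"green"'*|*'"status":"yellow"'*)
                echo "Elasticsearch on $host is up"
                return 0 ;;
        esac
        tries=$((tries - 1))
        sleep 2
    done
    echo "Elasticsearch on $host did not respond" >&2
    return 1
}

# Example: wait_for_es localhost
```

A yellow status is accepted here because a single-node cluster cannot allocate replica shards and so may never report green.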

Register your Elasticsearch cluster with MapR
The next step is to make the MapR cluster aware of the Elasticsearch cluster. This is done with the “register-elasticsearch” script.
Run the command below on a MapR cluster node:

/opt/mapr/bin/register-elasticsearch -r ec2-54-219-214-156.us-west-1.compute.amazonaws.com -e /opt/elasticsearch-1.4.4 -u root -y -f -t
-r the hostname or IP address of the Elasticsearch node that needs to be registered
-e the home directory for Elasticsearch
-u the user who can login to ES_NODE and read all the files under the ES_HOME directory (default user is the user who is running the register command)
-y omit interactive prompts
-t notifies the cluster to use the nodes listed in -r as transport nodes
-f forces the registration of the Elasticsearch cluster
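
If you register more than one environment, it can be convenient to keep the node, home directory, and user in variables and assemble the command from them. The sketch below only prints the command line so it can be reviewed before it is run. It uses just the options described above; the variable and function names are hypothetical.

```shell
# Values taken from the example in this post; change them for your cluster.
ES_NODE="ec2-54-219-214-156.us-west-1.compute.amazonaws.com"
ES_HOME="/opt/elasticsearch-1.4.4"
ES_USER="root"

# build_register_cmd NODE HOME USER
# Prints the register-elasticsearch command line without running it.
build_register_cmd() {
    printf '%s -r %s -e %s -u %s -y -f -t\n' \
        /opt/mapr/bin/register-elasticsearch "$1" "$2" "$3"
}

# Dry run: show the command that would be executed on the MapR node.
build_register_cmd "$ES_NODE" "$ES_HOME" "$ES_USER"
```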

Wait until it finishes. Once the command completes, the Elasticsearch target cluster is registered in your MapR cluster. When you run the register script, it copies the Elasticsearch cluster’s configuration file (elasticsearch.yml), JAR files, and plugin JAR files into MapR-FS.

We can list the Elasticsearch cluster registered with our MapR cluster as below.

-path the source table path
-target the target Elasticsearch cluster name
-index the name of the index you want to use in Elasticsearch. In the RDBMS world this can be thought of as a database.
-type the name of the type you want to use within the Elasticsearch index. In the RDBMS world, this can be thought of as a table.

This command registers the destination Elasticsearch type as a replica of the source table, copies the content of the source table into the Elasticsearch cluster by running CopyTable in the background, and then starts the replication stream that keeps the Elasticsearch indexes up to date. Updates to the source table are replicated in near real time by the replication stream. Because replication to the Elasticsearch indexes is asynchronous, a new update may take a moment to become searchable.
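
One way to watch the initial copy and the replication stream from the Elasticsearch side is to ask for the document count of the target index and type. The sketch below uses the Elasticsearch count endpoint through curl. It assumes the default HTTP port 9200; the function name es_doc_count and the index and type names in the usage comment are placeholders, not values from this setup.

```shell
# es_doc_count HOST INDEX TYPE
# Prints the number of documents Elasticsearch currently holds for
# the given index and type, using the _count endpoint.
es_doc_count() {
    # The response looks like {"count":N,"_shards":{...}}; keep only N.
    curl -s "http://$1:9200/$2/$3/_count" |
        sed -n 's/.*"count":\([0-9][0-9]*\).*/\1/p'
}

# Example with placeholder names:
# es_doc_count localhost myindex mytype
```

Since replication is asynchronous, running the count a few times after writing to the source table should show it catching up rather than changing in lockstep with each write.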

Once the command above finishes successfully (it might take a while for huge tables), Elasticsearch replicas for the source table can be listed as below from the MapR cluster.