Monitoring Apache Spark applications running on Amazon EMR

This is a guest post by Priya Matpadi, Principal Engineer at Lookout, a mobile-first security platform for protecting mobile endpoints, consumer-facing apps, and more. This post originally appeared on her blog.

We recently implemented a Spark streaming application, which consumes data from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.

As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted to ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.

In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.

Why Datadog?

The Spark UI provides a pretty good dashboard to display useful information about the health of the running application. However, this tool provides only one angle on the kind of information you need for understanding your application in a production environment. And although metrics generated by EMR are automatically collected and pushed to Amazon’s CloudWatch service, this data is more focused on running MapReduce tasks on the YARN cluster, rather than Spark streaming applications.

Luckily, Datadog provides built-in integrations for monitoring both Amazon EMR and Spark. Moreover, we were already using Datadog for application and cluster monitoring, so monitoring our Spark streaming application in the same platform was a natural choice.

Integrating Datadog with EMR

Setting up the Datadog integration with EMR is pretty straightforward. All the steps and metrics you can graph are documented nicely here. Just make sure your Datadog account is linked to your relevant AWS account, and has permission to pull metrics. Here is what the process looks like, in short:

In particular, ensure that the AWS role specified in the Configuration tab of the integration has List* and Describe* permissions for Elastic MapReduce.
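As a rough sketch, the corresponding IAM policy statement might look like the following (scope it down further as your security requirements dictate):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:List*",
        "elasticmapreduce:Describe*"
      ],
      "Resource": "*"
    }
  ]
}
```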

To make it easier to filter and aggregate your metrics, you can apply tags to your EMR cluster using the AWS console:

Within a few minutes, your cluster’s metrics should become available in Datadog.

Collecting Spark metrics in Datadog

Next, we’ll show you how you can set up your EMR cluster to publish Spark driver, executor, and RDD metrics about the Spark streaming app to Datadog. You can read more about Datadog’s Spark integration here.

To install the Datadog Agent on our cluster and enable the Agent’s Spark check, we leveraged EMR Bootstrap actions. From the AWS documentation:

You can use a bootstrap action to install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Amazon EMR installs specified applications and the node begins processing data. If you add nodes to a running cluster, bootstrap actions run on those nodes also.

Setting up the Spark check on an EMR cluster is a two-step process, each executed by a separate script:

Install the Datadog Agent on each node in the EMR cluster

Configure the Datadog Agent on the master node to run the Spark check at regular intervals and publish Spark metrics to Datadog

Examples of both scripts are shown below; they assume that both scripts have been uploaded to S3 under <s3-bucket-name>/bootstrap-actions/.

Install the Datadog Agent on EMR nodes

The first script, emr-bootstrap-datadog-install.sh, is launched by the bootstrap step during EMR launch. The script downloads and installs the Datadog Agent on each node of the cluster. Simple! It then executes the second script, emr-bootstrap-datadog-spark-check-setup.sh, as a background process. Note that the first script requires four positional arguments.
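We won't reproduce our exact script here, but a minimal sketch might look like the following. The four positional arguments shown (Datadog API key, S3 bucket, environment tag, and cluster name tag), the helper names, and the v5-era Agent paths are assumptions for illustration; adjust them to match your own setup.

```shell
#!/usr/bin/env bash
# emr-bootstrap-datadog-install.sh -- illustrative sketch, not the original.
# Assumed positional arguments:
#   $1 Datadog API key    $2 S3 bucket holding the scripts
#   $3 environment tag    $4 cluster name tag

# EMR writes per-node metadata here; "isMaster" tells us whether this
# node is the master.
INSTANCE_INFO="${INSTANCE_INFO:-/mnt/var/lib/info/instance.json}"

is_master() {
  grep -q '"isMaster": *true' "$INSTANCE_INFO"
}

main() {
  local api_key="$1" s3_bucket="$2" environment="$3" cluster_name="$4"

  # Install the (v5-era) Datadog Agent via Datadog's install script.
  DD_API_KEY="$api_key" bash -c \
    "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"

  # Tag every host's metrics with the environment and cluster name.
  sudo sed -i "s|^# *tags:.*|tags: env:${environment}, cluster:${cluster_name}|" \
    /etc/dd-agent/datadog.conf

  # Only the master node runs the Spark check; hand off to the second
  # script in the background so bootstrap can finish before YARN exists.
  # (--run is a hypothetical flag telling that script to do its work.)
  if is_master; then
    aws s3 cp "s3://${s3_bucket}/bootstrap-actions/emr-bootstrap-datadog-spark-check-setup.sh" /tmp/
    chmod +x /tmp/emr-bootstrap-datadog-spark-check-setup.sh
    nohup /tmp/emr-bootstrap-datadog-spark-check-setup.sh --run >/tmp/spark-check-setup.log 2>&1 &
  fi
}

# Execute only when given all four arguments (keeps the functions sourceable).
if [ "$#" -eq 4 ]; then
  main "$@"
fi
```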

Configure the Datadog Agent on the master node

Why do we need to run the configuration step in a separate script? Remember that bootstrap actions are run before any application is installed on the EMR nodes. The first script installed new software (the Datadog Agent), but the second step requires that YARN and Spark are installed before the Datadog configuration can be completed.

yarn-site.xml does not exist at the time that the Datadog Agent is installed. Hence we launch a background process to run the Spark check setup script. It waits until yarn-site.xml has been created and contains a value for the YARN property yarn.resourcemanager.hostname. Once that property is found, the script creates the spark.yaml file and moves it under /etc/dd-agent/conf.d. Then it sets the appropriate permissions on spark.yaml, and restarts the Datadog Agent to load the configuration change.
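A sketch of what that second script might look like is below. The YAML fields follow the Datadog Spark check's documented options (spark_url, spark_cluster_mode, cluster_name), while the helper names, the cluster_name value, and the 10-second poll interval are illustrative choices of our own.

```shell
#!/usr/bin/env bash
# emr-bootstrap-datadog-spark-check-setup.sh -- illustrative sketch.

YARN_SITE="${YARN_SITE:-/etc/hadoop/conf/yarn-site.xml}"

# Extract yarn.resourcemanager.hostname from yarn-site.xml, assuming the
# usual one-element-per-line Hadoop XML layout.
resourcemanager_hostname() {
  grep -A1 'yarn.resourcemanager.hostname' "$YARN_SITE" 2>/dev/null \
    | sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
}

# Render spark.yaml for the Agent's Spark check, pointing it at the
# YARN ResourceManager web UI (port 8088 by default).
render_spark_yaml() {
  local rm_host="$1"
  cat <<EOF
init_config:

instances:
  - spark_url: http://${rm_host}:8088
    spark_cluster_mode: spark_yarn_mode
    cluster_name: emr-spark-cluster
EOF
}

main() {
  # Bootstrap actions run before YARN is installed, so poll until the
  # property shows up in yarn-site.xml.
  until [ -n "$(resourcemanager_hostname)" ]; do
    sleep 10
  done
  render_spark_yaml "$(resourcemanager_hostname)" > /etc/dd-agent/conf.d/spark.yaml
  chown dd-agent:dd-agent /etc/dd-agent/conf.d/spark.yaml
  chmod 640 /etc/dd-agent/conf.d/spark.yaml
  /etc/init.d/datadog-agent restart
}

# Do the work only when invoked with --run, so the functions stay sourceable.
if [ "${1:-}" = "--run" ]; then
  main
fi
```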

Invoke install and config scripts via bootstrap actions

You can launch an EMR cluster programmatically, via the AWS CLI, or in the AWS console, and each of these methods offers the option to invoke bootstrap actions. Refer to the AWS documentation for a guide to invoking bootstrap actions while launching clusters from the AWS console or via the AWS CLI. Below, you can see how we invoked our bootstrap action script while launching an EMR cluster programmatically (our launcher is written in Scala).
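Our Scala launcher isn't reproduced here, but the AWS CLI equivalent below sketches the same idea. The bucket name, instance sizing, release label, and bootstrap-action arguments are placeholders; the four Args entries line up with the positional arguments the install script expects.

```shell
# Hypothetical values -- substitute your own bucket, key, and sizing.
S3_BUCKET="<s3-bucket-name>"
DD_API_KEY="<your-datadog-api-key>"

# Assemble the create-cluster invocation, attaching the install script
# as a bootstrap action so it runs on every node at launch.
aws_args=(
  emr create-cluster
  --name "spark-streaming-cluster"
  --release-label emr-5.8.0
  --applications Name=Spark
  --instance-type m4.xlarge
  --instance-count 3
  --use-default-roles
  --log-uri "s3://${S3_BUCKET}/logs/"
  --bootstrap-actions "Path=s3://${S3_BUCKET}/bootstrap-actions/emr-bootstrap-datadog-install.sh,Args=[${DD_API_KEY},${S3_BUCKET},production,spark-streaming]"
)

# Uncomment to actually launch the cluster:
# aws "${aws_args[@]}"
```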

Validate that the integration is properly configured

Finally, to confirm that the bootstrap actions completed successfully, you can check the EMR logs in the S3 log directory you specified while launching the cluster. Bootstrap action logs can be found in a path following this form:
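Per the EMR documentation, that layout looks like this, where each action-number directory holds the compressed stdout and stderr of the corresponding script:

```
s3://<s3-log-bucket>/<cluster-id>/node/<ec2-instance-id>/bootstrap-actions/<action-number>/
```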

Monitoring Spark application metrics in Datadog

Now that we have the Datadog Agent collecting Spark metrics from the driver and executor nodes of the EMR cluster, we have also laid the groundwork to publish metrics from our application to Datadog. Because the Datadog Agent is now running on your cluster nodes, you can instrument your application code to publish metrics to the Datadog Agent. As the application developer, you have the best picture of your application’s function, its business logic, and its downstream dependencies, whether it be a database, an API server, or a message bus. Therefore, you have the best idea of what metrics will be useful to monitor the health and performance of your Spark streaming application.

To start collecting custom application metrics in Datadog, launch a reporter thread in the initialization phase of your Spark streaming application, and instrument your application code to publish metrics as events are processed by the application. Spark’s metric system is based on the Dropwizard Metrics library, so you can use a compatible client library like the open source metrics-datadog project to route those metrics to Datadog.
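Separately from the Dropwizard reporter route described above, a quick way to convince yourself that the Agent is accepting custom metrics is to hand-deliver one over DogStatsD, which the Agent listens for on UDP port 8125 by default. The metric name and tag below are made up for illustration:

```shell
# DogStatsD wire format: <metric.name>:<value>|<type>|#<tag:value,...>
# "c" means counter; the name and tag here are hypothetical.
metric="spark.app.events_processed:1|c|#topic:device_telemetry"

# Fire-and-forget over UDP to the local Datadog Agent (port 8125).
echo -n "$metric" > /dev/udp/127.0.0.1/8125 || true
```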

Spark monitoring from all sides

To recap, in this post we’ve walked through implementing multiple layers of monitoring for Spark applications running on Amazon EMR:

Enable the Datadog integration with EMR

Run scripts at EMR cluster launch to install the Datadog Agent and configure the Spark check

Set up your Spark streaming application to publish custom metrics to Datadog

Once you’re collecting data from your EMR cluster, your Spark nodes, and your application, you can create a beautiful dashboard in Datadog combining all this data to provide visibility into the health of your Spark streaming application.