
Monitoring NiFi – Ambari & Grafana

When using Apache NiFi (note that version 1.2.0 is now released!) as part of HDF, a lot of things are simplified by using Apache Ambari to deploy NiFi and manage its configuration. Also, with the Ambari Metrics service and Grafana, you can easily and visually monitor NiFi’s performance. You can also use Apache Ranger to centralize authorization management for multiple components (NiFi, Kafka, etc.) in a single place.

This article will discuss how you can use Ambari Metrics and Grafana to improve your NiFi monitoring. Let’s start with a quick discussion around AMS (Ambari Metrics System). By default this service is running a Metrics Collector with an embedded HBase instance (and a Zookeeper instance) to store all the metrics, and Ambari will also deploy Metrics Monitor instances on all the nodes of the cluster. The monitors will collect the metrics at system level and send the metrics to the collector. However, the collector also exposes a REST API and that’s what NiFi is going to use with the AmbariReportingTask.
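As a quick way to check what the collector has stored, the same REST API can be queried directly. The sketch below only builds the query URL. It assumes the collector listens on port 6188 and serves metrics under /ws/v1/timeline/metrics, which are the usual defaults but may differ in your installation; the host name "ams-host" is a placeholder.

```python
from urllib.parse import urlencode


def build_metrics_query(collector, metric_names, app_id="nifi",
                        hostname=None, start_ms=None, end_ms=None):
    """Build a GET URL for the Metrics Collector timeline API.

    Assumption: the collector exposes /ws/v1/timeline/metrics and accepts
    the metricNames, appId, hostname, startTime and endTime parameters.
    """
    params = {"metricNames": ",".join(metric_names), "appId": app_id}
    if hostname is not None:
        params["hostname"] = hostname
    if start_ms is not None:
        params["startTime"] = start_ms  # epoch milliseconds
    if end_ms is not None:
        params["endTime"] = end_ms      # epoch milliseconds
    return f"{collector.rstrip('/')}/ws/v1/timeline/metrics?{urlencode(params)}"


if __name__ == "__main__":
    # "ams-host" is a hypothetical collector address.
    print(build_metrics_query(
        "http://ams-host:6188",
        ["BytesReceivedLast5Minutes", "BytesReadLast5Minutes"],
        start_ms=1000, end_ms=2000))
```

Opening the resulting URL in a browser or with curl should return JSON if the reporting task is pushing metrics under the "nifi" application ID.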

When using HDF, the Ambari Reporting task should be already up and running for you. If not, you can add it and configure it with a frequency of one minute (it does matter) and use the following parameters:

Note that “ambari.metrics.collector.url” is an environment variable already set for you when Ambari is starting NiFi. You could also directly give the address, in my case:

Once this reporting task is up and running, you should be able to see the metrics on the NiFi service page in Ambari:

Also, you can go into Grafana to display dashboards with the metrics of your components. You have pre-configured dashboards and here is the one for NiFi:

Now, all the metrics we have here are at cluster level. We are not able to display metrics for specific workflows. With the latest release of Apache NiFi (1.2.0), there is now an optional parameter in the AmbariReportingTask to specify a process group ID. This way, by creating a second reporting task (keep the one providing cluster-level metrics) and by specifying the ID of a specific process group, you can actually create your Grafana dashboards at workflow level.

Let’s say I’ve the following workflow:

And inside my process group, I have:

Now, my process group having the ID “75973b6e-2d38-1cf3-ffff-fffffdea8cbc”, I can define the following Ambari reporting task:

Note – you must keep “nifi” as the Application ID as it has to match the configuration of the Ambari Metrics System.

Once your reporting task is running, in Grafana, you can create your own dashboard for this workflow and display the metrics you want:

For my Kafka example, here is the dashboard I defined:

In this example, I can see that my workflow is running fine but the free disk space on one of my nodes is decreasing very quickly. It turns out that when the disk is completely full, back pressure kicks in and no more data is sent to Kafka. Instead, data is queued in NiFi.

This simple example gives me a lot of information:

Everything is default configuration in Ambari and I chose my three NiFi nodes to also host Kafka brokers. By default, for Kafka, the replication factor is set to 1, the number of partitions is set to 1 and the automatic creation of topic is allowed (that’s why I didn’t need to create the topic before starting my workflow). Because of the default parameters, all of the data is sent to only one Kafka broker (pvillard-hdf-2) and that’s why the disk space is quickly decreasing on this node since my three NiFi nodes are sending data to this broker.

Also, we clearly see that it’s not a good idea to colocate NiFi and Kafka on the same nodes since they are both IO intensive. In this case, they are using the same disk… and we can see that the task duration (for NiFi) is clearly higher on the Kafka node that is receiving the data (pvillard-hdf-2). Long story short: keep NiFi and Kafka on separate nodes (or at the very least on different disks).

HDF and the Ambari Metrics System give you the ability to create relevant custom dashboards for specific use cases. They also allow you to mix information from Kafka, from NiFi and from the hosts to have everything you need in a single place.

Also, by using the REST API of the Metrics Collector (you may be interested in this article), you could send your own data (not only the data gathered at the process group level) to add more information to your dashboards. An example that comes to mind would be to send the lineage duration (see Monitoring of Workflow SLA) at the end of the workflow, using an InvokeHTTP processor to send a JSON payload in a POST request to the API endpoint.

Let’s say I want to monitor how long it takes between my GenerateFlowFile and the end of my workflow to check if some particular events are taking longer. Then I could have something like:

What am I doing here? I want to send to AMS the information about the lineage duration of the flow files I sent into my Kafka topic. However I don’t want to send the duration of every single event (that’s not really useful and it’s going to generate a lot of requests/data). Instead I want to make an API call only once per minute. The idea is to compute the mean and max of the lineage duration with a rolling window of one minute and to only send this value to AMS.

I could use the new AttributeRollingWindow processor but it is not as fast as the PublishKafka and I don’t want to generate back pressure in my relationships. So I use the InvokeScriptedProcessor to build my own rolling processor (it’s faster because I am not using any state information):

this processor takes a frequency duration as a parameter (that I’ll set to 1 minute in this example)

for every flow file coming in, it will extract the lineage start date to compute max and mean lineage duration over the rolling window. If the last flow file sent in the success relationship was less than one minute ago, I’ll route the flow file to drop relationship (that I set to auto-terminated). If it was more than one minute ago, I update the attributes of the current flow file with the mean and max of all the flow files since the last “success” flow file and route this flow file in the success relationship

Since flow files are coming into my processor at a high rate, I know that my processor will release one flow file every minute with the mean and max of the lineage duration for the flow files of the last minute.
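The rolling logic described above can be sketched in plain Python, outside of NiFi. This is not the actual scripted processor: the class and method names are made up for illustration, and the handling of the very first flow file (it only starts the clock and is dropped) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RollingLineageWindow:
    """Aggregates lineage durations and releases mean/max once per period."""
    frequency_ms: int
    last_release_ms: Optional[int] = None
    count: int = 0
    total_ms: int = 0
    max_ms: int = 0

    def offer(self, now_ms: int, lineage_start_ms: int) -> Optional[Tuple[float, int]]:
        """Record one flow file.

        Returns (mean, max) when the flow file should go to "success",
        or None when it should go to "drop".
        """
        duration = now_ms - lineage_start_ms
        self.count += 1
        self.total_ms += duration
        self.max_ms = max(self.max_ms, duration)

        if self.last_release_ms is None:
            # Assumption: the first flow file only starts the window.
            self.last_release_ms = now_ms
            return None
        if now_ms - self.last_release_ms < self.frequency_ms:
            return None  # last "success" was too recent: drop

        result = (self.total_ms / self.count, self.max_ms)
        # Reset the accumulators for the next window.
        self.last_release_ms = now_ms
        self.count, self.total_ms, self.max_ms = 0, 0, 0
        return result


if __name__ == "__main__":
    window = RollingLineageWindow(frequency_ms=60_000)
    for now, start in [(0, -100), (30_000, 29_800), (60_000, 59_700)]:
        print(now, window.offer(now, start))
```

Keeping only a count, a sum and a max in memory is what makes this cheap: no per-event state has to be stored, at the cost of losing the current window if the processor restarts.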

Then I use a ReplaceText processor to construct the JSON payload that I’ll send to the Metrics Collector using the InvokeHttp processor.

Note that I use the ID of the processor as the “instanceid” attribute in the JSON.
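For reference, here is a hedged sketch of what such a payload and request can look like, written in Python rather than with ReplaceText and InvokeHttp. The field names follow the Ambari Metrics timeline format as I understand it (metricname, appid, instanceid, hostname, timestamp, starttime, metrics). The metric name, the processor ID and the collector address are placeholders, and reusing "nifi" as the application ID for custom metrics is an assumption.

```python
import json
import urllib.request


def build_ams_payload(metric_name, value, timestamp_ms, hostname,
                      instance_id, app_id="nifi"):
    """Build the JSON body for one data point (assumed AMS timeline format)."""
    return {
        "metrics": [{
            "metricname": metric_name,
            "appid": app_id,
            "instanceid": instance_id,   # here: the ID of the processor
            "hostname": hostname,
            "timestamp": timestamp_ms,   # epoch milliseconds
            "starttime": timestamp_ms,
            "metrics": {str(timestamp_ms): value},
        }]
    }


def build_ams_request(collector_url, payload):
    """Prepare (but do not send) the POST request to the collector."""
    return urllib.request.Request(
        collector_url.rstrip("/") + "/ws/v1/timeline/metrics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    body = build_ams_payload("lineage_duration_mean", 150.0, 1494000000000,
                             "pvillard-hdf-2", "my-processor-id")
    request = build_ams_request("http://ams-host:6188", body)
    print(request.get_method(), request.full_url)
    print(json.dumps(body, indent=2))
```

Passing the prepared request to urllib.request.urlopen would actually send it; in the flow described here, that is the job of the InvokeHttp processor.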

Then, I use the InvokeHttp processor (with a scheduling/frequency of 1 minute):

Now, I can use this information to build the corresponding graph in my Grafana dashboard:

I can see that, on average, it takes about 150 milliseconds to generate my flow file, publish it to my Kafka topic and get it into my scripted processor. I could also generate one metric per host of my cluster to check whether a node is performing badly compared to the others.

Now you can easily send your custom data into AMS and create dashboards for your specific use cases and workflows.

Basically the Ambari reporting task is trying to send metrics to the Ambari Metrics Service. This service needs to be running normally. Based on this error, it looks like a communication error between the reporting task and the metrics service. I honestly don’t know the details about the Hortonworks Sandbox but it could be something related to how the sandbox is packaged. I’d suggest asking the same question on the Hortonworks Community forum as you’ll have more chances to get a solution to your problem. Sorry about that!

However, the feature is under-documented. I cannot find any related documentation explaining each metric.

Can anyone tell what is the difference between BytesReceivedLast5Minutes and BytesReadLast5Minutes (and, analogously, I suppose, between FlowFilesSentLast5Minutes and BytesWrittenLast5Minutes)? I believe “Received” means “bytes received via input ports or from external sources such as file system, Kafka, etc.”, while “Read” is “the sum of all bytes read in any related IO operation within a Process Group”. Please correct me if I’m wrong.

Secondly, I cannot create a Process Group where FlowFilesReceivedLast5Minutes or FlowFilesSentLast5Minutes is not equal to 0. Can anyone elaborate on what these metrics mean?

If you are looking at a process group, then you’ll have values for received/sent if data goes in/out of the process group through input/output ports. Regarding Read/Written, I believe this is the aggregation of all the read/write statistics of the components inside the process group.

What you see in Ambari corresponds to the statistics you can see in the Status History view of the root process group (which you can access from Summary View / Process Group tab). In the Status History view, each metric in the list of available metrics has a “?” next to it with a short description of what the metric represents. However, you’re right, we could document it in the NiFi documentation and I opened a JIRA to improve the docs.

“values for received/sent if data goes in/out of the process group through input/output ports” – In fact this is not true. For instance, a Process Group with ZERO input ports can have BytesReceived greater than zero (i. e. flow starts from GetFile or ConsumeKafka).

I have analyzed the UI code and it turns out that the UI shows different metrics from those sent to Ambari via AmbariReportingTask. More precisely, AmbariReportingTask collects status.getFlowFilesSent(), status.getBytesSent() (analogously Received) while the UI shows status.getInputCount(), status.getInputContentSize() (analogously Output).

That explains my confusion when I noticed the discrepancy between UI metrics and Ambari metrics. At that time, I tried to map Ambari metrics to metrics displayed in UI. I don’t know whether it’s on purpose or by mistake 🙂

[…] Then I split my JSON arrays and extract the field containing the name of the connection to only keep the connections suffixed by _GRAFANA. Then I transform the content to match the format expected by AMS and send the request to AMS (more details here). […]

Hi, that’s something provided out of the box when deploying NiFi with Ambari: the Ambari Metrics Service is collecting this kind of information for each host of the cluster. I did nothing on NiFi’s side for this one. But you could use a script to collect this information with NiFi and send it to the store you want.

Hi there, I am trying to monitor multiple NiFi instances that have already been deployed in separate AWS EC2 instances. Is it possible to set up an Ambari environment to monitor them? Your article seems to suggest the NiFi instance is created via Ambari. Thank you.

Hi. If you didn’t deploy NiFi using Ambari / Hortonworks DataFlow platform, I’d rather recommend a different approach: using the S2S reporting tasks, you could send the monitoring data into an Elasticsearch instance and use Grafana (or something similar) to display the monitoring data. Note that when you’re using Ambari, NiFi is just using the Ambari Reporting Task to send the information to the Ambari Metrics Service, nothing more.

It’s not only Ambari that is needed but the Ambari Metrics Service (that is running an embedded HBase to store the metrics). Check that the URL you’re using is the one exposed by the Ambari Metrics Collector.

Grafana is really about dashboarding, you can have a lot of underlying databases and use various connectors. You could, for instance, use Prometheus (a reporting task to send monitoring data to Prometheus has been recently added into NiFi).