Cluster Performance Profiling

Downloads

You can download the profiler binaries, scripts and samples from this link.

What is Profiling?

Most of you have heard of VisualVM, YourKit, hprof and similar tools. Using them you can see which parts of your application run slowly, discover why they are slow, and fix your code.

For example, YourKit profiler looks like this:

The profiler gathers information about how long each class/method executes and visualizes it as a tree. As a result, you can see the execution time in milliseconds and each method's share of total CPU time relative to the rest of the code.

It is worth profiling EVERY application you develop, and it is best done during the development phase, not after the app is released and a customer complains that it works slowly.

How Do Profilers Work?

There are two major ways a profiler can work: instrumentation and sampling.

Instrumentation adds extra bytecode to your methods to record when they are called and how long they execute.

Sampling takes dumps of all the threads periodically in order to estimate how much CPU time each method takes.

Both approaches have their pros and cons, but either way we get to see what runs slowly, which is the main reason we need a profiler.
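As a rough illustration of the sampling idea, you can approximate it with nothing but the JDK's jstack tool. This is only a sketch: the PID is a placeholder for a running JVM of yours, and doing this from the outside is far less efficient than a real in-process agent.

```shell
PID=12345                        # hypothetical PID of a running JVM
for i in $(seq 1 100); do
  jstack "$PID"                  # dump the stacks of all threads
  sleep 0.01                     # sample roughly every 10 ms
done |
  grep -E '^[[:space:]]*at ' |   # keep only the stack-frame lines
  sort | uniq -c | sort -rn |    # count how often each frame was seen
  head                           # most frequently sampled frames first
```

Frames that show up in many samples are the ones consuming CPU time; this frequency count is exactly the statistic a real sampling profiler collects, just with much more overhead.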

How to Profile a Cluster?

We are all used to local profiling, where the app and the profiler run on the developer's workstation. But what should we do if we have a distributed application that runs in a cluster, e.g. a Hadoop application?

It took me a while to find a good way of profiling a distributed application, but I found one: the StatsD JVM Profiler. Two articles describe how the StatsD JVM Profiler works:

A “javaagent” is a piece of code used to instrument programs running on the JVM. The profiler's javaagent gathers stacktraces from JVM processes 100 times per second and sends them to a dedicated host running InfluxDB (https://influxdb.com), a time-series database. In other words, the profiler works by sampling.

After we run the distributed app and the stacktraces are gathered, we run a set of scripts against this database to extract data about class/method execution and visualize it as a flame graph.

Imagine you have a java class with a method A that calls method B, B calls C, and C calls D. We'll denote a call by an arrow: A->B->C->D. Every 10 milliseconds a special javaagent (embedded into the JVM process) captures the stacktraces of all the threads of the process. For example, it may “see” that the main thread of our app is in A->B->C (method C is executing now); at the next sample it observes A->B->C->D (method D is executing now); at the next it sees A->B->C->D again; and at the next we are in A->B (so the code of B itself is executing).

If we put all stacktraces vertically, we can visualize it like this:

        +-------+-------+
        |   D   |   D   |
+-------+-------+-------+
|   C   |   C   |   C   |
+-------+-------+-------+-------+
|   B   |   B   |   B   |   B   |
+-------+-------+-------+-------+
|   A   |   A   |   A   |   A   |
+-------+-------+-------+-------+
 0th ms  10th ms 20th ms 30th ms

Time goes from left to right, and every stacktrace reads from bottom to top. Therefore, we spent 20 milliseconds in method D, method C executed its own code for just 10 milliseconds, and method B executed its own code for 10 milliseconds. If we join adjacent cells that belong to the same method call, we get this:

        +---------------+
        |       D       |
+-------+---------------+
|           C           |
+-----------------------+-------+
|               B               |
+-------------------------------+
|               A               |
+-------------------------------+
 0th ms  10th ms 20th ms 30th ms

To summarize, the flame graph is an approximate visualization of what is going on: what share of CPU time each class/method consumes. From the diagram above we can conclude that:

Method D was executing for 50% of the total CPU time;

Method C was executing for 75% of the CPU time: 25% of the time it was executing its own code, and the remaining 50% it spent in calls to D;

Method B was executing for 100% of the CPU time (as was A), but only 25% of the time was spent in B's own code; the rest of the time it was calling other methods.
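The arithmetic above can be reproduced directly from the four samples. Below is a small sketch using the collapsed-stack format that flamegraph.pl consumes (“frame;frame;... count”); the file contents encode exactly the A->B->C->D samples from the diagram:

```shell
# The four samples from the diagram, in collapsed-stack form:
cat > stacks.txt <<'EOF'
A;B;C 1
A;B;C;D 2
A;B 1
EOF

# A method's inclusive CPU share = samples containing it / all samples.
awk '{
  n = split($1, frames, ";")
  for (i = 1; i <= n; i++) incl[frames[i]] += $2
  total += $2
}
END {
  for (m in incl) printf "%s %d%%\n", m, 100 * incl[m] / total
}' stacks.txt | sort
```

This prints A 100%, B 100%, C 75%, D 50%, matching the percentages listed above.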

Sampling has a bias of its own: if D called E and E finished before we captured the stacktraces, we won't see E's execution at all. However, sampling adds little overhead: no extra bytecode instructions are added to the application threads; instead there is just one additional thread that gathers the stacktraces.

Achieved Results

Before describing in detail how to install everything and profile the Hadoop cluster, let me share some results obtained at one of our customers, so that you can better understand the benefits.

A customer of ours had a crawling application that periodically downloaded hundreds of websites and parsed data about car sales. The application leveraged Apache Nutch as the crawling engine, and the development team maintained their own fork of it, rewriting the code responsible for extracting the data and inserting it into the database.

The customer complained about the slow performance of the crawler: they didn't like that it could crawl only about 100 sites per day. They wanted a much bigger figure and expected to crawl thousands of websites per day on the same cluster.

The first question on my mind was “what exactly is executing slowly there?”. To find out I went through all the steps described below to profile the application, and I found some interesting things:

Above you can see just one of the flame graphs obtained: the graph for the fetcher job. You will find the others in the attachment to this article.

Now let me summarize what I discovered:

The Gora library, used to insert data into HBase, is terribly slow or is used improperly: it performs redundant full scans of metadata every time it needs to insert a piece of data into HBase. And it does this in ALL the jobs, not just the fetcher.

Actions performed: the development team started investigating the Gora performance problem to find the reasons for its slow behavior. Different ideas for improving the situation appeared (e.g. replace HBase with Cassandra and see what happens; replace Gora with another ORM, etc.), and at the time of writing this article the team is still working on this task.

Hadoop spends 18% of the CPU time of the reducer job on … reading XML configuration files (mapred-site.xml and the rest). This sounds very weird, I know, but the profiler doesn't lie: the code really spends that long in Xerces library methods.

Actions performed: the development team is reading manuals, the Apache web site and the source code to find out how configuration files are parsed. The reason has not yet been found (at the time of writing this article), but we all hope it is simply a misconfiguration.

A lot of time is spent on HDFS initialization (the third column of the picture above). The stacktraces of the method calls show that the namenode responds very slowly. Further analysis with the IOPS benchmark and various file-copying operations, together with the performance counters of the Cloudera Monitoring Services, clearly indicated a disk I/O performance problem with small files, and a network I/O problem between some hosts.

Actions performed: a support request was submitted to the system administrator team with clear evidence of all the found hardware issues.

14.75% of the CPU time was spent on real work (parsing HTML files), but from the generated profiler files we can easily see that every time the fetcher job ran, it used JUST ONE reducer.

Actions performed: a dedicated developer is now investigating the Nutch code to find out how to rewrite it to use a different partitioning function.

After profiling the application we could clearly see which code runs slowly, and we had concrete directions for further analysis and investigation to improve the situation. If we look at the flame graphs for all phases, the prospect is to cut off A LOT of execution time by speeding up blocks of code that repeat and run abnormally slowly:

Installation

Building StatsD JVM Profiler

The first thing you need to do is build the JVM profiler. You can simply take the files attached to this document and avoid any compilation. However, if you want the latest version, you can build it yourself.

You will get the file “statsd-jvm-profiler-0.8.3-SNAPSHOT.jar” in the “target” subdirectory. As you may notice from the URL in the “git clone” command, we used my own fork of the StatsD JVM Profiler. My modification adds several features to the original version, in particular:

Added gathering of the JVM ID and hostname, pushed into InfluxDB under the “jvmName”, “host” and “pid” tags;

Added error handling for OOM exceptions;

Added gathering of the number of CPU traces and the number of bytes in them (to check that the profiler is not consuming too much memory);

Optimized some code to decrease memory usage;

Rewrote the Python extractor script to generate a separate file for every JVM process.

Distribute the JAR File

You need to put this jar file into the /var/lib directory of EVERY machine in the cluster. I know this is an unpleasant task, but it has to be done. You may use the script below to speed up the process:
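For example, a sketch of such a distribution script; the host names are placeholders for your own node list, and it assumes you have SSH access with rights to write into /var/lib:

```shell
#!/bin/sh
# Copy the profiler jar to /var/lib on every node (hypothetical host list).
JAR=statsd-jvm-profiler-0.8.3-SNAPSHOT.jar
for host in datanode01 datanode02 datanode03 namenode01; do
  if scp "$JAR" "root@$host:/var/lib/"; then
    echo "OK: $host"
  else
    echo "FAILED: $host" >&2
  fi
done
```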

Depending on your permissions, this script may not work for you; if so, consult the administrators of the cluster and ask them to assist you in putting the file into /var/lib on every machine.

Install InfluxDB

You need a dedicated machine for cluster profiling. This machine will host the InfluxDB database, and on it you will perform the analysis of the logs. DO NOT USE ONE OF THE DATANODES, THE NAMENODE, OR ANY OTHER MACHINE RUNNING HADOOP SERVICES. It should be a SEPARATE, DEDICATED MACHINE with the same CPU and RAM as your datanodes.

After that, check the web interface of InfluxDB: go to http://localhost:8083/ in a browser (substituting the name of your profiler server for “localhost”) and you will see this:

You may run the command “show databases” in the “Query” box to verify that it works properly (it will tell you that there are no databases yet).

REMEMBER FOR THE FUTURE! If you ever restart the InfluxDB service (e.g. after a system restart) and it contains a huge database, the service will start VERY SLOWLY. I don't know the reason; maybe it initializes some data in memory. Don't panic and don't assume it is broken! Just read the log in /var/log/influxdb and wait for this line to appear:

Create a Database

You may create the database either in the web interface (http://localhost:8083) or by running the shell with the command
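The commands would look roughly like this. This is a sketch against InfluxDB 0.9's query language; “profiler-host” is a placeholder for your profiler machine, and you can also paste the statements into the web “Query” box:

```shell
# Create the database and a user with the same name and password,
# using the influx CLI shipped with InfluxDB.
influx -host profiler-host -port 8086 <<'EOF'
CREATE DATABASE profiler
CREATE USER profiler WITH PASSWORD 'profiler'
GRANT ALL ON profiler TO profiler
EOF
```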

Thus, we’ve created the database called “profiler” and a user with the same name and the same password.

Change Hadoop Settings

Our customer was using the Cloudera distribution of Hadoop, and there is one handy configuration parameter of YARN in Cloudera Manager:

Type the word “environment” in the search box at the top left and press Enter. This will filter the configuration parameters to those containing this word. We need to enter the following string into the box “YARN (MR2 included) Service Environment Advanced Configuration Snippet (Safety Valve)”:

Change the red parts to the values of your environment (read the following paragraph for a detailed explanation of _JAVA_OPTIONS). After you change this setting, press the “Save Changes” button and (VERY IMPORTANT!) press the button that will appear here:

_JAVA_OPTIONS Explained

_JAVA_OPTIONS is an environment variable that the JVM reads when a new process is launched. Hadoop launches application masters and containers, and these JVM processes will pick up this variable if it is configured properly.

The value of this variable binds the javaagent to the process. A javaagent has a premain() method that executes before main(). The premain() method of the StatsD Profiler starts a thread that gathers the stacktraces. You may see the source code in the Agent.java file.

As you may notice, the javaagent takes parameters; let us split them line by line for convenience:
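For reference, the full value looks roughly like this. This is a hedged reconstruction: the jar path matches the one we distributed, the parameter names follow the profiler's README, and the tag names/values are placeholders for your own:

```shell
# Hypothetical agent configuration string (one logical line, split
# here with backslashes for readability).
export _JAVA_OPTIONS="-javaagent:/var/lib/statsd-jvm-profiler-0.8.3-SNAPSHOT.jar=\
server=profiler-host,\
port=8086,\
reporter=InfluxDBReporter,\
database=profiler,\
username=profiler,\
password=profiler,\
prefix=value1.value2,\
tagMapping=tag1.tag2"
```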

“server” is the IP address or hostname of the server where you installed InfluxDB. Port 8086 is the default port InfluxDB listens on. “reporter” is a fixed value that must equal “InfluxDBReporter”: the StatsD profiler can talk to two possible back ends, StatsD and InfluxDB, and we will use only the latter.

“database”, “username” and “password” are the InfluxDB credentials (we created a database named “profiler” and a user with the same name and password).

“prefix” and “tagMappings” are the most confusing parts. InfluxDB has a notion of tags; you may read about them here: https://influxdb.com/docs/v0.9/concepts/key_concepts.html. Each measurement in InfluxDB can have one or more tags. By setting prefix and tagMappings as shown above, you will get tags like tag1=value1, tag2=value2, …, tagN=valueN. You may set up as many tags as you want in the configuration string; just separate them with dots. This is deliberate: during one profiling session you will set SOME tag/value pairs, and during another session you will set OTHER pairs. Later it will be more convenient to analyze the stacktraces using these tags.

-javaagent Explained

Placing the -javaagent parameter into _JAVA_OPTIONS is just one way to attach the agent. Another way is to arrange for all the java processes to be run as

java -javaagent:/var/lib/… -jar file.jar …

So, if you do NOT have a Hadoop environment and you can control the command-line parameters of the child java processes running in the cluster, simply make sure “-javaagent” is added to the appropriate command lines.

Another option is to make the _JAVA_OPTIONS environment variable globally visible to all processes. One way to achieve this is to add the line

export _JAVA_OPTIONS=-javaagent:/var/lib/…

to the file /etc/environment on every machine and reboot the cluster. But this approach has one problem: in the Cloudera distribution, this variable WON'T reach the application masters and containers, because they are launched by the YARN node managers in a “safe” way: all variables are cleared and set explicitly before the child java processes run.

Conclusion: putting _JAVA_OPTIONS into /etc/environment doesn't guarantee that ALL java processes will receive it.

Other Hadoop Distributions

I have never configured environment variables for other Hadoop distributions (e.g. Hortonworks), but I believe it is not hard to find where to put this variable. For HDP 2.3 I see it here in the Ambari Server web interface:

Most probably this is the right place to put the same _JAVA_OPTIONS string.

Profiling the Cluster

Getting the Stack Traces

The sequence of steps to get the stack traces of executing processes is as follows:

Make sure you distributed statsd-jvm-profiler-0.8.3-SNAPSHOT.jar to /var/lib on every machine. After copying, open an SSH session to some of the machines and simply check that the file is there. It is preferable to run “md5sum statsd-jvm-profiler-0.8.3-SNAPSHOT.jar” on the remote machines and on your local machine, to make sure the file is identical in both places.

Make sure you installed InfluxDB on a separate machine and created a database “profiler” with username “profiler” and the same password, as described earlier in this document.

Make sure you set the _JAVA_OPTIONS variable in your Hadoop distribution as described in the previous sections. If you're using something other than Hadoop, make sure your child java processes either see the _JAVA_OPTIONS environment variable or are launched with an explicit -javaagent parameter.

Run the job(s) in your cluster. After the job has been running for 10-20 seconds, run the command “show measurements” against the InfluxDB database: it will show you the stacktraces as measurements.
NOTE: THE COMMAND MAY RUN SLOWLY. Just wait several seconds for the results (the browser interface will be unresponsive during this period). If the stacktraces don't arrive in the InfluxDB database, check the following:

Check that your java processes get the _JAVA_OPTIONS variable and that its value (the parameters of -javaagent) is correct: the right server name, the right credentials, etc.

Check if the profiler machine can be pinged from the datanodes.
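Both checks can be scripted; here is a sketch (“profiler-host” is a placeholder, and the flags are those of the InfluxDB 0.9 influx CLI):

```shell
# Is the profiler machine reachable? Run this from a datanode.
ping -c 3 profiler-host

# Are stacktraces arriving? This queries InfluxDB without the web UI.
influx -host profiler-host -port 8086 \
  -username profiler -password profiler -database profiler \
  -execute 'SHOW MEASUREMENTS' | head
```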

After the job in your cluster finishes, you may “switch off” the profiler. For Cloudera's distribution, just clear the edit box where you entered _JAVA_OPTIONS, press “Save Changes”, and re-deploy the cluster configuration (the button at the top right corner of the same page). For other distributions there should be analogous ways to “unplug” the agent.

Converting Stack Traces into SVG Files

In order to build the flame graphs shown at the beginning of this article, we need two steps:
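The first step is dumping the stacktraces out of InfluxDB with the influxdb_dump.py script from the profiler's distribution. A hypothetical invocation is shown below; the exact flag names may differ between versions of the script, and -x sets the output file prefix discussed further down:

```shell
# Dump all profiled stacktraces from InfluxDB into per-process text files.
python influxdb_dump.py -o profiler-host -u profiler -p profiler \
    -d profiler -t tag1=value1 -x "tr_"
```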

In your case you need to make several changes: change the server name, credentials, and tag names/values to match the parameters of the string you passed to _JAVA_OPTIONS when you set up the agent.

This command will run for a long time, producing files with the “tr_” prefix. You will get one file for every JVM process that was launched in the cluster and profiled.

The file name starts with the prefix, which you set with the -x “tr_” parameter of the python script. Then come the date and time of process execution; in fact, this is the time the first stacktraces were received by the server, which is about 10 seconds after the JVM started. Then comes the hostname, and finally the PID of the process.

So, after running the influxdb_dump.py you got a set of txt files with stacktraces.

Now we need to convert these traces into SVG files. The script flamegraph_files.sh looks like this:
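A sketch reconstructing the script from the description that follows (the attached version may differ in details):

```shell
#!/bin/sh
# For every extracted trace file, drop unwanted stacks with
# filterlines.py and render an SVG flame graph with flamegraph.pl.
for f in tr_*; do
  ./filterlines.py -f filter.txt "$f" | perl flamegraph.pl > "$f.svg"
done
```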

This script enumerates the files whose names begin with “tr_” and runs flamegraph.pl on top of them.

But before passing lines to flamegraph.pl, it filters them through filterlines.py. The latter passes to the output only those stacktraces that DO NOT contain any of the strings listed in filter.txt. Here is why we added the filtering:

As you may notice, we don't need information about threads that are merely waiting.
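If you don't have filterlines.py at hand, plain grep does essentially the same job. A sketch with hypothetical filter entries and a tiny sample trace file:

```shell
# Hypothetical filter list: drop parked/waiting threads.
cat > filter.txt <<'EOF'
sun.misc.Unsafe.park
java.lang.Object.wait
EOF

# A tiny sample trace file in collapsed-stack form.
cat > tr_example.txt <<'EOF'
main;A;B;C 10
idle;sun.misc.Unsafe.park 500
main;A;B 5
EOF

# Drop every line containing any of the fixed strings from filter.txt,
# mirroring what filterlines.py does before flamegraph.pl runs.
grep -v -F -f filter.txt tr_example.txt
```

Only the two “main” stacks survive; the parked idle thread is filtered out.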

But if you don’t want to filter anything, just change the line

./filterlines.py -f filter.txt $f | perl flamegraph.pl > $f.svg

onto the following one:

perl flamegraph.pl <$f > $f.svg

and then run flamegraph_files.sh.

ATTENTION! flamegraph.pl has one feature that may “blow your mind” if you don't pay attention to it: the minwidth parameter. If not set, it defaults to 0.5. This is the minimal width of a flame graph bar: anything narrower is not included. So, if you wish to see every method call and don't want flamegraph.pl to filter anything, change the line

./filterlines.py -f filter.txt $f | perl flamegraph.pl > $f.svg

to this one:

./filterlines.py -f filter.txt $f | perl flamegraph.pl --minwidth 0 > $f.svg

If you do this, everything will be displayed in the SVG, but the file may become so big that it opens slowly in the browser. You may play with this parameter, e.g. set it to 0.05 or less. All parameters are described here: https://github.com/brendangregg/FlameGraph

As a result of running flamegraph.pl on those text files with stacktraces, you will get a set of SVG files next to the txt files, and you may open any of them and perform the analysis.

Performing Analysis

We could have omitted this part, but I would like to share some thoughts about what you can do with the set of SVG files you obtained.

First of all, remember that the SVG files are interactive: open them in a browser (preferably Chrome) and play with them: click on a bar to drill down, and click on a bottom bar to go back up.

The first question that arises is “what shall I do with this bucket of files? How to find out what is what?”

The answer: you got an SVG file for every JVM process that ran in the cluster, so you can match each file name to the actual job that ran there:

If we look at the resource manager, we can see the start time of the job. There will be several SVG files corresponding to this record (one for the appmaster, several for the mapper attempts and reducer attempts). Just look in the list of generated files and find those whose names contain a date/time about 10 seconds after this job started.

The second question: how do we find out whether a process is a mapper, a reducer, or an appmaster?

Answer: there is no way other than to look inside the SVG file. Open it in a browser and you will see this:

When I looked through the files, I split them into folders, starting with the AppMasters; the remaining files were the corresponding mappers and reducers. After marking the appmasters it was easy to split everything into folders and perform the analysis:

Unfortunately, it is impossible to determine this programmatically (in the profiler code), because we are running inside our separate statsd-jvm-profiler-0.8.3-SNAPSHOT.jar, and this code HAS NO IDEA what JVM it is attached to.

The next question is “what should I pay attention to when analyzing the flame charts?”

Answer: pay attention to those method calls which are wide on the flame graphs. Here is an example:

The blue markers show the method calls where I would drill down. To drill into a method, simply click on it: the chart will redraw starting from this method upward. It is easy to see that Gora's call to readFields() causes FileOutputStream.writeBytes() to run for a VERY long time. When I saw this, I benchmarked the file system. Here is what I found using the IOPS benchmark (http://benjamin-schweizer.de/measuring-disk-io-performance.html):

As you may notice, my virtual machine is 10 times faster than the datanodes. This tells us it is really worth looking at the hardware configuration and talking to the system administrators ASAP!

Well, there are a lot of analysis scenarios; it is impossible to cover them all here. We'll attach some documents showing what we've found so far; you may look at them and get more ideas.

Several Useful Scenarios

I would like to suggest two useful scenarios for managing the data in InfluxDB.

Scenario 1:

If you profile one version of the software and wish to compare the results with another version, I suggest creating a different database in InfluxDB and changing the profiler configuration string in _JAVA_OPTIONS to contain the OTHER database name. This way, you can extract data with influxdb_dump.py for one database name, move the resulting svg files to a separate folder, then run influxdb_dump.py again with the other database name to get the second set of files.

In my case the dedicated profiler machine was provisioned temporarily, and I had to move the InfluxDB database backup to my own VM on a local machine to perform the analysis. Note that after restoring an InfluxDB database it takes a long time for it to become live: watch the “top” command and see how much CPU the influxd process consumes. If it consumes 80-100% of a CPU, the database is still being loaded into memory, so don't rush to fetch data from it until it goes live.

Conclusion

With the help of the profiler we can now see what is happening. In my view, profiling is a “must have” in every software development process: it reveals problems that were invisible without it.

If you have any questions, comments, or ideas on how to improve this or what to add, feel free to contact me.