
2012.06.01

Why are AWS Command-Line Tools so Slow?

Amazon's Elastic Compute Cloud command-line tools are useful building
blocks for creating more complex shell scripts.
They allow you to start and stop instances, get their status,
add tags, manage storage and IP addresses, and so on.
They have one big disadvantage: they take a long time to run.
For instance, running ec2-describe-instances for six instances
takes 19 seconds on an m1.small AWS Linux instance.
One answer given
is that this is caused by JVM startup overhead.
I found that hard to believe,
because on the same machine a Java "hello world" program executes in 120ms,
and running ec2-describe-instances --help takes just 321ms.
So I set out to investigate using multiple tracing tools and techniques;
this is what I found.

By running the ec2-describe-instances shell script with the shell's -x
flag, I found that ec2-describe-instances calls ec2-cmd.
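
Here is a minimal sketch of that step; the installed script's path and
the trace output will vary by system.

bash -x "$(command -v ec2-describe-instances)" 2>&1 | grep ec2-cmd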

From then on I invoked Java directly, to avoid the noise caused
by the shell scripts; their own time overhead was negligible.

I then checked the interaction between the command and the AWS web services.
I did this by running tcpdump and strace in parallel
and looking at the logs' timestamps.
The following command sequence starts logging packets,
runs the Java program under strace,
and then stops the packet logging.
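
(The exact commands are not reproduced here; this is a rough
reconstruction, where the output file names are assumptions and
"java ..." stands for the direct java invocation extracted from ec2-cmd.)

sudo tcpdump -w /tmp/aws.pcap tcp &	# start logging packets
strace -f -tt -o /tmp/aws.strace java ...	# run the traced command
sudo kill $!	# stop logging packets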
On an otherwise idle machine this gives you a very clear packet trace.

Reasoning that the numerous futex calls in the strace log simply showed
the synchronization between threads, I decided to try to eliminate the
thousands of gettimeofday calls,
because of a reference I found regarding their cost on virtualized machines.
After some searching and reading the command's invocation code
I found I could eliminate the gettimeofday calls by setting
an environment variable.

export EC2_JVM_ARGS=-XX:-UsePerfData

This reduced the clutter in the strace file, but didn't improve
the performance.
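One can quantify the reduction by counting the remaining calls in the
trace file (path as assumed above):

grep -c gettimeofday /tmp/aws.strace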

I then tested how long it took to open and read all the files,
by writing a command sequence that got the file names from the trace
log and concatenated their contents.
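
The original command sequence isn't shown; a rough equivalent, assuming
the trace sits in /tmp/aws.strace, is the following.

# time a pass over every file the traced run opened
# (a sketch; paths containing spaces would need more careful handling)
time cat $(awk -F '"' '/open\(/ { print $2 }' /tmp/aws.strace | sort -u) >/dev/null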

I traced the cause to the way the invocation shell script built the
class path, by concatenating all the jar files in the library directory.

for jar in "${LIBDIR}"/*.jar ; do
	CP="${CP}:${jar}"
done

This sequence results in the class path,
stored in the CP variable, starting with a colon,
which is taken to mean to search in the current directory.
I fixed that problem by adding the following line after the for
loop:

CP=$(echo $CP | sed 's/://')
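
(A pure-shell alternative that avoids spawning the two extra processes
is CP=${CP#:}, which strips a single leading colon.)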

This improved the command's execution time, but only by
a couple of seconds.

While examining the strace files I found many long read/seek sequences,
like the following.
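
(The actual trace lines are not reproduced here; the pattern looked
roughly like this, with illustrative offsets, on a descriptor open
on one of the jar files.)

lseek(3, 32028342, SEEK_SET)            = 32028342
read(3, "PK\3\4\n\0\0\0\0\0"..., 30)    = 30
lseek(3, 32028402, SEEK_SET)            = 32028402
read(3, "\312\376\272\276\0\0\0\62"..., 1024) = 1024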

This got me thinking about the performance implications of these calls,
and whether the seeks were somehow interacting with
Amazon's Simple Storage Service (S3), which was used as the machine's
backing storage.
Searching for data on AWS's performance, I came across an excellent
technical report
written by Simson L. Garfinkel of Harvard's Computer Science Group.
There he mentioned that S3 performance depended on the size of the objects
stored.
Although the operating system's file system and buffer cache should
isolate the seek system calls from S3, I decided to test whether
using local storage would improve the performance.
I therefore modified the instance's startup command to attach some
local (ephemeral) storage to the machine.
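
A sketch of such an invocation, using the tools' block-device-mapping
option (the AMI id is a placeholder):

ec2-run-instances ami-xxxxxxxx --instance-type m1.small -b /dev/sdb=ephemeral0

The ephemeral device can then be formatted and mounted, and the
measurements repeated from there.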

I then considered whether all those seek/read calls were an
inefficiency of the OpenJDK that comes with Amazon's AWS Linux.
Could Sun's (now Oracle's) Java runtime be more efficient?
Based on instructions
by Dmitry Leskov I downloaded and ran Oracle's Java 1.7,
but the performance appeared to deteriorate a bit.
Another dead end.

In the end it turned out that all the solutions I tried produced negligible or
(in the case of Oracle's JVM) negative results:

removing thousands of calls to gettimeofday,

optimizing Java's library path,

using local storage,

using Oracle's JVM.

Apparently, AWS command-line tools are indeed slow due to Java's
long startup time.
Dynamic class loading means that small programs, or even an AWS command
run with the --help flag, execute quickly;
but when the command needs to do some actual work, and thus loads
all the required class files, it takes many seconds to run.
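
You can observe this by counting the classes the JVM loads; a sketch,
where "java ..." again stands for the invocation extracted from ec2-cmd:

java -verbose:class ... 2>&1 | grep -c '\[Loaded'

A trivial program typically loads a few hundred classes; a run that does
real work can load thousands.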

So what can one do?
Here are some things that do work.

Use a faster multi-core machine:
an m1.large instance runs the tools twice as fast as an m1.small one.

Make the command work harder for you. For instance, pass multiple instance ids to ec2-terminate-instances, or multiple tags to ec2-create-tags,
instead of calling each command multiple times.
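
For example (the instance ids are placeholders):

ec2-terminate-instances i-11111111 i-22222222 i-33333333
ec2-create-tags i-11111111 i-22222222 --tag Project=benchmarks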

Use the API from another language.
As an example, you can obtain
your machine's instance id in just 10ms by using the following shell command.

curl -s http://169.254.169.254/2011-01-01/meta-data/instance-id

If you're looking for the answer to why the AWS command-line tools are slow,
I hope that this blog entry has saved you the expense of running
down a few blind alleys.
Otherwise, I hope you found out how you can use tracing tools to
investigate performance problems.