
Tuesday, August 27, 2013

Today, I was asked on Twitter to have a look at this article by Brice LEPORINI and provide feedback. My answer could not fit in a comments section, so I wrote an article instead. Here goes!

The experiment

Brice set up an experiment to measure the cost of going immutable, compared with staying mutable.

Benchmarking is risky territory, so he rightly emphasizes that his results cannot be used to draw flame-war conclusions like "mutable is soooo much better than immutable, you functional suckers!". There are a lot of flawed benchmarks on the internet that claim to be the only truth, so let's rejoice when an experiment report tries to be objective and complete, and comes with a disclaimer about its limits.

My specs

I ran the code on my laptop, without any application running in the background (not even an X server), to make sure the system was quiet. I used OpenJDK 1.7.0_40. My CPU is a quad-core Intel(R) Core(TM) i7-3720QM @ 2.60GHz.

Also, my memory bandwidth is about 2.5GB/s. This piece of information will matter later.

So, what can we conclude from these results? Not much, I'm afraid.

There are several problems with this experiment on my machine. One of them is that we are measuring operations that take about 1 millisecond. At this timescale, the resolution is far too coarse to see significant differences.

Also, a look at the GC logs shows that the GC durations, for both PSYoungGen and ParNew, are spread between 1ms and 10ms. Put differently: a GC cycle takes between 1 and 10 times the duration of an iteration. A single GC cycle therefore has a huge impact on our microbenchmark.
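For reference, the GC duration figures discussed here come from the GC logs; on a JDK 7 Hotspot, they can be produced with flags along these lines (the log file name and main class are placeholders):

```shell
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log ImmutabilityTest
```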

Now, the real stuff. If we look at the "Parallel GC, 1mx, 1ms" GC log, we can see two distinct phases in memory utilization: first the immutable test, then the mutable test.

Now, if we look at the allocation rate, things get really interesting.

See that almost flat cloud of dots at ~2.5GB/s? It means that, at that moment, the application was creating garbage at that rate. At the beginning of this article, I mentioned that my memory bandwidth was about 2.5GB/s; now you can see it clearly.

The ImmutabilityTest is actually limited by the size of the wires between my CPU and my memory chips. To put it differently, the ImmutabilityTest does not measure how fast we can create immutable date holders; it measures how fast data can travel from my CPU to my RAM.
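You can see this effect with a minimal sketch of my own (the class name and the bytes-per-object estimate are mine, not Brice's code): a loop that does nothing but allocate short-lived objects reports a rate close to the machine's memory bandwidth, not the cost of the constructor.

```java
public class AllocationRateDemo {
    static final class Holder {
        final long a, b, c;
        Holder(long a, long b, long c) { this.a = a; this.b = b; this.c = c; }
    }

    // Writing to a volatile field makes it harder for the JIT to
    // eliminate the allocations through escape analysis.
    static volatile Holder sink;

    public static void main(String[] args) {
        final int iterations = 50_000_000;
        final long bytesPerObject = 40;  // rough estimate: header + 3 longs, JVM-dependent
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = new Holder(i, i + 1, i + 2);  // fresh garbage on every iteration
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double gbPerSec = iterations * bytesPerObject / seconds / 1e9;
        System.out.printf("approximate allocation rate: %.2f GB/s%n", gbPerSec);
    }
}
```

On an allocation-bound workload like this, the reported rate plateaus near the memory bandwidth, whatever the object's constructor does.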

Therefore, all my ImmutabilityTest results in the previous table are invalid.

Final Thoughts

It is quite common for a microbenchmark to measure something different from what we expect. This is another example of such a misconception: the results were quite stable, which made us confident in the experiment, and yet they were completely wrong.

I think Brice had noticed the limits of this benchmark, but did not push the analysis to its conclusion. I hope this article added some clarification and will provide material for future experiments.

By the way, I did not mention CMS or G1, because CMS never ran and G1 needed a bit of tuning. That could also be an interesting topic :-)

Saturday, August 10, 2013

MOOC platforms like Coursera and MIT OpenCourseWare are fantastic. But sometimes the videos provided are not really audible, especially in noisy public transportation. This article shows how to solve this issue.

The issue

Take any course from MIT OCW. It likely comes from a classroom, with some echo and a bit of noise. The problem is that when you are on a bus and want to watch your video lessons, you are easily disturbed by the environment (horns, brake noises, people speaking loudly).

How can we transform a video so that its volume is higher? ffmpeg+sox is a very efficient combination for that.

The process

First, we have to extract the audio track from the video, then amplify the volume, and finally rebuild the whole video using the amplified audio track.

Here, I assume that the file you want to modify is stored in ${video}.

To extract the audio track into a file named "${video}.wav", use this command:

ffmpeg -i $video ${video}.wav

Then, find the maximum amplification you can apply without introducing distortion:
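A pair of commands in this spirit does the job (the output file name is my own convention): sox's "stat -v" prints, on stderr, the largest factor by which the file can be multiplied without clipping, and "-v" applies it.

```shell
# Print the maximum gain that can be applied without clipping
gain=$(sox ${video}.wav -n stat -v 2>&1)
echo "max gain: $gain"

# Apply that gain, writing the amplified track to a new file
sox -v $gain ${video}.wav ${video}-louder.wav
```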

Note that you may need to use the audio codec "libfaac", depending on your version of ffmpeg and how it was built. In that case, drop the parameter "-strict -2", which is only required to enable the experimental audio codec "aac".
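The rebuild step that the codec note above refers to can be sketched like this (the file names and the aac/-strict -2 choice are my reconstruction): copy the video stream untouched and replace the audio stream with the amplified track.

```shell
ffmpeg -i $video -i ${video}-louder.wav \
       -map 0:v -map 1:a \
       -c:v copy -c:a aac -strict -2 \
       amplified-${video}
```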

That's it. Now you can clean up the temporary files that were generated:
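Assuming the amplified track was written to ${video}-louder.wav (my naming, not the original post's), the cleanup is simply:

```shell
rm ${video}.wav ${video}-louder.wav
```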

Tuesday, June 4, 2013

Recently, thanks to Kirk Pepperdine for the idea and a brainstorming session, I started working on Tyrion, a new open-source lock profiler. Now that I have a first version working, it's time to write an article explaining the concepts for those who would like to join.

Why Tyrion?

Because it is interesting!

Consider the following questions:

In my application, which locks are never used?

In my application, which locks are used by only one thread?

In my application, which locks are contended?

To answer them, you have to dig inside the JVM and understand its inner workings; you have to be curious; you may have to manipulate code that you did not write. These challenges sound fun!

How (high-level view)?

I wanted to instrument code to get a clear view of the critical sections of the application. Basically, I wanted to transform this kind of code (assuming bar is an instance variable):
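The idea, expressed at the source level, goes along these lines (a sketch of mine; the interceptor class is a stand-in for Tyrion's real LockInterceptor, and the instrumented form is what the bytecode transformation achieves conceptually):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for Tyrion's interceptor: record who locks what, and when.
class LockInterceptor {
    static final AtomicInteger ENTERED = new AtomicInteger();
    static void enteredLock(Object monitor) { ENTERED.incrementAndGet(); }
    static void leavingLock(Object monitor) { }
}

class Example {
    private final Object bar = new Object();

    // Before instrumentation:
    void foo() {
        synchronized (bar) {
            // critical section
        }
    }

    // After instrumentation, conceptually:
    void fooInstrumented() {
        synchronized (bar) {
            LockInterceptor.enteredLock(bar);
            // critical section
            LockInterceptor.leavingLock(bar);
        }
    }
}
```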

How (tricky technical details)?

The bytecode of method foo looks like this:

GETFIELD #3
MONITORENTER
...
GETFIELD #3
MONITOREXIT

I want to get a reference to the target of the lock. Fortunately, the DUP bytecode is just what I need, since it duplicates the top element of the stack. I just have to transform the code so that it looks like this:
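Based on the description, the transformed listing should look something like this (the INVOKESTATIC targets are illustrative names for the interceptor's methods):

```
GETFIELD #3
DUP                 // duplicate the monitor reference
INVOKESTATIC LockInterceptor.enteredLock(Object)
MONITORENTER        // consumes the remaining copy
...
GETFIELD #3
DUP
INVOKESTATIC LockInterceptor.leavingLock(Object)
MONITOREXIT
```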

And guess what? Using ASM, obtaining this result is really simple: you only have to find MONITORENTER and MONITOREXIT, and add instructions before/after them. See the class LockTransformer.java for more details.

Pitfall 1 - Synchronized methods

This approach is nice, but does not cover all possible cases. For instance, there is no MONITORENTER or MONITOREXIT in synchronized methods: the JVM handles that synchronization internally. The solution is to visit every synchronized method and insert calls to LockInterceptor at method entry and method exit.

Now the question is: what is the lock? Simple: it is "this", and a reference to "this" in an instance method can be pushed on the stack using ALOAD_0.

So the following bytecodes should be added as the first statements of every synchronized method.
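Since ALOAD_0 pushes "this", the inserted prologue would look along these lines (the interceptor name is illustrative):

```
ALOAD_0             // push "this", the implicit monitor of a synchronized method
INVOKESTATIC LockInterceptor.enteredLock(Object)
```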

Pitfall 2 - Static methods

So far, things are cool. However, static methods are not. In a static synchronized method, the lock is the class itself, and it cannot be retrieved with ALOAD_0.

To get a reference to the class, I had to take the class FQN and insert a call to java/lang/Class.forName() instead of the ALOAD_0 instruction. So the following bytecodes should be added as the first statements of every static synchronized method.
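For a static synchronized method of a hypothetical class com.example.Foo (the FQN is known at instrumentation time; the names are illustrative):

```
LDC "com.example.Foo"
INVOKESTATIC java/lang/Class.forName(String)   // pushes the Class object, i.e. the monitor
INVOKESTATIC LockInterceptor.enteredLock(Object)
```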

Conclusion & Special Thanks

That's it: synchronized blocks and methods are intercepted, and it did not require a lot of code (at most 3 bytecodes inserted, and 5 classes to do it with ASM). The source is available on GitHub, if you want to have a look.

There are lots of subtle things going on, such as reentrant locks, but this is another topic.

Friday, April 26, 2013

Lately, I ran into a situation in my production environment: I was getting NullPointerExceptions that did not have any stacktrace. In this article, I will describe what happened and analyse the cause of such strange behaviour. This article is inspired by Andrey Vityuk's blog, where I found the first pieces of information I needed to fully understand the problem. So, thanks Andrey!

The problem

Until then, you were having a day just like any other, getting rid of bugs in a production system, one by one. Everything else being done, you open that old ticket that mentions a NullPointerException. Fine, business as usual, nothing critical, so you think "let's just grab yesterday's production logs!".
But then, you realise that all you have are lines like these:
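For illustration, such a line might look like this (a reconstruction): the exception class is printed, but not a single "at ..." frame follows.

```
2013-04-25 14:13:37,548 ERROR [http-bio-8080-exec-12] - Error during feature xyz
java.lang.NullPointerException
```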

The first thing you think is "Did the developer really write log.error("Error during feature xyz\n" + e.getClass())?". And also: "How am I supposed to fix this bug without knowing where the NPE is thrown?".
So you open the java file that contains feature xyz and there, you see something you did not expect:

log.error("Exception during feature xyz", e);

Your logging framework is used correctly, nothing to report here. Then why do you have no stacktrace?

Guess what? It's Hotspot's fault!

Fortunately, you find Andrey Vityuk's blog and discover that Hotspot can throw exceptions without stacktraces, for performance's sake.
Indeed, if Hotspot detects that a hot method throws exceptions, it can decide to drop the stacktrace so that the exception-throwing mechanism is faster. Some people might call this "a bug"; I won't.
We should always remember that Hotspot's job is to make our code run faster. Period.
When Hotspot optimises code, it rewrites that code: it merges methods, swaps instructions and compiles it to native code. Once this is done, nothing looks like your Java classes anymore. This comes at a very high price when exceptions start to mess around, since Hotspot has to do the exact opposite work to rebuild the consistent stacktrace you are expecting.
By definition, exceptions are exceptional (or at least they should be). In some code bases, we can find exceptions that are used to implement business cases; this can certainly be considered a code smell.

Exceptions without Stacktraces, behind the scenes
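A demo program along these lines exhibits the behaviour (a sketch of mine; the optimisation is applied by the JIT once the method is hot, so it may take many iterations to show up, and it may not trigger at all on some configurations):

```java
public class FastThrowDemo {
    static void throwNPE() {
        Object o = null;
        o.toString();   // always throws a NullPointerException
    }

    public static void main(String[] args) {
        for (int i = 0; i < 200_000; i++) {
            try {
                throwNPE();
            } catch (NullPointerException e) {
                // Once the method is hot, Hotspot may start throwing a cached,
                // stackless instance: same hashCode every time, zero stack frames.
                if (e.getStackTrace().length == 0) {
                    System.out.println(i + ": stackless NPE, hashCode=" + e.hashCode());
                    break;
                }
            }
        }
    }
}
```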

In the print statement, I included the hashCode of the exception that was thrown. This is intentional, since it shows that after the optimisation is applied, the same exception instance is thrown for each call to throwNPE.
Indeed, Hotspot optimises this code the following way:

Create an instance of the exception that should be thrown

Clear its stacktrace

Cache this exception

Throw this exception instead of creating a new one for future calls

So Hotspot does not only truncate the stacktrace, it also optimises instance creation. At first, I thought it was an optimisation of the way try/catch blocks work under heavy pressure, but that is not the case. Simply put, this optimisation turns a given exception into a singleton; the rest of the code stays identical. The main gain: Hotspot does not have to rebuild a new stacktrace from optimised code each time the exception is thrown.
This behaviour is triggered by the JVM flag OmitStackTraceInFastThrow, which is activated by default. We could get rid of it using -XX:-OmitStackTraceInFastThrow, but I would not do that.

Final thoughts

This exception was the result of a coding mistake and happened often enough for Hotspot to optimise it. It should be easy to reproduce on any machine, and should be fixed right away.
We should not reduce the performance of a whole system just so that a bug is easier to fix.
Ever faced this situation? Another war story to share? Let's discuss it in the comments!

Friday, April 12, 2013

Recently, I had to inspect some GC logs created on a recent Hotspot JVM using ParNew, CMS and iCMS. I did not find any tool that could report information on iCMS duty cycles, so I wrote one that allows me to spot iCMS problems in a matter of seconds.

The problem

The Concurrent Mark and Sweep GC is almost always used on websites that have a lot of traffic and cannot tolerate long pauses. Its incremental mode is often used on servers with limited hardware, so that small CMS collections, on small chunks of memory, are executed between minor collections.

However, I could not find any GC log parser that took advantage of the iCMS duty cycle information. In particular, how can we instantly see that something is wrong with our GC tuning?

Incremental CMS reports

iCMS reports can be seen in the ParNew GC log, in the "icms_dc=..." part of the message. For instance, take the following message:
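A reconstructed pair of lines with the expected shape (the heap sizes are made up; only the timestamps and the icms_dc value matter for the discussion):

```
18.862: [GC 18.862: [ParNew: 471872K->31224K(518464K), 0.0132100 secs] 584066K->143428K(2080128K) icms_dc=5 , 0.0133800 secs]
21.028: [GC 21.028: [ParNew: 503096K->29608K(518464K), 0.0120300 secs] 615300K->141832K(2080128K) icms_dc=5 , 0.0121900 secs]
```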

This log shows two minor collections, which occurred at 18.862s and 21.028s, plus information about an iCMS collection. The iCMS collection ran between the two ParNew collections and took 5% of the time between them. So, let's do some math.

21.028 - 18.862 = 2.166 secs
5% of 2.166 secs = 0.1083 secs

=> The iCMS collection took 108.3ms to complete, within the time frame of 2.166 seconds between the two ParNew collections. This collection took place concurrently, so our application only suffered a small overhead.

Some bad values

These log messages described a normal situation. However, what would you think about these?
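Reconstructed in the same way (made-up sizes, real shape), the problematic lines looked like:

```
928404.003: [GC 928404.003: [ParNew: 471872K->50601K(518464K), 0.0700000 secs] 856713K->435442K(2080128K) icms_dc=100 , 0.0701900 secs]
928405.145: [GC 928405.145: [ParNew: 522473K->51236K(518464K), 0.0698200 secs] 907314K->436180K(2080128K) icms_dc=100 , 0.0700100 secs]
```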

Yes, you read that right: iCMS duty cycles ran 100% of the time between ParNew collections.

ParNew being a stop-the-world collector, we can conclude that at 928404.003, a ParNew collection stopped the application for 0.07 seconds, after which an incremental CMS collection ran concurrently during the whole time the application was working, until it was stopped again at 928405.145 by a new ParNew collection.

So basically, during that time frame, our application was either paused or slowed down by an iCMS collection.

Now, the big question is: if we have a 3-day GC log, how can we instantly know whether we have a problem linked to iCMS duty cycles?

Cumulative frequency of iCMS

My idea was to create a graph of the cumulative frequency of icms_dc ratios over the execution of the application.

The graph has the following characteristics:

The X axis is the time spent in iCMS duty cycles, in percent

The Y axis is the cumulative frequency of such iCMS duty cycles
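My tool plots this graph, but the core of the computation can be sketched as a shell pipeline (gc.log is a placeholder for your log file):

```shell
grep -o 'icms_dc=[0-9]*' gc.log \
  | cut -d= -f2 \
  | sort -n \
  | uniq -c \
  | awk '{ total += $1; count[NR] = $1; value[NR] = $2 }
         END { for (i = 1; i <= NR; i++) {
                 cum += count[i]
                 printf "%.1f%% of cycles are at or below icms_dc=%d\n", 100 * cum / total, value[i]
               } }'
```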

My goal is to have the lowest possible iCMS duty cycles, so let's say I want 90% of my iCMS collections to take less than 10% of the time between two minor collections.

With such requirements, I can say that this GC log has no problem regarding iCMS:

This application, however, does. It suffers from long iCMS duty cycles, and measures should be taken to reduce the pressure on the garbage collectors. We can see that only 80% of iCMS cycles take less than 20% of the available time, which is not good. Even worse, look at the 90th, 95th and 99th percentiles. iCMS is definitely working too hard and causes too much overhead on the application.

Finally, I can merge my graphs to make sure that my tuning really improves the application's garbage collections.

Saturday, March 16, 2013

Last week, Ippon Technologies published several articles about Java performance tuning. Those articles were interesting, but not completely accurate. In this post, I will give my own conclusions and remarks.

But first, a disclaimer: dear Ippon readers, what you did represents a huge amount of work, and I don't want to diminish that. I like what you did; it is a very good job. However, I would have appreciated some more information, and I will try to describe what was missing here.

1 - The "Prove It!" mindset

I often repeat to anyone willing to hear it that they must prove what they think before taking any action. I call this the "Prove It!" mindset, but it is also well known as the "Diagnose before Cure" or "Measure, Don't Guess" principle.

I tend to be unhappy when solutions are applied without any measurement showing they would do any good. If someone claims to magically have the solution to a problem that has not been measured, it is like taking medicine without even knowing what disease you have: it may work, or not, or do nothing at all.

The better one proves that a decision is the right one, the more knowledge is spread. If no-one understands why you did something, discussions like the following will eventually happen: "Oh, that thing was made by an expert, DO NOT TOUCH IT! If it works, don't fix it!"

2 - Goals

Performance audits are like recursive functions: if you don't have a stop condition, you could go on forever.

When we write benchmarks, we want to answer a very precise question, like "Can my application handle 1200 requests per second?". Once you have your first answer (usually "No", otherwise no-one would have called you in the first place), you can work towards that goal.

Note: a goal can include constraints such as "... and I cannot have more than 32 GB of RAM on my server, which I cannot duplicate".

However, there was no precise goal in Ippon's articles. They were more a tooling showcase than a real-life situation. So, to be completely explicit: you *need* goals, otherwise you will never make your client smile.

3 - Testing conditions

In the first article, we can see that the default configuration (-Xms128m) produced an OutOfMemoryError. The conclusion was the following:

"The first results are as follow: the application quickly goes up to 285 requests per second (req/sec), then slows down and throws “java.lang.OutOfMemoryError: Java heap space” errors after nearly 700 users."

The problem with this sentence is that what was done was not a benchmark per se. A benchmark has to give stable results over time. Here, we simply cannot say that the application handles 285 requests per second.

What we saw is that the application, put under heavy load, coped with 285 requests per second at best. But since the application crashed, we cannot draw any conclusion from this; the test environment is just not ready. In the last article, the 1GB heap result is taken as the starting point, which confirms that this 285req/s was not a benchmark result.

I think what Julien Dubois meant, in the first article, was that without a proper environment, tests could not be made seriously. As a general rule, we want a stable testing environment that does not crash even if a test runs for several hours.

So this first article was related to performance tuning, but was not about optimisation. It was only about setting up the environment, which is a prerequisite for any rigorous benchmark.

Finally, the conclusion mentions "548 req/sec and 0,2% of HTTP errors". Does this mean that at some point, the application still crashes with an OOME? We can check by increasing the loop count of the thread group in JMeter; if the errors persist, the environment is still not ready.

4 - The right fix

I did not understand why the second article started with a focus on the memory consumed by the Dandelion library.

In the first screenshot, we can see that Dandelion's objects represent about ~130MB of the total heap, which has been set to 1GB, so we are focusing on a consumer of roughly 13% of the heap. The questions are: do we really have a memory problem? How did we prove it? Are those 0.20% HTTP errors caused by a bad memory consumption pattern?

Long story short: upgrading a library and refactoring the code to go stateless resulted in fewer requests per second (532req/s instead of 548req/s) with more HTTP errors (0.70% instead of 0.20%).

In that case, I tend to think that these actions were not the right fix. The causes of the HTTP errors were never clearly identified, so we might just have ended up with new errors, who knows?

I would have liked more justification for the actions that were taken. Did we have a memory consumption problem? How did we prove it? How did we prove that the actions taken were actually solutions, and not new problems added to the code?

5 - Audit conclusion

The conclusion of the last article mentions that going back to -Xmx128m produces about the same result, which is ~1200req/s.

"At the beginning of the tests, we had to increase our heap memory size to 1 Gb, and could only serve 548 req/sec, with some HTTP errors. After completing our audit, we are now back to 128 M, and can serve 1225 req/sec with no error at all. We expect those results to be even better on a real server, with many cores and threads, and where removing JVM locks will have a more significant impact."

I could not reproduce that.

On my machine, decreasing the heap to 128 MB makes the result drop to ~180req/s, which is just not acceptable. The following screenshot shows the memory consumption of the application: the old generation is completely filled, resulting in a huge number of full GCs (~2500) that represent 15 minutes spent waiting for the GC to finish.

Is this memory reduction a typo? How could an application behave identically with less memory, if its live set does not even fit in the heap? I would like to know more about it.

6 - More testing !

I checked out the source (revision fa1e1a8f86c574b765d65b26fd858e0d28ae81fc, 2013-03-15 09:36:26) and ran my own tests. Here are some more conclusions.

Note: I do not have the same testing environment. I chose to reintroduce HSQLDB and slightly modified the JMeter test so that 10 visits per pet per owner are added before the test takes place, because HSQLDB is flushed at each server restart. The req/s results will differ, since I do not run the tests on the same hardware, but that is not the point here.

With 1 GB, we are now down to 3 full GCs (total pause time ~2s), which is pretty good. My test finished with an average of 600req/s.

But can we achieve the same performance if we reduce the size of the heap? The answer is yes. By tuning the different memory pools, I managed to get the same number of requests per second with the following parameters: -Xmx512m -XX:NewRatio=1 -XX:SurvivorRatio=14

Even more tests, for science. What about switching to ParNew and CMS? With CMS, we can expect shorter pauses, since only ParNew pauses (~20ms) would stop the application. However, this comes at a price: a decrease in overall throughput, since CMS implies more overhead than the other GCs. This is easily verified: by adding CMS (-Xmx512m -XX:NewRatio=1 -XX:SurvivorRatio=14 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC), I got an average of ~537req/s, with 49 CMS cycles.

Finally, considering the memory consumption profile of the application, we can set the CMS initiating threshold to a higher value. This way, we get fewer concurrent executions of CMS, meaning more CPU cycles for our application. By adding -XX:CMSInitiatingPermOccupancyFraction=70, we end up with ~599req/s and only 1 CMS execution.

Wrap Up

Even though I would have preferred many more details about the choices that were made, this performance audit is pretty cool :)

I hope you enjoyed having more information and more test cases on petclinic. Do not hesitate to try any configuration that comes to your mind, as long as you measure whether it improves the situation, and as long as you are not afraid to revert your changes if you cannot prove any improvement. This is one of the most entertaining parts of working on the JVM.

The default GC causes long stop-the-world pauses, which could have become a problem if the heap had been set to more than 8 GB. The switch to CMS comes at a price, but with enough testing and tuning, we can achieve identical performance.

Any idea how to improve these articles? Some war stories you want to share? Let's talk about it in the comments!

Saturday, February 23, 2013

Last week, Olivier Croisier resurrected an old coding challenge on his blog, the Coder's Breakfast. As geeky as any challenge can be, it immediately got me thinking about potential solutions. This led to an amusing discussion with a Scala guy. This post is a follow-up to that discussion.

Problem

The discussion went like this:
Him: "Pff, it's easy, do a Stream"
Me: "Err, yeah, but you would have to generate all numbers from 0 to max"
Him: "Of course not! You filter! I can do it in one line right now"

I was quite surprised by his arguments, and wanted to explain that I was not looking for brevity but for optimization. So I told him that the Java version of our discussion would have been:
"Pff, easy, do an ArrayList
- [same answer]
- Not at all, just use Guava!"

My point was that a data structure cannot be the answer to a complex problem: we have to visualize the whole problem, not stop at the first implementation that comes to mind.

This is commonly known as a space/time tradeoff (or a CPU/memory tradeoff): if you write a very small program, such as a one-liner, then chances are this space gain will have a time cost (wasted CPU cycles). On the other hand, if you craft a program that does not waste any CPU cycle, then chances are it will take more space (either in memory or in lines of code).

Note that I know perfectly well that in Scala and Clojure, Streams and seqs are lazy, whereas ArrayList is not. It was just a troll.

Some algorithm

To me, this challenge is about "what you are going to do" rather than "how you are going to do it". Brute-force solutions differ in their implementation details, but follow the same algorithm:

- For each number n taken in [0; max]
--- If n fulfills a test, keep it

In those one-liner wars, we do not think about the algorithm, and we should feel bad about it, especially in a coding challenge. Such challenges are the only place where we can seek premature optimization, for the beauty of the exercise.

So let's go back to the roots of programming: algorithms.
The code we have to create can be split into the following components: Generator -> Filter -> Aggregator
We know that the upper bound will be a power of 10, and won't be above 10^10.

First implementation

The most basic implementation does this:

- Generate all numbers in [1; max] that fulfill a given test
- Iterate over every element starting at index 1 and compute the difference between the element and the previous one
- Keep the highest difference
- Count the number of results

However many solutions we end up keeping, this algorithm generates and examines numbers in O(max).
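As a concrete illustration (my own sketch; the "test" here is the challenge's predicate that all digits of a number are distinct, and the class and method names are mine), the brute-force pipeline looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class BruteForce {
    // The "test": all decimal digits of n are distinct.
    static boolean hasDistinctDigits(long n) {
        boolean[] seen = new boolean[10];
        do {
            int digit = (int) (n % 10);
            if (seen[digit]) return false;
            seen[digit] = true;
            n /= 10;
        } while (n > 0);
        return true;
    }

    public static void main(String[] args) {
        long max = 1_000_000L;
        List<Long> kept = new ArrayList<>();
        for (long n = 1; n <= max; n++) {       // O(max): every single number is examined
            if (hasDistinctDigits(n)) kept.add(n);
        }
        long maxGap = 0;
        for (int i = 1; i < kept.size(); i++) { // largest gap between consecutive solutions
            maxGap = Math.max(maxGap, kept.get(i) - kept.get(i - 1));
        }
        System.out.println(kept.size() + " solutions, max gap = " + maxGap);
    }
}
```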

Second implementation

But we can do better: let's generate every number directly, by picking digits from a list of allowed digits.

To calculate every non-repeating-digit number of length x:

- For each digit d taken in [1; 9]
--- Find all numbers of length x that start with d and are composed of digits from [0; d[ and ]d; 9]

To find all numbers of length x that start with d and are composed of someDigits:

- For each digit i taken in someDigits
--- Add d at position x of all numbers of length x-1 that start with i and are composed of someDigits without i

The main goal of this algorithm is to have no "Filter" component. We know the exact range of acceptable digits, so we can generate every combination by restricting that range to a smaller subset. With recursion, the program may be harder to reason about, but it is also simpler to test and faster to run.
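A sketch of such a generator in Java (my implementation choice: a bitmask tracks the digits already used, playing the role of "someDigits without i"; the traversal order differs from the description above, but it produces the same set, with no Filter step):

```java
import java.util.ArrayList;
import java.util.List;

public class DistinctDigitGenerator {
    // Generate every number of up to maxLength digits whose digits are all distinct.
    static List<Long> generate(int maxLength) {
        List<Long> result = new ArrayList<>();
        for (int d = 1; d <= 9; d++) {                 // a number never starts with 0
            extend(d, 1 << d, 1, maxLength, result);
        }
        return result;
    }

    // prefix: the number built so far; usedMask: bitmask of digits already used
    static void extend(long prefix, int usedMask, int length, int maxLength, List<Long> out) {
        out.add(prefix);                               // every prefix is itself a solution
        if (length == maxLength) return;
        for (int d = 0; d <= 9; d++) {
            if ((usedMask & (1 << d)) == 0) {          // restrict to unused digits: no Filter needed
                extend(prefix * 10 + d, usedMask | (1 << d), length + 1, maxLength, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(generate(3).size() + " numbers of up to 3 distinct digits");
    }
}
```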

Wrap-Up

My Java implementation is way longer than a Scala one-liner; however, it is also faster from an algorithmic standpoint.

That does not mean it is a better implementation.

In a real project, I would probably prefer writing a simple (inefficient) solution in 10 minutes rather than a very optimised one that would take hours or even days. But keep this space vs. time tradeoff in mind: there is no such thing as a perfect solution, there are only tradeoffs, and we must know them to make better decisions.

Note that I won't compare the wall-clock performance of those solutions here, since it would require writing microbenchmarks, which is something one should not do.

Final thoughts: thanks Olivier for this challenge! It was a fun exercise =]