Monday, October 18, 2010

Motivation:
As many of you might know, Java does not provide microsecond granularity. When using Cassandra you need to make sure that your clients have their clocks in sync; that is a requirement. In other non-relational databases, like Voldemort, this is not necessary (they use vector clocks).
If we settle for millisecond resolution (what Java provides), then whenever there are two inserts (or one insert and one delete) within the same millisecond (a typical case in Lucandra when updating a row),
the second operation gets discarded because it happens at the same time as the previous request. This is why we had to come up with a way to give every thread that requests a time a unique timestamp.

Goal:

Evaluate the performance of the three Hector time resolution algorithms with microsecond accuracy.

-The fully synchronized algorithm guarantees unique timestamps in microseconds, both per thread and across threads. [1]
-The second one uses an AtomicLong to accomplish the same goal as the first, but presents a race condition, so the result is not guaranteed across threads. I contributed this code; it is in Hector branch 6 but not in the main branch. [2]
-The third one does not guarantee that two consecutive calls from the same thread obtain different timestamps (risk of missing operations, since Cassandra discards them when they carry the same timestamp). Currently in the Hector main branch. [3]
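To make the first two approaches concrete, here is a minimal sketch of each idea. The class and method names are mine, not Hector's actual API; note that a strict compare-and-set loop like the one below is itself race-free, whereas the actual branch-6 code differs and, as described above, has a race.

```java
import java.util.concurrent.atomic.AtomicLong;

// [1] Fully synchronized: unique microsecond timestamps per thread and
// across threads within one JVM. (Hypothetical sketch, not Hector's code.)
class SynchronizedMicroClock {
    private long last = 0;

    public synchronized long createTimestamp() {
        long micros = System.currentTimeMillis() * 1000; // millis -> pseudo-micros
        if (micros <= last) {
            micros = last + 1; // same-millisecond collision: hand out the next microsecond
        }
        last = micros;
        return micros;
    }
}

// [2] The AtomicLong idea: avoid the monitor by retrying a compare-and-set
// until we publish a value strictly greater than the previous one.
class AtomicMicroClock {
    private final AtomicLong last = new AtomicLong(0);

    public long createTimestamp() {
        while (true) {
            long prev = last.get();
            long micros = Math.max(System.currentTimeMillis() * 1000, prev + 1);
            if (last.compareAndSet(prev, micros)) {
                return micros;
            }
        }
    }
}
```

Both sketches fall back to "last value + 1" on a collision, which is what makes two calls inside the same millisecond still yield distinct timestamps.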

Conclusion:

The fully synchronized timestamp resolution algorithm presents a decent TPS, almost indistinguishable from the non-synchronized one, while providing a “bullet-proof” mechanism that guarantees unique time resolution within the same thread and across threads as well.

Before the benchmark, I let the test run 5k operations with no time calculation as a warm-up. Without it, the first set of tests executed always shows slightly lower performance, so I considered it fair to include the warm-up first. The 100-thread test has no warm-up.

The overall operation count is 5,000,000, regardless of the number of threads. This way I can see how fast or slow each algorithm works.
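A harness for this setup could look roughly like the following. This is my own sketch mirroring the methodology described above (a 5k-call warm-up, then a fixed total of operations split evenly across N threads); the names are mine, not the actual benchmark code.

```java
import java.util.concurrent.CountDownLatch;

public class TimestampBenchmark {
    // Stand-in for any of the three createTimestamp implementations.
    public interface TimestampSource { long createTimestamp(); }

    public static double measureTps(final TimestampSource src, int threads, long totalOps) {
        // Warm-up: 5k calls, excluded from the measurement.
        for (int i = 0; i < 5000; i++) {
            src.createTimestamp();
        }
        final long opsPerThread = totalOps / threads;
        final CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            new Thread(new Runnable() {
                public void run() {
                    for (long i = 0; i < opsPerThread; i++) {
                        src.createTimestamp();
                    }
                    done.countDown();
                }
            }).start();
        }
        try {
            done.await(); // wait until every worker finishes its share
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (opsPerThread * threads) / seconds; // throughput in ops/sec (TPS)
    }
}
```

Keeping the total operation count fixed while varying the thread count, as the post does, lets the TPS number reflect contention on the timestamp source rather than differences in workload size.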

How to interpret the results:

-Look at the attached files. There are 2 files per benchmark (one shows the TPS and the average, the other shows the CPU usage).
-For example, 100Threads-15secs-50kops means:

o 100 threads

o The graphic has timeframes of 15 secs

o Each thread performs 50K operations

-There are three colors indicating the performance of the three algorithms (blue, red, and green):

BLUE: the createTimestamp method is fully synchronized.

GREEN: partially synchronized using an AtomicLong, with possible race conditions.

RED: not synchronized at all, with a high risk of getting the same timestamp within and across threads.

Case 1: CPU and memory. Notice how the CPU usage stays constant during the execution of the three test cases (a downward spike marks the end of one test and the beginning of the next).

Case 1b: 8 threads - 1000K operations with warm-up (5K operations per thread, non-synchronized first). Note: including a warm-up and running the non-synchronized algorithm first seems to make a difference. The lowest average is, of course, for the non-synchronized algorithm, the highest is for the fully synchronized algorithm, and, as expected, the AtomicLong-based algorithm falls in between (lower values are better).

Case 1b: 8 threads - 1000K operations with warm-up (5K operations per thread, non-synchronized first). Note: in this second graph (TPS), the three algorithms seem to perform just as well (higher values are better).

No, I don't have it in my use case currently. But if, say, you load balance flow across a VIP and you have someone hitting you frequently from an outside service, it's possible; that sounds rare, but bots can cause it pretty easily. I like the solution, though, for a single process or one where your writers are logically partitioned.

@Ray it is only valid per process, of course. These algorithms guarantee uniqueness only per process/JVM.

If you expect concurrent access to the same columns across machines and you care about it, you have to either tolerate those inconsistencies or re-design the schema.

The examples I posted help you deal with, for instance, Lucandra operations where you delete + insert (update) columns. Normally that happens so fast that you might miss deletes, as the delete and the insert happen at the same relative time.
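To make that concrete, here is a minimal illustration (my own, not Lucandra or Hector code) of why millisecond timestamps are not enough for a delete followed immediately by an insert: both calls usually land in the same millisecond, and when a delete and an insert tie on timestamp, Cassandra resolves the conflict in favor of the delete, so the update is lost.

```java
public class MillisCollision {
    // Stamp a delete and an immediate re-insert with millisecond time,
    // the way a naive Lucandra-style update would.
    public static long[] deleteThenInsert() {
        long deleteTs = System.currentTimeMillis(); // timestamp of the delete
        long insertTs = System.currentTimeMillis(); // timestamp of the re-insert
        return new long[] { deleteTs, insertTs };
    }
}
```

On most machines the two values come back identical, which is exactly the collision the microsecond algorithms above are designed to avoid.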