Friday, May 29, 2009

Update June 4, 2009: The original post favored Dictionary as the fastest data structure for both Add and Contains. After some debate and some re-runs of my test code, I found that result to be wrong: HashSet is faster. The results were probably affected by the workload on my computer. I ran one List test, then a Dictionary test, and finally a HashSet test; I should have run them multiple times under the same workload (i.e. with no other programs or processes running). Anyway, on to the article.

Still working with high-volume real-time data feeds, I'm struggling to understand where the bottleneck in my code is. It's not the database, and it's not the network; it's somewhere in the code. That doesn't narrow it down much.

So I decided to have a look at the different data structures that could serve my needs. Right now I'm using a List to keep track of my orders, and each time a new order arrives, I check that list for the given ID.

So I started to wonder: maybe there is a performance issue with the List data structure. I made a small test with the following code:
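The original test code did not survive here, so below is a minimal sketch of what a List benchmark of this shape could look like. Only the variable name noOfIterations comes from the text; the structure (a timed Add loop followed by a timed Contains loop over existing keys) is an assumption based on the description.

```csharp
// Hypothetical reconstruction of the List<T> test described in the post.
using System;
using System.Collections.Generic;
using System.Diagnostics;

class ListBenchmark
{
    static void Main()
    {
        int noOfIterations = 1000;    // the variable named in the post
        var list = new List<int>();

        var addWatch = Stopwatch.StartNew();
        for (int i = 0; i < noOfIterations; i++)
            list.Add(i);              // Add appends at the end; Insert would shift elements
        addWatch.Stop();

        var containsWatch = Stopwatch.StartNew();
        for (int i = 0; i < noOfIterations; i++)
            list.Contains(i);         // every key exists, as in the post's tests; O(n) scan
        containsWatch.Stop();

        Console.WriteLine("List Add: {0} ms, Contains: {1} ms",
            addWatch.Elapsed.TotalMilliseconds,
            containsWatch.Elapsed.TotalMilliseconds);
    }
}
```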

I created similar test code for Dictionary and HashSet. In all of the tests I used a loop where the given key always existed in the data structure. I used Add instead of Insert, because tests I've seen online show that Add is much faster. For example, this one.
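Since the "similar test code" itself isn't shown, here is a hedged sketch of what the Dictionary and HashSet counterparts might look like under the same assumptions: timed Add loops followed by timed lookup loops over keys that always exist.

```csharp
// Hypothetical Dictionary/HashSet counterparts to the List test (not the author's code).
using System;
using System.Collections.Generic;
using System.Diagnostics;

class HashBenchmark
{
    static void Main()
    {
        int noOfIterations = 1000;
        var dict = new Dictionary<int, int>();
        var set = new HashSet<int>();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < noOfIterations; i++)
            dict.Add(i, i);           // Dictionary stores a key AND a value
        Console.WriteLine("Dictionary Add: {0} ms", sw.Elapsed.TotalMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < noOfIterations; i++)
            set.Add(i);               // HashSet stores the value only
        Console.WriteLine("HashSet Add: {0} ms", sw.Elapsed.TotalMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < noOfIterations; i++)
            dict.ContainsKey(i);      // hash lookup, O(1) on average
        Console.WriteLine("Dictionary ContainsKey: {0} ms", sw.Elapsed.TotalMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < noOfIterations; i++)
            set.Contains(i);          // hash lookup, O(1) on average
        Console.WriteLine("HashSet Contains: {0} ms", sw.Elapsed.TotalMilliseconds);
    }
}
```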

First run, I used 1,000 for the noOfIterations variable and ran it three times.

The horizontal scale here is in milliseconds, so things are reasonably fast; at 0.2 ms per operation you could do about 5,000 adds per second. As you can probably see without checking the numbers, Dictionary is fastest. List is slower for lookup, but HashSet surprises a little with its slow Add. So what happens when we go to 100,000 items?

OK, List is still a lot slower for lookup. Add seems to compare OK, though. Let's see how things look if we remove the lookup time from the chart: This is wrong, see "The aftermath"

Now that's a result! It turns out List is your winner if fast adding is all you care about. If you want to look up your values later, though, Dictionary is your winner. I hope this saves some time for other developers in the same situation. Feel free to comment if you find different results, or a problem with my test!

The aftermath

Because of the controversy over HashSet being slower than Dictionary, despite the HashSet storing only one value where the Dictionary stores two, I re-ran the test on my computer and on a colleague's. Instead of running the tests one by one, I ran the HashSet test three times and the Dictionary test three times, and took the averages. As the results below show, HashSet is faster than Dictionary for both the Add and Contains methods.

2) A Dictionary and a HashSet are both hash tables, of course. Insert and retrieve are both O(1), but the constant for inserting into a hash table is much larger than for a list, due to having to compute the correct bucket, resize the table, etc.

3) "I created similar test code for Dictionary and HashSet." Don't say you used similar code; post it! I just tried comparing adding 1 million items using Dictionary and HashSet, and HashSet was FASTER than Dictionary doing inserts. There is no WAY that HashSet should be that much slower. Post your code; maybe you have some data that somehow causes it to be slower.

I expect that ANY graduate from ANY CS-related program would know which data structure to use in any scenario.

To be honest, if someone came to an interview and didn't know the asymptotic time bounds for basic operations on a list, a balanced tree, a hash set, etc., and didn't know which structure is best suited for which scenario, it would be a disaster.

I expect that ANY graduate from ANY CS-related program would know that if I write a class, call it a List, and give it an API that looks like a list, then inside it can be anything! If something looks like a duck but walks like a dog, it cannot be a duck. You never know what is behind the facade, mr. smart.

Tests like this should always be taken "as is". You could write a book on comparing these data structures properly:
- How is the List initialized? A .NET List uses an array internally, so if you initialize it with a capacity of 1,000,000 it mitigates the whole "re-create the array" cost as you add objects (I forget the exact algorithm, but for 1,000,000 objects the resize would happen more than several times).
- Thread safety: how would applying read/write locks affect the collection?
- When you say "lookup", what does that mean? I can write you a query with PLINQ (.NET 4.0) that uses multiple processor cores and could come back factors faster.
- Speaking of LINQ, you can also override the Where implementation to use an internal hash. So, in effect, you can use a list for adding/removing and hashing for looking items up.

Once you start getting into that kind of performance tuning, it gets blurry as to which data structure is "better".

Thanks for all your comments; I'll definitely be posting the rest of the test code. I agree with what you say: it does seem strange that HashSet would be outperformed by Dictionary. Hence this post...

@Bart Czernicki: Thanks for your input! I think PLINQ is very interesting, and we've tried the Parallel FX CTP before. To be honest, I didn't think of including it in this test. I guess this is more of a vanilla .NET 3.5 test.

The LINQ override idea looks interesting, maybe I'll find time to look into that later.

In retrospect, the whole test should have been better planned, and according to some, a List should not even be included. But it is, and at least the material has potential; just look at the comments.
