Huge portions of the Web vulnerable to hashing denial-of-service attack

A flaw common to most popular Web programming languages can be used to launch denial-of-service attacks by exploiting hash tables.

Researchers Alexander Klink and Julian Wälde have shown how a flaw that is common to most popular Web programming languages can be used to launch denial-of-service attacks by exploiting hash tables. Announced publicly on Wednesday at the Chaos Communication Congress event in Germany, the flaw affects a long list of technologies, including PHP, ASP.NET, Java, Python, Ruby, Apache Tomcat, Apache Geronimo, Jetty, and Glassfish, as well as Google's open source JavaScript engine V8. The vendors and developers behind these technologies are working to close the vulnerability, with Microsoft warning of "imminent public release of exploit code" for what is known as a hash collision attack.

Klink and Wälde showed that "PHP 5, Java, ASP.NET as well as V8 are fully vulnerable to this issue and PHP 4, Python and Ruby are partially vulnerable, depending on version or whether the server running the code is a 32-bit or 64-bit machine."

"This attack is mostly independent of the underlying Web application and just relies on a common fact of how Web application servers typically work," the team wrote, noting that such attacks would force Web application servers "to use 99% of CPU for several minutes to hours for a single HTTP request."

"Hash tables are a commonly used data structure in most programming languages," they explained. "Web application servers or platforms commonly parse attacker-controlled POST form data into hash tables automatically, so that they can be accessed by application developers. If the language does not provide a randomized hash function or the application server does not recognize attacks using multi-collisions, an attacker can degenerate the hash table by sending lots of colliding keys. The algorithmic complexity of inserting n elements into the table then goes to O(n**2), making it possible to exhaust hours of CPU time using a single HTTP request."

Prior to going public, Klink and Wälde contacted vendors and developer groups such as PHP, Oracle, Python, Ruby, Google, and Microsoft. The researchers noted that the Ruby security team and Tomcat have already released fixes, and that "Oracle has decided there is nothing that needs to be fixed within Java itself, but will release an updated version of Glassfish in a future CPU (critical patch update)."

On Wednesday, Microsoft published its own security advisory, recommending that all ASP.NET website admins evaluate their risk and implement a workaround that Microsoft has posted while it works on a patch to be released in a future security update.

All versions of the .NET Framework are affected by the vulnerability, according to Microsoft. "The vulnerability exists due to the way that ASP.NET processes values in an ASP.NET form post causing a hash collision," Microsoft wrote in its advisory. "It is possible for an attacker to send a small number of specially crafted posts to an ASP.NET server, causing performance to degrade significantly enough to cause a denial of service condition. Microsoft is aware of detailed information available publicly that could be used to exploit this vulnerability but is not aware of any active attacks."

The potential for hash collision attacks is garnering quite a bit of discussion on Twitter under the hashtag #hashdos, and video of the researchers' presentation is posted on YouTube.

Not hating on ars or any of the other tech websites, but I have to wonder how many of the attacks are actually a result of their discovery and postings on sites like these.

This is what vulnerability researchers do: try to compromise systems (without harming anyone) so that the companies that build products can make them more secure, and theoretically make everyone safer. At Ars, we're journalists, so we like writing about it ;-)

LOIC? According to the articles I read on Ars Technica, that thing didn't even do the most basic thing it was advertised as doing: obfuscate the attacks! Hence why teenagers were getting picked up by FBI and Scotland Yard/ Special Branch, several months ago. Why would anyone trust Version 2 if Version 1 was so poor?

Everyone excuse me for not getting this... Why are hash functions required for processing POST data efficiently? I don't understand why it might be necessary to index the POST data, either as a whole or for individual variables. Is it just so that the POST data can be stored in case the application developer needs to access it, so that it can be retrieved quickly using a fixed-width indexed hash code? Why in the world did the developers of these languages base that hash purely on the POST data itself? What's wrong with just using a completely random hash, or a sequential numeric code? Surely that's good enough for the purpose I am imagining?

As for not using a "nonce", and not randomising the hashes; that's just poor. It's shocking that this problem exists almost across the whole spectrum of languages... Is the industry really this bad at security?

Not hating on ars or any of the other tech websites, but I have to wonder how many of the attacks are actually a result of their discovery and postings on sites like these.

So I've done some security research before using some open source stuff on my own projects. I found problems with an encryption implementation and contacted the authors (including proof of concept). It's never been fixed, and that was over a year ago. During that year, someone (likely much smarter than I) with nefarious intent could well have discovered that same problem and used it for evil. Had I publicized the problem, there would have been additional pressure on the authors to correct the issue, and those affected would have had the knowledge required to take steps to protect themselves through alternate means.

LOIC? According to the articles I read on Ars Technica, that thing didn't even do the most basic thing it was advertised as doing: obfuscate the attacks! Hence why teenagers were getting picked up by FBI and Scotland Yard/ Special Branch, several months ago. Why would anyone trust Version 2 if Version 1 was so poor?

Correct me if I'm wrong, but wasn't version 1 originally intended as a security testing tool and wouldn't that be the reason identity obfuscation wasn't built in?

Not hating on ars or any of the other tech websites, but I have to wonder how many of the attacks are actually a result of their discovery and postings on sites like these.

So I've done some security research before using some open source stuff on my own projects. I found problems with an encryption implementation and contacted the authors (including proof of concept). It's never been fixed, and that was over a year ago. During that year, someone (likely much smarter than I) with nefarious intent could well have discovered that same problem and used it for evil. Had I publicized the problem, there would have been additional pressure on the authors to correct the issue, and those affected would have had the knowledge required to take steps to protect themselves through alternate means.

Yea, but this one sucks because now a million-plus idiots can bring down a web server with one line of HTML...

LOIC? According to the articles I read on Ars Technica, that thing didn't even do the most basic thing it was advertised as doing: obfuscate the attacks! Hence why teenagers were getting picked up by FBI and Scotland Yard/ Special Branch, several months ago. Why would anyone trust Version 2 if Version 1 was so poor?

Correct me if I'm wrong, but wasn't version 1 originally intended as a security testing tool and wouldn't that be the reason identity obfuscation wasn't built in?

Correct... and anyway, LOIC would be completely unnecessary here. A browser extension that crafts the kind of POST data likely to cause colliding hash keys would be all it takes to tie up a server with one single request (no distribution necessary), with the added benefit of working easily over Tor.

Why are hash functions required for processing POST data efficiently? I don't understand why it might be necessary to index the POST data, either as a whole or for individual variables.

The term is overloaded. AFAICT from the article, this is referring to parsing "foo[bar]=42&foo[baz]=11" into the equivalent of $_POST['foo'] = array('bar' => 42, 'baz' => 11); -- the "hash" here is the hash / dict / associative array data type that the parsed data is placed in.

PHP has a `max_input_time` setting which could mitigate this, but it's set at 60 seconds by default and MaxClients*60 is a lot of CPU time to burn.
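
For the curious, PHP exposes that same parsing step directly as parse_str(), and the relevant limits can be inspected at runtime. A small sketch (the max_input_vars line assumes PHP 5.3.9 or later, where that setting was added specifically as a hash-collision mitigation):

<?php
// parse_str() is the same routine PHP uses to build $_POST: it parses an
// urlencoded body into a hash-table-backed array, nested keys included.
parse_str('foo[bar]=42&foo[baz]=11', $result);
var_dump($result['foo']['bar']);      // string(2) "42"

// max_input_time can only be set in php.ini/.htaccess; shown here read-only.
var_dump(ini_get('max_input_time'));  // "60" by default under a web SAPI
var_dump(ini_get('max_input_vars'));  // "1000" on PHP >= 5.3.9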

"Oracle has decided there is nothing that needs to be fixed within Java itself, but will release an updated version of Glassfish in a future CPU."

Is this a typo, someone misspeaking, or Oracle overloading a standard term?

Critical Patch Update.

I should have included that in the story; I will update.

Bad terminology aside, Oracle is actually probably right in their assessment for once. Java itself isn't flawed in this case, but rather all the implementations of Java EE (or more specifically, the implementations of javax.servlet).

So it's more precise to say that Tomcat is broken.

This is one situation where progress, isn't. CGI in C++ isn't susceptible to this issue unless the developer has built a system on top of it that introduces the issue.

Why are hash functions required for processing POST data efficiently? I don't understand why it might be necessary to index the POST data, either as a whole or for individual variables.

The term is overloaded. AFAICT from the article, this is referring to parsing "foo[bar]=42&foo[baz]=11" into the equivalent of $_POST['foo'] = array('bar' => 42, 'baz' => 11); -- the "hash" here is the hash / dict / associative array data type that the parsed data is placed in.

PHP has a `max_input_time` setting which could mitigate this, but it's set at 60 seconds by default and MaxClients*60 is a lot of CPU time to burn.

So I have a question about attacks against hosted sites on major hosting providers. Most of the cheap hosting services are going to be VMs that are sharing a physical server with many others and, hopefully, using HA to move from one server to another in the event of hardware failure or other lack of physical resources.

Now let's suppose that someone does make a script-kiddie-friendly tool to exploit this en masse. If it takes just 1 request to tie up multiple hours of CPU time, then it is not unreasonable that 8-ish requests could tie up a beefy VM host enough that all the sites will dump to another... rinse, repeat until all of the physical servers in the cluster are maxed and all of the sites in the cluster, whether they were directly attacked or not, go offline.

Is this scenario reasonable? I don't have much experience with major hosting providers so I don't know what defenses are in place.

Not hating on ars or any of the other tech websites, but I have to wonder how many of the attacks are actually a result of their discovery and postings on sites like these.

So I've done some security research before using some open source stuff on my own projects. I found problems with an encryption implementation and contacted the authors (including proof of concept). It's never been fixed, and that was over a year ago. During that year, someone (likely much smarter than I) with nefarious intent could well have discovered that same problem and used it for evil. Had I publicized the problem, there would have been additional pressure on the authors to correct the issue, and those affected would have had the knowledge required to take steps to protect themselves through alternate means.

Yea, but this one sucks because now a million-plus idiots can bring down a web server with one line of HTML...

All the more reason to get it patched, eh. Next time, it won't be someone so forthcoming with that information. Remember, a LOT of vulnerabilities never get mentioned, as those who find them know it is better not to kill the goose that laid the golden egg, especially if you... well, that would be saying.

Not hating on ars or any of the other tech websites, but I have to wonder how many of the attacks are actually a result of their discovery and postings on sites like these.

So I've done some security research before using some open source stuff on my own projects. I found problems with an encryption implementation and contacted the authors (including proof of concept). It's never been fixed, and that was over a year ago. During that year, someone (likely much smarter than I) with nefarious intent could well have discovered that same problem and used it for evil. Had I publicized the problem, there would have been additional pressure on the authors to correct the issue, and those affected would have had the knowledge required to take steps to protect themselves through alternate means.

THIS.

I found a vulnerability in W2K and the original Win XP. It gave me a SYSTEM-level command prompt at the login screen. I had but to press a hotkey on the keyboard and cmd.exe would pop up on top of the login screen. It didn't matter if no one was logged in, or if someone was logged in and the screen was locked. I had no one to report this to, so I kept quiet. The vulnerability was finally fixed in XP SP2... about 3 years after I found it.

To an indie researcher, finding a vulnerability is just the beginning of the battle to have it fixed. Reaching the correct people at the vendor in question is hard. Their front-line folks (sales and tech support) won't know what you are talking about, and won't even know what to do with you.

An outsider needs to know the right people with the access to the project managers and developers.

So it isn't that a giant dev house is ignoring the issue. It's more a case of the company being so large, and the layers of support techs on one end and management on the other so thick, that word never reaches the project managers and developers.

Yeah, the web basically runs on hash tables. HTTP is key-value pairs. XML is key-value pairs. JSON is key-value pairs. Most template languages are driven by key-value pairs. JavaScript is among the most table-oriented programming languages there are (next to Lua). By and large, the measure of a web framework is its implementation and utilization of tables. Sure, text manipulation is also important, but you really want to make sure that your base table-like object class is efficient, reliable, and *secure*.

It's a shame I can't provide hints to PHP to indicate which POST variables are permitted for a given directory / script... And program my web servers to ignore all requests that include variables that are not explicitly permitted, on the presumption that they are malicious requests... Would be a nice locked-down config for security.
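
Nothing built into PHP does this, but a userland approximation of that whitelist idea is easy to sketch. The catch is that PHP has already parsed the request into $_GET/$_POST (and burned the CPU) before any script code runs, so this documents intent more than it defends the server; the field names below are hypothetical:

<?php
// Reject any request carrying variables this script does not expect.
$allowed = array('mode', 'u');   // hypothetical: the only fields we accept
foreach (array_keys($_GET + $_POST) as $name) {
    if (!in_array($name, $allowed, true)) {
        header('HTTP/1.1 400 Bad Request');
        exit('unexpected parameter');
    }
}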

I'm surprised it has taken so long for this to be investigated in the server-side web scripting languages.

Quicksort also has pathological O(n**2) cases... Maybe quicksort-based DoS will be the new thing for 2012?

Quote:

As for not using a "nonce", and not randomising the hashes; that's just poor. It's shocking that this problem exists almost across the whole spectrum of languages... Is the industry really this bad at security?

Yes, the industry really is this bad at security. From what I've seen, the majority of developers nowadays don't even know what a hash table is: they assume that key-value pair mappings are either magical black boxes or implemented as linear traversal. I think people developing web frameworks are more likely to understand how this stuff works, but the lack of well-publicised disasters with pathological hash tables has made it easy to ignore.

(Edit) Also, it is worth noting that there may not be a simple drop-in solution for this to cover all cases. Web frameworks can change request variable lookups to use a keyed hash or a non-hashtable algorithm (red-black trees, skiplists, etc.) but that doesn't help the rest of the web app, which will almost certainly be using hashtables for various things. PHP doesn't (as far as I know) give developers access to the underlying hashcode functions and can probably be fixed quite easily, but Java and .NET both allow access to hashcode values (which the app can potentially use and require to be consistent) and also have the ability to use keys that are of custom types with their own (probably collidable) hashcode functions.
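
A concrete illustration of why a simple drop-in fix is elusive, at least for a times-33 hash of the kind PHP uses: the colliding pair "Ez"/"FY" contributes an identical value at every position (69*33 + 122 == 70*33 + 89 == 2399), so merely randomizing the hash's starting state changes nothing. A PHP sketch (64-bit build assumed for the integer arithmetic):

<?php
// A seeded times-33 hash still collides on differential pairs: "Ez" and
// "FY" hash identically for EVERY seed, so naive randomization fails.
function times33($key, $seed) {
    $h = $seed;
    for ($i = 0, $n = strlen($key); $i < $n; $i++) {
        $h = (($h * 33) + ord($key[$i])) & 0xFFFFFFFF; // keep 32 bits
    }
    return $h;
}

$seed = mt_rand();
var_dump(times33('Ez', $seed) === times33('FY', $seed)); // bool(true), any seed

A real fix needs a hash whose collisions depend on a secret key, or a different data structure altogether.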

So both Tomcat and Geronimo are affected but what about regular old Apache webserver?

If you are using something like ModSecurity running on Apache as a sort of firewall for HTTP requests, is it going to catch these attacks and prevent problems, get crashed by them and block access to your app, or just pass the bad request along where it can then crash your web app running on ASP.NET, Geronimo, Rails, or whatever?

Given the number of folks that rely on ModSecurity, this would seem to be a pretty important question to have an answer for.

Quicksort also has pathological O(n**2) cases... Maybe quicksort-based DoS will be the new thing for 2012?

I doubt it. I believe sorting routines are typically only run when the script explicitly calls them, unlike the hash tables, which get built automatically from request data.

Yes, and the script is probably calling a library quicksort. But even in C there is some protection: if quicksort heads into its pathological O(n**2) case, the system can stop the quicksort and switch to some other sorting algorithm.

So, I doubt a quicksort-based DoS will ever be possible. Of course, that's assuming the programmer isn't an idiot.

But even in C there is some protection: if quicksort heads into its pathological O(n**2) case, the system can stop the quicksort and switch to some other sorting algorithm.

That's normally called "introsort". I just had a quick look at the API docs for PHP and .NET, and the one sorting function I checked in each mentioned "quicksort" by name; no mention of falling back to another algorithm, and no mention of introsort. If I remember right, Java has always used merge sort for pretty much everything, so it should be fine.

So... even if there were no attacker, if our program written in .NET or PHP hits the worst case for the quicksort algorithm, it will eat all the CPU time whenever there are many entries to sort. I know there are methods to minimize that risk (shuffling the entries before sorting, etc.), but one way or another... it can take the server down? What a simple and fragile solution.
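
A toy PHP illustration of the worst case being described, using a deliberately naive first-element pivot (library sorts choose pivots far more carefully, so treat this purely as a sketch of the failure mode):

<?php
// Naive quicksort, first element as pivot: already-sorted input makes every
// partition maximally lopsided, so the work degrades to O(n**2).
function naive_quicksort(array $a) {
    if (count($a) <= 1) return $a;
    $pivot = array_shift($a);
    $lo = $hi = array();
    foreach ($a as $x) {
        if ($x < $pivot) { $lo[] = $x; } else { $hi[] = $x; }
    }
    return array_merge(naive_quicksort($lo), array($pivot), naive_quicksort($hi));
}

$t = microtime(true);
naive_quicksort(range(1, 2000));  // worst case: ~2,000,000 comparisons
printf("sorted input:   %.3f s\n", microtime(true) - $t);

$mixed = range(1, 2000);
shuffle($mixed);                  // the "shuffle before sorting" mitigation
$t = microtime(true);
naive_quicksort($mixed);
printf("shuffled input: %.3f s\n", microtime(true) - $t);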

Everyone excuse me for not getting this... Why are hash functions required for processing POST data efficiently? I don't understand why it might be necessary to index the POST data, either as a whole or for individual variables. Is it just so that the POST data can be stored in case the application developer needs to access it, so that it can be retrieved quickly using a fixed-width indexed hash code?

When you submit an HTTP request to a webserver, you can usually do it in one of two ways: GET or POST. Both ways let you send info to the server. With GET, the data you send is part of the URL itself, where you can see it. For example, when you click on your username here at Ars, it takes you to your user profile. The URL for doing that is arstechnica.com/civis/memberlist.php?mode=viewprofile&u=271803

The part before the question mark ("?") is called the path. It tells the Apache server that your request should be processed by the script called "memberlist.php".

The part after the question mark is called the query string. This part is passed to the PHP script. The first thing the script does is parse the query string and split it up into the individual parts. The ampersand sign ("&") is the delimiter between parts. So in the above example, there are actually two variables getting passed:

(1) a variable called "mode", which has a value of "viewprofile"
(2) a variable called "u", which has a value of "271803"

The way PHP (and nearly every other language) stores this info is by putting it in a Hash Table. Other terms for roughly the same thing are "associative array", "map", "dictionary", etc. In a minute, I will tell you *WHY* they use Hash Tables to store this info.

This hash table is made available to the PHP programmer by means of what PHP calls a "super global". That means that there is a variable called $_GET that can be used by the programmer to get that info. For example, the programmer can easily retrieve the info like this:
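
<?php
// inside memberlist.php:
$mode = $_GET['mode'];  // "viewprofile"
$u    = $_GET['u'];     // "271803"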

So there you have it... Your URL has been turned into a two-element Hash Table. At the beginning I said you could use either GET or POST methods of submitting data. POST works the same way, in that the PHP programmer has another super global variable called $_POST that contains all of the posted info. The main difference is that there is effectively no limit on the size of POST data. This is how you can upload a lot of data (like this wall-of-text comment, or a video to YouTube).

matthewslyman wrote:

Why in the world did the developers of these languages base that hash purely on the POST data itself? What's wrong with just using a completely random hash, or a sequential numeric code? Surely that's good enough for the purpose I am imagining?

Sure, these variables *COULD* be stored inside a simple array. In a simple array, there is no concept of Keys or Values. It's all just data. To get the 1st item in an array, you would supply a "key" of 0. To get the 2nd item, you'd supply a "key" of 1 ... and so on. But these numbers are not really "keys". The number is simply an index. An array containing 32-bit numbers knows that each element in the array is 4 bytes long. So when you say "give me the 5th item in this array", the program goes to the starting address of the array (which for this example let's say is 98765), multiplies 4 (bytes) by 4 (the 5th element sits at index 4), and adds that to the address. So the program goes to that new address, which is now 98781 (98765 + 16), reads the next 4 bytes into a register, and returns. Now you have a copy of the 5th item of the array sitting in a register for you to use in your program.

The problem is that specifying numbers is not always adequate. What if you are looking for somebody's name? You don't know that the first name is stored as the 7th item of the array and the last name as the 8th. What is practical instead is to say "give me the item in the array corresponding to the key called 'first name'".

BUT, every time you want to get the item called "first name" out of an array, you definitely don't want to loop through the entire array until you find an element with a key matching the text "first name". That would take way too long, and that's why we use hashes.

At its most basic level, a HASH is just a binary number. The number comes from running the characters of the string through a math algorithm called a hash function. The algorithm spits out a fairly unique number for each string. It's deterministic, not random, so every time you run the string "first name" through the algorithm, it will spit out the same number.
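
PHP's engine, for example, uses the "times 33" DJBX33A string hash. A minimal PHP reimplementation of the idea (64-bit build assumed for the arithmetic):

<?php
// DJBX33A in miniature: start at 5381, then hash = hash*33 + next byte.
// Deterministic and fast, but easy to collide: "Ez" and "FY" contribute
// the same value, so any strings built from those blocks hash identically.
function djbx33a($key) {
    $h = 5381;
    for ($i = 0, $n = strlen($key); $i < $n; $i++) {
        $h = (($h * 33) + ord($key[$i])) & 0xFFFFFFFF; // keep 32 bits
    }
    return $h;
}

var_dump(djbx33a('EzEzEzEz') === djbx33a('FYFYEzFY')); // bool(true)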

The hash tells the table exactly where an entry lives. Say that "first name" results in a hash of 5000 and the table has 1024 slots; the entry goes into slot 5000 mod 1024 = 904. Having the hash point directly at a slot makes it many many many times faster to look up a key and get the associated value. It means the program doesn't have to loop through every item in the array. Instead it hashes the key it wants, jumps straight to that slot, and reads the value stored there, which on average takes the same tiny amount of time no matter how many items the table holds.
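
In code form, reusing the djbx33a() sketch above (PHP's real tables are powers of two, so the "mod" becomes a bit-mask):

$tableSize = 1024;                                  // power of two
$slot = djbx33a('first name') & ($tableSize - 1);   // same as mod 1024
// jump straight to $slot; no scanning, no searching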

The attack that is the subject of this article is due to the fact that some hash algorithms produce more unique results than others. If hash tables used a very strong algorithm like SHA-1, then it would be immune to this attack (assuming the hash table is designed to overwrite colliding hashes). The probability of a colliding hash would be so low that it would practically never happen (unless on purpose due to an attack), and therefore it will hardly ever matter if one of the GET/POST fields were overwritten by another.

But practically, hash tables don't use strong hash algorithms like SHA-1, because those take too long if you need high throughput. Instead they use very simple hashes that are very quick but can have tons of collisions. Because collisions are expected, the program can't simply overwrite colliding hashes onto the same key. (What if collisions are so common that "first name" hashes to the same number as "last name"? Then you'd end up with only one of the fields, because the 1st one to be inserted got overwritten by the 2nd.) Instead it has to store all the colliding values. Since all of the keys with the same hash number obviously land at the same position within the hash table, that position cannot store a simple value corresponding to the key. It has to store the address of another structure (usually a "linked list"), which stores all the colliding values. Now you're back to looping through a list to find the key you are looking for. That looping can slow your server down, so hopefully your hash algorithm is quick but also strong enough that it doesn't produce TOO many collisions. This attack, however, is based on submitting tons of data where all the field names are pre-calculated to hash to the same colliding number. That is the core idea of this type of DoS.
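
The pre-calculation is cheap, too: because concatenations of colliding blocks also collide, one short colliding pair is enough to breed an arbitrarily large set of equal-hash field names. A PHP sketch using the "Ez"/"FY" pair from above:

<?php
// Breed 2^16 sixteen-character keys sharing a single DJBX33A hash by
// concatenating the colliding blocks "Ez" and "FY" in every combination.
$keys = array('');
for ($round = 0; $round < 16; $round++) {
    $next = array();
    foreach ($keys as $k) {
        $next[] = $k . 'Ez';
        $next[] = $k . 'FY';
    }
    $keys = $next;
}
echo count($keys), " colliding keys\n"; // 65536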

Btw, most of the affected technologies on this list are actually NOT going to be vulnerable to this attack because decent webserver software like Apache (as mentioned in the linked white paper) limits the number of GET/POST fields to 100 by default, which pretty much stops this attack in its tracks because looping through just 100 collisions is not even noticeable from a performance degradation standpoint.

But even in C there is some protection: if quicksort heads into its pathological O(n**2) case, the system can stop the quicksort and switch to some other sorting algorithm.

That's normally called "introsort". I just had a quick look at the API docs for PHP and .NET, and the one sorting function I checked in each mentioned "quicksort" by name; no mention of falling back to another algorithm, and no mention of introsort. If I remember right, Java has always used merge sort for pretty much everything, so it should be fine.

Back in the late 80's I looked into the quicksort source code for the C library (I can't remember if it was Microsoft or Borland C; we were using both at the time). At that time the quicksort algorithm gave up after a set number of iterations and used a simple bubble sort to complete the sort. The source code comments claimed that after a certain number of quicksort passes, the data was set up in such a way that a bubble sort was faster than continuing with the quicksort algorithm.

My guess is that if this was common knowledge in the 80's, nothing has changed since then and the actual "quicksort" algorithm you call in C or C++ is really a "quicksort / some_other_sort" combination.

But who knows, dumber things have happened, and every programmer thinks they're smarter than the previous one. I wouldn't be surprised if some time since the late 80's, some genius-idiot programmer "fixed" the quicksort algorithm to make it more efficient.

It'd be interesting to know what the GNU library's quicksort algorithm does. Oh, what the heck, back in a second... ... there, that wasn't too hard.

The GNU "glibc" quicksort algorithm devolves to an insertion sort after a set threshold is reached. To quote from the source comments:

/* Once the BASE_PTR array is partially sorted by quicksort the rest is completely sorted using insertion sort, since this is efficient for partitions below MAX_THRESH size. ... */

Assuming most interpretive languages call a pre-compiled quicksort algorithm, I don't think a quicksort attack is going to be very effective.

The video was informative, but I gleaned more from the whitepaper linked on Usenix; in fact, all the information was essentially the same, and in all honesty I think the on-stage presentation butchered the whitepaper's description. I do wish Alexander Klink and Julian Wälde had given more credit to Scott Crosby and Dan Wallach than to the developers of CRuby and Perl.

I'm glad they took it upon themselves to investigate other server-side languages to see if they were vulnerable to this attack. It is most certainly a profound and important discovery, but I do think they could have offered more credit to some of the originators.

swr wrote:

This sort of attack is not new. Here's one relating to the Linux network stack back in 2003:

Everyone excuse me for not getting this... Why are hash functions required for processing POST data efficiently? I don't understand why it might be necessary to index the POST data, either as a whole or for individual variables. Is it just so that the POST data can be stored in case the application developer needs to access it, so that it can be retrieved quickly using a fixed-width indexed hash code? Why in the world did the developers of these languages base that hash purely on the POST data itself? What's wrong with just using a completely random hash, or a sequential numeric code? Surely that's good enough for the purpose I am imagining?

Parameters are generally put into tables of name/value pairs, and these hash the names for quick value lookups. The attack is to use names that hash identically, and hence force lookups to degrade into linear scans.

Quote:

As for not using a "nonce", and not randomising the hashes; that's just poor. It's shocking that this problem exists almost across the whole spectrum of languages... Is the industry really this bad at security?

Hash-tables are high-performance general-purpose data structures. Normally using a universal/randomized hash function would be a really bad idea, as it'd make the hashtable slower with no upside.

Btw, most of the affected technologies on this list are actually NOT going to be vulnerable to this attack because decent webserver software like Apache (as mentioned in the linked white paper) limits the number of GET/POST fields to 100 by default, which pretty much stops this attack in its tracks because looping through just 100 collisions is not even noticeable from a performance degradation standpoint.

Sorry for the TL;DR, but I hope this helps explain some stuff.

Is there opportunity for an IDS / IPS to detect this type of attack using signatures?

Oh, and thanks for the long read. I had no idea what web server hash tables were.

We've spent decades coming up with more abstraction layers and with higher level languages... and here we are looking at the quicksort C function to see if our code written in high level languages is vulnerable.

The hash tells the table exactly where an entry lives. Say that "first name" results in a hash of 5000 and the table has 1024 slots; the entry goes into slot 5000 mod 1024 = 904. Having the hash point directly at a slot makes it many many many times faster to look up a key and get the associated value. It means the program doesn't have to loop through every item in the array. Instead it hashes the key it wants, jumps straight to that slot, and reads the value stored there, which on average takes the same tiny amount of time no matter how many items the table holds.

The attack that is the subject of this article is due to the fact that some hash algorithms produce more unique results than others. If hash tables used a very strong algorithm like SHA-1, then it would be immune to this attack (assuming the hash table is designed to overwrite colliding hashes). The probability of a colliding hash would be so low that it would practically never happen (unless on purpose due to an attack), and therefore it will hardly ever matter if one of the GET/POST fields were overwritten by another.

Thanks. When the article said "hash collision" I was baffled, as I would think it would simply overwrite the value, but I was thinking of key collisions. Makes sense. I also had no idea these kinds of collisions could happen, since I mostly use Perl and apparently that was fixed years ago. In fact, I always find it fun to see how Data::Dumper shows different orderings of hash arrays, even when just looping through some code over and over.

So... even if there were no attacker, if our program written in .NET or PHP hits the worst case for the quicksort algorithm, it will eat all the CPU time whenever there are many entries to sort.

These pathological cases are very unlikely to happen unless someone is trying to make it happen. The larger the set of data, the less likely the case is to occur by chance. For small data sets, O(n^2) isn't even a problem.

It's not unheard of for data to naturally have structure that is related mathematically to the hashing algorithm or quicksort pivot selection, and thus consistently tickle the pathological case. I've never seen this happen in practice, though.

lake393 wrote:

The hash tells the table exactly where an entry lives. Say that "first name" results in a hash of 5000 and the table has 1024 slots; the entry goes into slot 5000 mod 1024 = 904. Having the hash point directly at a slot makes it many many many times faster to look up a key and get the associated value. It means the program doesn't have to loop through every item in the array. Instead it hashes the key it wants, jumps straight to that slot, and reads the value stored there, which on average takes the same tiny amount of time no matter how many items the table holds.

Back in the late 80's I looked into the quicksort source code for the C library (I can't remember if it was Microsoft or Borland C; we were using both at the time). At that time the quicksort algorithm gave up after a set number of iterations and used a simple bubble sort to complete the sort. The source code comments claimed that after a certain number of quicksort passes, the data was set up in such a way that a bubble sort was faster than continuing with the quicksort algorithm.

That's something else... Basically, quicksort scales much better than bubble sort, but for really small lists, bubble sort can be faster than quicksort. When recursing, eventually quicksort is dealing with data small enough that it's faster to pass that data off to bubble sort rather than recursing more.

That's not the same as introsort, though. Introsort does quicksort but checks whether it has recursed far more deeply than the average case would, and then falls back to a guaranteed O(n*log(n)) algorithm (usually heapsort) so that it is never more than a constant multiple worse than the average case. It never uses bubble sort as the fallback; trying to use an algorithm with O(n**2) scaling in order to avoid O(n**2) scaling is a losing proposition.

[edit] (Introsort may also use bubble sort or some other "O(n**2) but fast" algorithm for small subsets of data, just like quicksort might, but it still also requires a guaranteed O(n*log(n)) fallback for the too-much-recursion case. And that fallback might in turn use an O(n**2) algorithm for small subsets of data.) [/edit]
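
For reference, a compact PHP sketch of the introsort control flow just described. PHP's built-in sort() merely stands in for the heapsort fallback to keep the sketch short; real implementations use heapsort there:

<?php
// Quicksort with a depth budget of ~2*log2(n). A branch that recurses past
// the budget is in pathological territory, so that slice is handed to a
// guaranteed O(n log n) fallback instead of recursing further.
function introsort(array $a, $depth = null) {
    $n = count($a);
    if ($n <= 1) return $a;
    if ($depth === null) $depth = 2 * (int) ceil(log($n, 2));
    if ($depth <= 0) {          // too deep: fall back (real code: heapsort)
        sort($a);
        return $a;
    }
    $pivot = $a[(int) ($n / 2)];
    $lo = $eq = $hi = array();
    foreach ($a as $x) {
        if ($x < $pivot)     { $lo[] = $x; }
        elseif ($x > $pivot) { $hi[] = $x; }
        else                 { $eq[] = $x; }
    }
    return array_merge(introsort($lo, $depth - 1), $eq, introsort($hi, $depth - 1));
}

print_r(introsort(array(5, 1, 4, 2, 3))); // 1, 2, 3, 4, 5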
