The Unordered Data Structures course covers the data structures and algorithms needed to implement hash tables, disjoint sets and graphs. These fundamental data structures are useful for unordered data. For example, a hash table provides immediate access to data indexed by an arbitrary key value, which could be a number (such as a memory address for cached memory), a URL (such as for a web cache) or a dictionary. Graphs are used to represent relationships between items, and this course covers several different data structures for representing graphs and several different algorithms for traversing graphs, including finding the shortest route from one node to another node. These graph algorithms also depend on another concept called disjoint sets, so this course also covers the disjoint-set data structure and its associated algorithms.

Instructor:

Wade Fagen-Ulmschneider

Transcript

[MUSIC] A second way to deal with collisions inside of an array is to use something called probe-based hashing. This collision strategy doesn't use a list whatsoever. Instead, we're going to develop an entirely new strategy to deal with this. To do that, we're going to consider an array just like before, but now our array is actually going to store the data itself, as opposed to being an array of linked lists. So let's take a similar input set, run it through a similar hash function, and use a different strategy when we run into a collision.

Here we have the same set of data and our same hash function, so I'm going to go ahead and run through this just as before. 16 mod 7 is going to be 2, so I'm going to add 16 at index 2. 8 mod 7 is going to be 1, so I'm going to add the value 8 at index 1. 4 mod 7 is 4, so I'm going to add 4 at index 4. And then 13 mod 7 is going to be 6, so I'm going to add 13 at index 6. Now we're at 29, and 29 mod 7 is 1. And now we go to our new strategy. In probe-based hashing, if we run into a collision, instead of using a linked list to handle the collision, we're going to simply look at the next location inside of the array. By looking at the next location, we're just going to probe ahead until we find an empty slot. Once we find that empty slot, we can write the value right there. So looking at this example, we can't enter 29 at index 1, because we already have 8 there, so we need to probe ahead. Index 2 is also filled up, with 16, so probing ahead again, we can finally put 29 in index 3. Next is 11, and 11 mod 7 is 4. We look at index 4 and find index 4 is already filled up. We probe ahead and find that we can put 11 in index 5. Finally, we get to the number 22. 22 mod 7 is 1, so we have to start at 1 and probe ahead until we find an empty slot. Notice that we have to go through the entire array and loop all the way around until we get to 0 before we can insert the value 22. We had to look at every single element in the array.
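The linear probing walkthrough above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's implementation: the function name `linear_probe_insert` and the use of `None` to mark an empty slot are assumptions made for the example, while the keys, the table size of 7, and the `k mod 7` hash come from the lecture.

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing and return the index where it landed."""
    size = len(table)
    start = key % size  # hash function from the lecture example: k mod 7
    for i in range(size):
        index = (start + i) % size  # step one slot at a time, wrapping around
        if table[index] is None:    # None marks an empty slot (assumption)
            table[index] = key
            return index
    raise RuntimeError("hash table is full")


table = [None] * 7
for key in [16, 8, 4, 13, 29, 11, 22]:
    linear_probe_insert(table, key)

print(table)  # [22, 8, 16, 29, 4, 11, 13]
```

The final key, 22, starts at index 1 and only finds a free slot after wrapping around to index 0, which matches the walkthrough.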
Only on our very last chance were we able to enter that data. You can imagine this is going to cause some problems. We may not want to just always insert our data at the same location, and we're going to find the runtime of this is going to be problematic. We can express this mathematically with this simple function right here, which says that we're going to take a single step every single time we try. But what's going to happen with linear probing, when we probe one element at a time, is that we have something called a primary cluster forming. You already saw an intuition of this when we went through the example earlier. The intuition here is that as soon as you start getting a pack of numbers filling up, a lot of things are going to have to be probed until the end of the pack. So if you imagine this array, we have a number of elements that have been filling up right here in this space. It's getting pretty crowded; lots of things are hashing here. We may hash another value here. So now if we hash to anything in this region, it's always going to be probed until this point right here. So we have something called a primary cluster: the idea that even with a uniform distribution, based on the laws of probability, we're going to end up with a cluster that has a primary set of values all in a single block, while the rest of the array is going to be fairly sparse.

The remedy to this is to make sure our probing isn't exactly linear. So let's do something called double hashing to fix this problem. In double hashing, instead of just having a linear probe every single time, we're going to have a secondary hash function that's going to allow us to hash into a new index that's not necessarily immediately following the other points. So let's look at that. Here we can see that instead of adding the number of the current try to k (this was 0, 1, 2, and so on), we're going to multiply that try number by a new hash function.
So here, given a key, and given the number of times we're hashing it, h1 is going to be our first hash function; this is k mod 7. Meanwhile, h2 will be our second hash function, which says how far we need to jump at each point. We want our second hash function to output some value that's not 0 and is less than the size of the array. So one can imagine that our second hash function might be something like 5 minus k mod 5. Here we know that k mod 5 has the output range of 0, 1, 2, 3, or 4. So 5 minus 4 is 1, and 5 minus 0 is 5. So our step size is going to be 1, 2, 3, 4, or 5 when we do double hashing.

Diving into this example to really see what's going on, we're going to start with 16. 16 mod 7 is still going to be 2, so we go ahead and insert 16 at index 2. 8 mod 7 is 1, so we enter 8 at index 1. 13 mod 7 is 6, so we insert 13 at index 6. And I skipped 4, but it's totally fine to do it now. 4 mod 7 is 4, and we stick 4 right here at index 4. Now we get to the point where our collisions begin. 29 mod 7 is 1, and now we have a collision at 1. Because we don't want to simply do linear probing, we need to apply double hashing. That means we're going to apply a second hash function. 29 mod 5 is going to be 4, and 5 minus 4 is 1. So we're effectively going to do linear probing on 29, because our step size is 1. That's fine, we only need to jump two spots, and we can go ahead and put 29 at index 3. The next number we get is 11. Using our original hash function, 11 mod 7 is 4, and 4 is filled up. Using double hashing, we now apply our second hash function: 11 mod 5 is 1, and 5 minus 1 is 4, so our step size is 4. So starting at index 4, we're going to jump ahead 1, 2, 3, 4 indices, wrapping around to index 1, which is filled, and then jump 4 more ahead, 1, 2, 3, 4, and find that we can go ahead and insert the number 11 at index 5, right here immediately following 4, but we did so using a different process.
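As a rough sketch of the double hashing example just described, the probe position for try number i can be written as (h1(k) + i * h2(k)) mod 7. The transcript does not show the slide's formula, so that exact form is an assumption, as are the function names and the use of `None` for empty slots; the two hash functions, the table size, and the keys are the ones from the lecture.

```python
def h1(key, size=7):
    """First hash function from the lecture: k mod 7."""
    return key % size


def h2(key):
    """Second hash function from the lecture: 5 - (k mod 5), always 1..5."""
    return 5 - (key % 5)


def double_hash_insert(table, key):
    """Insert key with double hashing and return the index where it landed."""
    size = len(table)
    start, step = h1(key, size), h2(key)
    for i in range(size):
        # Assumed probe formula: jump i * step slots from the home index.
        index = (start + i * step) % size
        if table[index] is None:  # None marks an empty slot (assumption)
            table[index] = key
            return index
    raise RuntimeError("no open slot found")


table = [None] * 7
for key in [16, 8, 13, 4, 29, 11]:
    double_hash_insert(table, key)

print(table)  # [None, 8, 16, 29, 4, 11, 13]
```

Because the table size 7 is prime and every step is between 1 and 5, each key's probe sequence visits all seven slots before repeating, so the loop finds an open slot whenever one exists.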
What we find when we do this on a large scale is that this approach of randomizing how far we step each time is going to avoid the idea of a primary cluster. Though the big idea you should take away from this is that no matter what strategy we use, as our hash table becomes full we're going to see our performance degrade. And we can actually quantify this by looking at the load factor and expressing the number of probes needed to resolve a collision and find an open spot as a function of it. I've done this for you, and it takes a lot of math to get to this point, to find the exact equations for linear probing, double hashing, and separate chaining that give the total number of collisions expected and the running time to insert using these different techniques. The goal is not to understand the equations themselves but to understand how they behave as the load factor of our hash table changes. So what this means is we can look at these equations and see where alpha is in all of them. Here, alpha is in the denominator, and here, alpha is in the denominator. You're going to see 1 minus alpha, 1 minus alpha, 1 minus alpha in the denominators, so as alpha increases toward 1, those denominators shrink and the running time grows. So the running time is going to get worse the larger the load factor is. This intuitively makes sense: as alpha increases, the running time of our hash table gets worse. Now, I don't have an intuition of exactly how these functions behave just from the equations. So I went ahead and graphed these functions, so that we can actually see a graphical representation of them. Using a graphical representation, I can see that with linear probing there's an absolutely small amount of running time at low alpha. Alpha is here on the x axis, and running time is here on the y axis. As alpha increases, the running time increases as well.
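The equations on the slide are not reproduced in the transcript. As a stand-in, the sketch below evaluates the commonly cited textbook approximations for the expected number of probes under the uniform hashing assumption; whether these match the slide exactly is an assumption, and the function names are made up for the example. Each function returns a pair: the expected probes for a successful search, then for an unsuccessful search or insert.

```python
import math


def linear_probing_probes(alpha):
    """Textbook approximation for linear probing at load factor alpha < 1."""
    hit = 0.5 * (1 + 1 / (1 - alpha))
    miss = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    return hit, miss


def double_hashing_probes(alpha):
    """Textbook approximation for double hashing at 0 < alpha < 1."""
    hit = (1 / alpha) * math.log(1 / (1 - alpha))
    miss = 1 / (1 - alpha)
    return hit, miss


def separate_chaining_probes(alpha):
    """Textbook approximation for separate chaining at load factor alpha."""
    return 1 + alpha / 2, 1 + alpha


# Tabulate the insert (unsuccessful search) cost as the table fills up.
for alpha in (0.3, 0.6, 0.9, 0.99):
    lp = linear_probing_probes(alpha)[1]
    dh = double_hashing_probes(alpha)[1]
    sc = separate_chaining_probes(alpha)[1]
    print(f"alpha={alpha:<4}  linear={lp:9.2f}  double={dh:7.2f}  chaining={sc:5.2f}")
```

Under these approximations the probing strategies stay within a handful of probes up to an alpha of about 0.6 and then grow sharply as alpha approaches 1, which is the behavior the graphs in the lecture are meant to show.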
Likewise, with double hashing, as alpha increases, the running time is flat for a while, and then it explodes as alpha gets close to 1.0. What this really shows is that as long as we manage alpha, we can have a very predictable running time for our algorithm. So specifically, let's look at having alpha at the value of 0.6. If we manage alpha to always be at or below 0.6, notice that the running time of this algorithm is extremely fast. These are absolutely amazing running times given an alpha value less than 0.6. What this means is it doesn't matter how much data is in our array. Our array can have a billion records, because alpha is only the ratio between the actual amount of data in the array and the size of the array. As long as we can keep expanding our array, our running time is absolutely phenomenal, no matter how much data is inside of our data structure. Whether our hash table has a billion records or just a hundred records, the running time is determined only by the ratio of the amount of data in our hash table to the size of the array, and not by the actual amount of data itself. So what this means is that the running time is constant with respect to the amount of data, for any hashing technique, as long as we keep alpha constant. That is, we have an O(1) running time with respect to the data, because the running time depends on alpha, not on the amount of data in the array.

The one last little bit to complete this whole idea is that we are going to have to resize the array every so often. When we resize the array, we have to be very, very careful. If we simply resize the array by copying every value over, as we've done with every other resize, you'll notice that resizing the array is going to change where values get hashed to, because our hash function has a compression step.
That compression step says that whatever the hash value is, we're going to mod it by some value. So the one important part to think about is what happens when we do have to resize the array. The one thing we know is that we absolutely need to maintain that ratio, that alpha value, between the amount of data inside of our table and the actual size of our array. As we learned earlier, when we resize an array, we're always going to want to double that array. And in doubling that array, we're going to have to do something called rehashing. Rehashing is the idea that if we take a value from the array, the original spot it hashed to is not necessarily going to be the new spot it hashes to, due to the compression step inside of our hash function. Remember, a hash function turns the key into an integer and then mods it by a value. If we consider this example right here, we might have originally hashed something to index 1. When we expand the array, that value may still hash to index 1, or it may hash to index 8 or 9, depending on the size of the array. Because of this, when we expand the size of the array, we need to go through and rehash every single value in the array to make sure it's in the proper spot.

Right now we have a complete understanding of the entire hash table system. We know that we need a great hash function that spreads our data uniformly across the table. We know that function needs to be quick and deterministic and satisfy SUHA. We know we have an array, and that we have to manage the size of that array to keep a great ratio: an alpha value less than 0.6 gives extremely strong performance. And we know what we need to do when collisions do happen: we have different strategies, either using a linked list or using linear probing or double hashing, to handle these collisions. All of this gives us an amazing opportunity to develop a really awesome algorithm and do some really awesome things with hashing.
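The rehashing step described above can be sketched as follows. This is a minimal illustration that assumes linear probing, integer keys hashed as key mod table size, and `None` for empty slots; the function names are hypothetical, and the starting table is the full 7-slot table from the linear probing example.

```python
def insert(table, key):
    """Linear-probing insert using key mod table size as the hash."""
    size = len(table)
    for i in range(size):
        index = (key + i) % size
        if table[index] is None:  # None marks an empty slot (assumption)
            table[index] = key
            return index
    raise RuntimeError("hash table is full")


def load_factor(table):
    """Alpha: number of stored keys divided by the size of the array."""
    return sum(slot is not None for slot in table) / len(table)


def resize_and_rehash(old_table):
    """Double the array and re-insert every key under the new modulus."""
    new_table = [None] * (2 * len(old_table))
    for key in old_table:
        if key is not None:
            insert(new_table, key)  # rehash instead of copying index to index
    return new_table


old_table = [22, 8, 16, 29, 4, 11, 13]  # full table from the earlier example
new_table = resize_and_rehash(old_table)

print(load_factor(old_table), load_factor(new_table))  # 1.0 0.5
print(new_table)
```

In this run the key 8, which sat at index 1 in the 7-slot table, now hashes to index 8 and lands at index 9 because 22 already occupies index 8. A plain index-to-index copy would have left it at index 1, where a lookup using mod 14 would never find it.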
Well, in the next video, we'll discuss a final little bit of hashing and do some analysis of the entire system as a whole. So I'll see you then. [MUSIC]