The Unordered Data Structures course covers the data structures and algorithms needed to implement hash tables, disjoint sets and graphs. These fundamental data structures are useful for unordered data. For example, a hash table provides immediate access to data indexed by an arbitrary key value, which could be a number (such as a memory address for cached memory), a URL (such as for a web cache) or a dictionary. Graphs are used to represent relationships between items, and this course covers several different data structures for representing graphs and several different algorithms for traversing graphs, including finding the shortest route from one node to another node. These graph algorithms also depend on another concept called disjoint sets, so this course also covers that data structure and its associated algorithms.

Instructors

Wade Fagen-Ulmschneider

Teaching Assistant Professor

Video transcript

[MUSIC] To really understand what a hash function does, let's take a look at a couple of different hash functions through examples of data being put into a hash table. So this first example is a number of faculty who teach courses here at the University of Illinois. Here at the University of Illinois, there are courses like 241, which is taught by Professor Lawrence Angrave. 421, Professor Beckman. 210, Professor Cunningham. 101, Professor Davis. Carl Evans teaches 126. Professor Fagen-Ulmschneider teaches 225. And we can see the list goes on and on and on. Here we need to figure out a function that allows us to map all of these names into an array index. So if you noticed, I've chosen a particularly unique set of professors and listed them in a particularly special order so we can have a really great function. Notice that there is exactly one professor for each first letter, A through H. Our hash function here may be the function where we look at our key string, take the character at index zero, and subtract the character value A. So Angrave is going to be character value A minus character value A, which equals zero. So our key Angrave is going to be placed in index zero and his value is 241. Beckman, B, B minus A, is going to equal one, so Beckman is placed into the second position, index one in our array, with the value 421. And we can continue this on, and you'll notice that C goes into index 2, and so on for D, E, F, G, and H. Each of these letters has a perfect mapping into our array from our key space. And in fact every single element in our array is filled and has a unique mapping back to the original data. This idealized hash function, this perfect hash function, is something we mathematically call an onto function: every single element in the array is full, and we can map every single element of our data onto that array. This is an amazing setup and it's going to be the absolute goal of what we want.
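The first-letter mapping described above can be sketched in a few lines of Python. This is only an illustration: the function name `first_letter_hash`, the `faculty` dictionary, and the six-slot table are assumptions made for the example, and only the six names stated in the lecture are included (the lecture's list continues through H).

```python
def first_letter_hash(key: str) -> int:
    """Map a name to an array index: first character minus the character 'A'."""
    return ord(key[0].upper()) - ord("A")

# Names and course numbers exactly as stated in the lecture.
faculty = {
    "Angrave": 241,
    "Beckman": 421,
    "Cunningham": 210,
    "Davis": 101,
    "Evans": 126,
    "Fagen-Ulmschneider": 225,
}

# Assumption for this sketch: one slot per name, so six slots (A through F).
table = [None] * len(faculty)
for name, course in faculty.items():
    table[first_letter_hash(name)] = (name, course)

print(table[0])  # ('Angrave', 241)
print(table[1])  # ('Beckman', 421)
print(all(slot is not None for slot in table))  # True: every slot is filled
```

Because each name here starts with a different letter, every key gets its own slot and no slot is left empty, which is the property the lecture calls a perfect mapping.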
But there's a slight problem: a new faculty member might arrive whose name starts with the same letter as one of these names. When we have to consider that faculty member, there's no longer space. So if we have a new faculty member, Professor Charles, Professor Charles is unfortunately going to land right here where Professor Cunningham currently is. That's a problem, and we're going to need to handle that collision. So let's remember that problem, and we'll deal with it again in a second. A second hash function is one of my favorite dice games. So whenever I have a group of people, and if you were here in person, I would roll a number of dice for you. And I would show the result of the roll and give you a number for what the dice say. So here, let's look at this set of dice. Here we have dice showing one, two, three, four, and one. I would give you the number two as the mapping for this set of dice. I can roll the dice a number of times and give you a different set of numbers. And what I'm doing is something called petals around the rose. This is a particularly fun example because it will take a while for people to figure out. So I encourage you to try it. But what I'm doing is simply counting any die that has an active center pip, such as the three, and counting how many petals are around that pip. Here the one does have the center pip, but it has no petals around it. The four doesn't have the center pip, so I'm not counting it. Only the three and the five are counted in petals around the rose. So here, my hash function is the function petals around the rose. This function allows us to look at any set of inputs, here the inputs are dice, and map that to a value inside of our array. So look at this first set where I have the dice one, two, three, four, and one. My hash function, as we discussed earlier, maps it to two. So that means here at index two, I have the roll: die one, die two, die three, die four, and die one again.
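The counting rule described above can be written as a small Python sketch. The function name `petals_around_the_rose` and the lookup table `PETALS_PER_FACE` are illustrative names, not something from the lecture; the rule itself (a three counts two petals, a five counts four, every other face counts zero) is the one the lecture states.

```python
# Petals contributed by each face: only faces with a center pip AND
# surrounding pips count. A three has 2 petals, a five has 4.
PETALS_PER_FACE = {3: 2, 5: 4}

def petals_around_the_rose(dice):
    """Hash a roll of dice to the total number of petals around the rose."""
    return sum(PETALS_PER_FACE.get(face, 0) for face in dice)

roll = [1, 2, 3, 4, 1]
print(petals_around_the_rose(roll))  # 2
```

For the roll in the lecture, only the three contributes, so the roll hashes to two.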
So we know that value two is going to map to that data or some other information associated with it. One thing I might do is actually just keep a tally of exactly how many there are. So let's actually discuss some things that are going to happen to this hash table. So given that each counted die contributes either two or four petals around the rose, think about what happens for the hash value one. Will anything ever get mapped to one? It won't. The only way we're going to count petals is for a three we count two or for a five we count four. So every single odd number is never going to be reached in our hash table. This hash table is not going to be perfectly filled. And if we think about all the possible rolls of the dice, think about the fact that we may have a lot of rolls that contain just a single five. So there may be a lot of different rolls that get mapped to the value four. Both of these problems are concerns that we need to address to ensure that we have a great hash function. We need to consider a hash function that works really, really well across our entire key space. And there are going to be three characteristics that we can analyze to determine whether or not we have a great hash function. So to dive into this hash function, we want to look at the hash function in two different pieces. The first piece of the hash function that we need to look at is the function itself: how we transform whatever our input is into a number. So we say the hash itself is going to transform an input into an integer. In this initial transformation, I'm often not going to worry about the bounds of that integer. So I just need some generic function that's going to turn whatever my input is into a numeric form, into integer form. The second part of my hash function is going to be a compression to make sure that the result is within the bounds of the array. The compression can be easily done with a mod operator. So I can do that using mod N, where N is the size of the array.
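The two pieces described above, a hash that produces some integer and a compression that brings it within the array bounds, might look like the following Python sketch. The names `to_integer`, `compress`, and `hash_index` are hypothetical, and the base-31 polynomial is just one simple way to turn a string into an integer; the lecture does not prescribe a particular function.

```python
def to_integer(key: str) -> int:
    """Piece 1: turn the key into an integer, ignoring the array bounds."""
    value = 0
    for ch in key:
        # Base-31 polynomial over character codes (an illustrative choice).
        value = value * 31 + ord(ch)
    return value

def compress(value: int, n: int) -> int:
    """Piece 2: bring the integer within the array bounds using mod N."""
    return value % n

def hash_index(key: str, n: int) -> int:
    """Full hash: transform to an integer, then compress into [0, n)."""
    return compress(to_integer(key), n)

print(to_integer("AB"))     # 2081  (65 * 31 + 66)
print(hash_index("AB", 8))  # 1     (2081 mod 8)
```

Note that `to_integer` walks the whole string, so its cost grows with the key length; it is constant time only if key length is treated as bounded.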
So I encourage you, as you're thinking of hash functions, do not create any new hash functions yourself yet. It is actually really, really hard to make a good hash function, and there are some amazing hash functions out there. So we want to understand how to analyze a hash function before we go about creating our own. As I mentioned earlier, there are three different characteristics that we're going to care about when we're building a hash function. The first one is that we need to make sure our hash function runs in constant time. We want to ensure the time to compute a hash is going to be O(1). We need to absolutely make sure our hash function runs in constant time, and we prefer that this is a very, very, very fast function. It is absolutely essential that the computation time of the hash is extremely quick. If we're spending a long time computing the hash, then we're going to spend a long time on this algorithm because we have to compute the hash every time we see a piece of data. So the first thing that makes a great hash function is that the hash function must run in O(1) time. The second thing we need to absolutely make sure is true about a hash function is that a hash function must be deterministic. What this means is if we hash a string once, and we hash the exact same string a second time, those two results must be exactly the same. What we cannot do is throw a random number in there. We might love to do this because it would so thoroughly randomize where our input lands in the result, but by throwing a random number in there we no longer have a deterministic hash. Every time you want to hash the number 103, or hash the string Wade, you absolutely must ensure that it comes out the other side as the same index into our array. The hash function's second requirement is that it must be deterministic. The third requirement is the hardest requirement to ensure is true. And this requirement is called the simple uniform hashing assumption, or SUHA.
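To see why determinism matters, here is a minimal Python sketch contrasting a deterministic hash with one that mixes in a random number. Both functions (`deterministic_hash` and `randomized_hash`) are hypothetical examples written for this comparison; summing character codes is used only because it is easy to follow, not because it is a good hash.

```python
import random

def deterministic_hash(key: str, n: int) -> int:
    """Same key and same n always give the same index."""
    return sum(ord(ch) for ch in key) % n

def randomized_hash(key: str, n: int) -> int:
    """Broken on purpose: adds a random number, so results vary per call."""
    return (sum(ord(ch) for ch in key) + random.randrange(n)) % n

# Hashing the same string twice gives the same index.
print(deterministic_hash("Wade", 8))  # 1  (385 mod 8)
print(deterministic_hash("Wade", 8) == deterministic_hash("Wade", 8))  # True

# With the randomized version, the same key lands on many different indices,
# so a value stored under one index could not reliably be found again.
indices = {randomized_hash("Wade", 8) for _ in range(200)}
print(len(indices) > 1)  # True (with overwhelming probability)
```

With the randomized version, an insert and a later lookup of the same key would usually disagree about which slot to use.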
The simple uniform hashing assumption says that the result of our hash algorithm must be uniform across the entire key space. So what that means is, under SUHA, given two values, we're going to take the hash of some value a and the hash of a second value b. When we hash these two values, we need the probability of the hash of a being equal to the hash of b to be equal to one over the size of the array, if a does not equal b. So if we have two different values, the simple uniform hashing assumption says that each of them is equally likely to land in any position in the array, independently of the other. Any time you have bunching of data, such as in the petals around the rose example earlier, where multiple pieces of data hash to the same value and no data ever hashes to an odd number, we no longer satisfy the simple uniform hashing assumption. You can see that the probability of hashing to one is zero, while the probability of hashing to values like two or four is quite high. So if we have a function that runs in constant time, is deterministic, and satisfies the simple uniform hashing assumption, then we have an absolutely great start to building a great hash function. We'll largely use other people's hash functions for most of our work here since it's so hard to ensure that uniform distribution. But we'll do some analysis on hash functions to understand exactly their run time. We'll talk about some of this analysis and dive into a few more examples in the next lecture. I'll see you then. [MUSIC]
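The bunching in the petals around the rose example can be checked by brute force. The Python sketch below enumerates every possible roll and tallies how often each hash value occurs. It assumes five six-sided dice, matching the five-dice roll shown in the lecture; the function and variable names are illustrative.

```python
from collections import Counter
from itertools import product

def petals_around_the_rose(dice):
    """A three counts 2 petals, a five counts 4, every other face counts 0."""
    petals_per_face = {3: 2, 5: 4}
    return sum(petals_per_face.get(face, 0) for face in dice)

# Assumption: five six-sided dice, so 6**5 = 7776 possible rolls.
rolls = product(range(1, 7), repeat=5)
counts = Counter(petals_around_the_rose(roll) for roll in rolls)
total = sum(counts.values())

for value in range(0, 21):
    share = counts[value] / total
    print(f"hash value {value:2d}: {counts[value]:5d} rolls ({share:.3f})")

print(counts[1])  # 0: no roll ever hashes to an odd value
```

Every odd value has a count of zero while a handful of small even values absorb most of the rolls, which is far from the uniform spread that SUHA asks for.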