I have been interested in Information Security. I was recently introduced to the idea of hashing. What I currently understand about hashing is that it takes the password a user enters. Then it randomly generates a "hash" using a bunch of variables and scrambling everything. Then when you enter this password to log in it matches that password to the hash. There are just a couple of things I don't understand about it.

Why is it so hard to crack these hashes? I would assume that once you
found the method they are using to encrypt it, you could reverse it (let's
go with an extremely simple one like Caesar's cipher: once you find out how
many places you have to shift over, you can do it for whole books). Even if
it uses something like the time and jumbles it, there are some really big
ways you can limit the options (sticking with the Caesar cipher: if they're
using the year mod x, you already know that there are two possible
years realistically, so you just have to figure out the second
piece of the puzzle).

If they are generated randomly (even if two passwords are the
same they come out differently) how can they tell if it's correct?

How are they cracked? How does Hashcat know when it has
successfully decrypted the password?

As a tiny answer to Q(3): more specifically, programs like oclHashcat in most cases try millions of candidate passwords from a predetermined list. They never actually 'decrypt' the password (remember you can only decrypt encryption - hashing != encryption), but they know that if they try a password and the resulting hash matches the one they have, it must have been the original password. I.e., they don't decrypt, they do trial and error millions of times a second to see if they can get a match. This is why it's also good for a hash to be slow.
– Peleus, Apr 7 '13 at 5:36

@Peleus This is a lot like what I was getting at. The only thing is I thought that when hashing the password they scramble it randomly. How do they take the password and re-scramble it with the same random movements? And if the same input can give a different output, that confuses me also.
– Griffin Nowak, Apr 7 '13 at 17:52

I'm not sure if you're saying "I thought they scrambled it randomly" as in you've learnt differently now, but just so you know it's definitely not the case! Hashing is not random, it's repeatable - but it's impossible to work backwards that's all. A SHA256 hash of the word 'cat' will always be the same 100% of the time. That's why we can use them reliably for passwords. If the hash produced a new value every time, and we could only compare against a previous hash value, we'd never know if the password was right or not! :D
– Peleus, Apr 8 '13 at 0:19

5 Answers

Which one is easier: multiplying two numbers, or recovering the two operands given only their product? Multiplication. It's easier to perform a multiplication (just follow the rules mechanically) than to recover the operands given only the product. (This, by the way, is the foundation of some cryptographic algorithms such as RSA.)

Cryptographic hash functions have different mathematical foundations, but they have the same property: they're easy to compute going forward (calculate H(x) given x), but practically impossible to compute going backward (given y, calculate x such that H(x) = y). In fact, one of the signs of a good cryptographic hash function is that there is no better way to find x than trying them all and computing H(x) until you find a match.

Another important property of hash functions is that two different inputs have different hashes. So if H(x1) = H(x2), we can conclude that x1 = x2. Mathematically speaking, this is impossible — if the inputs are longer than the length of the hash, there have to be collisions. But with a good cryptographic hash function, there is no known way of finding a collision with all the computing resources in the world.

Note that a hash function is not an encryption function. Encryption implies that you can decrypt (if you know the key). With a hash, there's no magical number that lets you go back.

The main recommended cryptographic hash functions are the SHA-2 family (which comes in several output sizes, mainly SHA-256 and SHA-512). SHA-1 and MD5 are older ones, now deprecated because they have known collisions. Ultimately, there is no mathematical proof that the recommended functions are indeed good cryptographic hash functions, only a widespread belief because many professional cryptographers have spent years of their life trying, and failing, to break them.

Ok, that's one part of the story. Now a password hash is not directly a cryptographic hash function. A password hash function (PHF) takes two inputs: the password, and a salt. The salt is randomly generated when the user picks his password, and it is stored together with the hashed password PHF(password, salt). (What matters is that two different accounts always have different salts, and randomly generating a sufficiently large salt is a good way to have this property with overwhelming probability.) When the user logs in again, the verification system reads the salt from the password database, computes PHF(password, salt), and verifies that the result is what is stored in the database.

This answers (2) and (3) — the legitimate verifier and the attacker find out in the same way whether the password (entered by the user, or guessed by the attacker) is correct. A final point in the story: a good password hash function has an additional property, it must be slow. The legitimate server only needs to compute it once per login attempt, whereas an attacker has to compute it once per guess, so the slowness hurts the attacker more (which is necessary, because the attacker typically has more, specialized hardware).
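The store-and-verify flow described above can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: it assumes PBKDF2-HMAC-SHA256 from Python's standard library as the PHF, and the function names, the 16-byte salt, and the iteration count are all choices made for this example.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor for illustration; tune for real hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest), generating a fresh random salt if none is given."""
    if salt is None:
        salt = os.urandom(16)  # new salt whenever a password is chosen
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Recompute PHF(password, salt) and compare it with the stored value."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)


# Registration: store the salt and the digest, never the password itself.
salt, stored = hash_password("correct horse")

# Login: the same salt is read back and the computation is repeated.
print(verify_password("correct horse", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))    # False
```

An attacker holding the salt and digest would run exactly the same computation for each guess, which is why the iteration count matters.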

Damn I come to the security site from all the others and the one thing that is very clear is you guys put an insane amount of work into answering. Not only correctly but extremely thorough. I wish I could select two answers but yours was far more like what I was looking for.
– Griffin Nowak, Apr 6 '13 at 22:06

@Griffin - You can up-vote both, though. Or indeed - when there are more than two answers - up-vote all that you feel were helpful, even if you can accept only one. Many questions here have more than one good answer, and sometimes it's even recommended to read most of the answers to get a better understanding of the topic at hand. Yes, sometimes even the down-voted ones. By your voting (either way), you also help future readers decide on the validity of answers, especially those readers that are still learning about a certain topic. ;)
– TildalWave, Apr 6 '13 at 22:25

I up voted both! They were extremely useful.
– Griffin Nowak, Apr 6 '13 at 22:35

+1: All the answers are good, but this one is about as close to a perfect answer as I've ever seen on Stack Exchange. Would +10 if I could.
– Ilmari Karonen, Apr 7 '13 at 13:19

Cryptographic hash functions are mathematical objects which can be described as "a big mixing and scrambling of some bits". They take as input a sequence of bits (possibly a very long one) and offer an output of fixed size. Roughly speaking, they are so tangled that although there is nothing secret about them (that's just deterministic code), nobody can figure out how to "invert" them (find a matching input for a given output) except by the basic method called "luck": try random inputs until a match is found.

How it may happen, scientifically, that hash functions can exist at all is a good question.

Hashing is not encryption. There is no secret, no key in hashing.

Hash functions have many uses; one of them is "password storage". A hash function looks like a good thing for password storage. We do not want to store passwords directly (otherwise an occasional peek at our databases by the attacker would give him too much information; see this blog post for a discussion); we want to store password verification tokens: something which allows for the verification of a password (that the user presents) but does not reveal the password itself. So the idea is: let's store the hash of the password. When a password is to be verified, we just compute its hash and see if it matches the stored value. But guessing the password from the hash value only is hard, since the hash function is resilient against "inversion" (see above).

Since passwords are a special kind of data (that's data which humans can remember), for proper security, we need a "strengthened" hash function:

We want a very slow hash function.

We do not want one hash function, but many distinct hash functions, so that each password will be hashed with its own hash function; this is about deterring parallel attacks. This process of turning a single hash function into many variants is called salting.

See this answer for a thorough treatment of the subject of hashing passwords.

Hashing is a function from some bit string (usually variable length) to another bit string (usually smaller, and of fixed length).

Hashing is used in databases for data retrieval, and in in-memory data structures called hash tables. It allows us to reduce arbitrary data, such as a character string or a complicated object with many fields, to a binary number which can then be used directly as an index into a sparse array to fetch the associated data (with some details for handling hash collisions).

The hashing functions used in the above manner are "cousins" of cryptographic hashing functions. They are designed to different requirements. They must be fast to compute, and achieve a good distribution.
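As a rough sketch of that non-cryptographic use, here is a toy hash table. It assumes CRC32 (via Python's `zlib`) as the fast hash and handles collisions by chaining; the class name and bucket count are made up for this example.

```python
import zlib


class TinyHashTable:
    """Toy hash table: a fixed array of buckets, with chaining for collisions."""

    def __init__(self, size: int = 8) -> None:
        self.buckets = [[] for _ in range(size)]

    def _index(self, key: str) -> int:
        # Reduce the key to a number, then to a slot in the array.
        return zlib.crc32(key.encode("utf-8")) % len(self.buckets)

    def put(self, key: str, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key: str):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = TinyHashTable()
table.put("alice", 30)
table.put("bob", 25)
print(table.get("alice"))  # 30
```

Nothing here needs to resist an attacker; the hash only has to be fast and spread keys reasonably evenly across buckets.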

In secure computing, cryptographic hashes are used to digest data into some representative, small bitstring. Cryptographic hash functions have different requirements. They are designed to be difficult to reverse (to be "one way" functions; the term "trap door" is sometimes used loosely for this, but strictly it describes a function that becomes easy to invert given a secret, and a hash has no such secret). Not only that, but an important requirement is that it has to be difficult to find, for a given plaintext and hash value, another plaintext which produces the same hash.

Hashing can be used not only for passwords, but as a checksum for verifying data integrity and as part of the implementation of digital signatures. To digitally sign a large document, we simply have to hash the document to produce a "digest" (a name used for the output of a hashing function, when something very long is hashed). Then just this digest is put through the public key crypto-system to produce a signature. You can see the weakness there: what if an attacker succeeds in producing a document which has the same digest? Then it looks like the original signature produced over the genuine document is actually a signature of a counterfeit document: a signature-transplanting forgery has been effectively perpetrated.
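The checksum use mentioned above might look like the following sketch. It assumes SHA-256 as the digest and covers only the integrity check, not the public-key signing step; the function names and sample messages are invented for illustration.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the data."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """True if the data still hashes to the digest that was published for it."""
    return hmac.compare_digest(digest(data), expected_digest)


original = b"Pay Alice 10 dollars"
published = digest(original)

print(is_intact(original, published))                   # True
print(is_intact(b"Pay Mallory 10 dollars", published))  # False
```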

Password hashing allows systems not to store the plain text version of a password, yet enables them to verify whether the user trying to gain entry knows that password. Not only does hashing allow systems not to store the plain text passwords (which would have to be very carefully guarded) but it allows for the possibility that even if the hashes are publicly exposed, the passwords are still secure (similarly to how public key crypto systems are able to reveal public keys). Though in practice, hashes are nevertheless protected from public access: for instance /etc/shadow files on Unix-like systems, supplementing world-readable /etc/passwd files.

The hashing function is anything but random. However, randomization is employed to thwart attackers who build large dictionaries of passwords and hashes, that enable them to look up a hash code and retrieve the corresponding password.

To hash a password more securely, we can simply add some random bits to it called a "salt". Different salts added to the same password, of course, lead to different hashes (hopefully with few or no collisions).

If the random salt is, say, 32 bits wide, it means that, in theory, one password can hash in over four billion different ways, making it very impractical to have a precomputed dictionary of all possible hashes of a large number of passwords.

Of course, when the user is being authenticated, she does not know anything about this salt. That is okay because the salt is stored along with the hash in the user's profile (often, combined with the hash into a single compact bitstring). When the user's password entry is being validated, the salt is added to whatever password she entered, so that the hashing is carried out with the correct salt. If the password is correct, the hash will match, since the salt being used is the right one also, having been pulled from the user's profile.

So that is how randomness is incorporated into password hashing, while still allowing it to work.
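To make the effect on precomputed dictionaries concrete, here is a small sketch. It assumes plain SHA-256 with the salt prepended to the password, chosen only to keep the example short (a single fast hash like this is not a recommended way to store passwords); the password list and function names are hypothetical.

```python
import hashlib
import os


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Attacker's precomputed dictionary: unsalted hash -> password.
COMMON_PASSWORDS = ["password", "123456", "monkey", "dragon"]
lookup = {sha256_hex(p.encode("utf-8")): p for p in COMMON_PASSWORDS}


def salted_hash(password: str, salt: bytes) -> str:
    # Salt prepended to the password before hashing (an assumed construction).
    return sha256_hex(salt + password.encode("utf-8"))


# Without a salt, a leaked hash is found by a simple lookup.
print(lookup.get(sha256_hex(b"monkey")))  # monkey

# With a random 32-bit salt, the precomputed dictionary no longer matches.
salt = os.urandom(4)
stored = salted_hash("monkey", salt)
print(lookup.get(stored))  # None

# The verifier still succeeds, because it reuses the stored salt.
print(salted_hash("monkey", salt) == stored)  # True
```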

What makes hashes hard to crack is that they are built from "one way" functions: easy to compute, hard to reverse. In mathematics, there are many examples of such things. For instance, simple addition is one way. If we add some integers to produce a sum, it is impossible to recover the original numbers, knowing only the sum.

Password hashes are not encrypted passwords. If an attacker has the hash and salt of a password, and happens to guess the password, then she can easily confirm this, exactly in the same way that the login authenticator software does it: she runs the password plus salt through the hashing function and sees that the correct hash emerges.

Excellent writing skills and a really easy to understand answer that is throughout factually correct, yet tackles all points and retains a natural flow to it that makes it so much more comprehensive. That's not an easy feat, thanks so much for your answer!
– TildalWave, Apr 7 '13 at 5:19

Very informative. You covered all the aspects.
– Shurmajee, Apr 15 '13 at 6:51

One of the keys to hashing is that it throws away information. You can't reverse a hash because the necessary knowledge is gone. Here are a few examples of workable (but pretty worthless) hashing functions. If you give me a password, I could do something like one of the following:

Count the number of vowels

Take the ASCII code for each letter and XOR them all together

Take the CRC32 checksum of the binary representation of the password (this one is actually a real hash, just not a cryptographic one)

In each of these instances, I can't reverse the process. Instead, I have to re-run the process when you give me the password again later to see if the calculation I ran matches.

For example: If you initially give me the password "monkey", I might store the number 3 (3 vowels, counting the "y"). Then, when you later try to authenticate with the password "dragon", I run that same check again and come up with 2, which doesn't match 3. So I know you gave me the wrong password. But if you give me the password "melissa", I would incorrectly assume that you typed in the right password. This is a hash collision.
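The three toy hashes can be written out in a few lines. This is only a sketch of the examples above: the function names are made up, and "y" is assumed to count as a vowel, since that is what makes "monkey" come out as 3.

```python
import zlib


def vowel_count(password: str) -> int:
    # "y" is counted as a vowel so that "monkey" gives 3, as in the example.
    return sum(1 for ch in password.lower() if ch in "aeiouy")


def xor_of_codes(password: str) -> int:
    # XOR the character codes of all letters together.
    result = 0
    for ch in password:
        result ^= ord(ch)
    return result


def crc32_checksum(password: str) -> int:
    # A real (but non-cryptographic) hash of the password's bytes.
    return zlib.crc32(password.encode("utf-8"))


stored = vowel_count("monkey")
print(vowel_count("dragon") == stored)   # False
print(vowel_count("melissa") == stored)  # True
```

The XOR variant collides just as easily: any two passwords with the same letters in a different order give the same value.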

The set of rules you apply to come up with the number that represents a given password is your hash function. High-quality hash functions are designed to limit the number of potential collisions, so that you don't have to worry about that problem. A step further, cryptographic hash functions are designed to make it difficult to come up with a string that might match a given output (and perhaps intentionally create collisions). They also are designed to limit the amount of information you can glean about a given input from just the hash output.

So as a result, the only way to tell what password matches a given cryptographic hash is to try all of the possibilities until you stumble upon one that works. Further countermeasures (salt, PBKDF2, etc.) make this guessing process even harder by making the person guessing the password jump through more hoops for each try.

Note that I completely glossed over how a cryptographic hash function makes it difficult to come up with a working password (even if it's not the original one). Finding such a password is called a "preimage attack". In the trivial example above, coming up with "melissa" as a candidate password containing 3 vowels is an example of such an attack.

Cryptographic hash functions typically do this by running the input through several "rounds" of a given process, where the output of each round becomes part of the input to the next one. To figure out the input of the first round, you'd have to figure out the input of the second round, which in turn requires you to figure out the input of the third round, etc., which means that each guess of each component has to be checked through a long and complex set of computations. Thomas Pornin has a pretty exhaustive explanation of how this resistance works; pretty useful reading if you want to really understand it.

The value of y, which will reduce this to a single variable equation, making solving it for that single variable trivial to any 6th-grader (possibly needing a calculator), is a secret that I have shared only with people I trust. Without it, z could be anything; its value is dependent on y and so it cannot be satisfactorily solved without a constant, known y. If you don't know y's value, it's because I haven't trusted you enough to give it to you in private.

This is the basic principle of cryptography; the mathematical formula or other deterministic process is well-documented, and one or more of the possible variables of the formula are also allowed to be publicly known, allowing the two parties to agree on a way to set up their ciphers so that each can decrypt what the other encrypts. However, two variables remain secret; if you know one, you can discover the other. The one you should know is the key, and the one you can discover with the key is the message.

For a hash, it's a little different. A hash doesn't require one secret to be kept in order to keep another. Instead, hashes work based on an irreversible mathematical transformation; for any H(x) = y, there is no known H^-1(y) = x. Usually, this is because several intermediate results of the equation are ambiguous; for instance, calculating the square root of a positive number technically produces both a positive and negative result, since either number could be multiplied by itself to produce the result. The inverse of a modulus is similarly ambiguous; the number 1, produced by x mod 3, could have been produced by any x = 3k+1. These types of "one-way" transformations are combined in such a way that trying to calculate the inverse hash function generates infinite possibilities; the easiest way to solve them is therefore to simply try every possible input until one output matches. This still takes a long time.

Hashes aren't random. As I previously stated, hashes are the result of an irreversible mathematical operation. That operation must still be deterministic; given a constant input, the output is constant regardless of how many times you perform the operation. There is no random component.
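A quick way to see this determinism for yourself is the following sketch, which assumes SHA-256 from Python's standard library as the hash; the helper name is made up for the example.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = sha256_hex("cat")
second = sha256_hex("cat")

# Same input, same output, every time.
print(first == second)  # True

# A one-letter change gives an unrelated-looking digest.
print(first == sha256_hex("cab"))  # False
```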

Where you might have been confused is in the term for what a hash simulates, which is a random oracle. Picture a black box, inside which is a little man with a photographic memory and some mystical method of generating perfectly random numbers. You write something down on a piece of paper, and push it through a slot where the man gets it. He reads it, and one of two things happens. Either he hasn't read it before, in which case he will generate a new random number and give it to you, committing both your message and the number to his memory. Or, he has read this exact message before, in which case he remembers the number he generated the first time he read it and gives you that number. The random number generator will never generate a number it has already generated, it has infinite possible magnitude, and the little man's memory is unlimited and infallible. Therefore, the little man will never think he's read a message before if he hasn't, never forget he's read a message before, and so will never, ever, produce two different numbers for the exact same message nor the same number for two different messages.

This is what hash functions try to simulate. They can't model this little man with the photographic memory, because that would require infinite storage space and unlimited, universal availability, even to devices that aren't connected to any other device in any other way. Instead, they rely on a deterministic but random-looking calculation that "digests" the message into its hash value. The same hash function, given the same message, will produce the same digest; however, these functions are limited in the number of hash values they are allowed to return. This creates the possibility of what we call hash collisions; there are more possible messages than hash values, so sooner or later (hopefully later), two different messages will produce the same hash.
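The link between a limited digest size and collisions is easy to demonstrate if the digest is made artificially small. The sketch below assumes SHA-256 truncated to its first 16 bits, a deliberately weak stand-in invented for this example: with only 65,536 possible values, a collision must appear within 65,537 distinct messages.

```python
import hashlib
from typing import Tuple


def tiny_hash(message: bytes, bits: int = 16) -> int:
    """SHA-256 truncated to its first `bits` bits (deliberately weak)."""
    full = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return full >> (256 - bits)


def find_collision(bits: int = 16) -> Tuple[bytes, bytes]:
    """Hash the messages b"0", b"1", b"2", ... until two share a digest."""
    seen = {}
    counter = 0
    while True:
        message = str(counter).encode("ascii")
        h = tiny_hash(message, bits)
        if h in seen:
            return seen[h], message
        seen[h] = message
        counter += 1


a, b = find_collision()
print(a != b, tiny_hash(a) == tiny_hash(b))  # True True
```

With a full 256-bit digest the same search is hopeless in practice, which is the whole point of choosing a large output size.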

Hashes can be cracked for three basic reasons. First, because they are a deterministic, mathematical derivation of their message, mathematicians (and thus attackers) eventually find a mathematical relationship between a message and its hash, or between two messages and their resulting hashes. What was once random looking is no longer so. That would allow for a number of attacks based on the nature of the weakness found; if there is an algorithmic way, given a message and its hash, to generate a colliding message, that is a problem. If there is a way to manipulate a message and predict the resulting hash, that is a different problem. If there is in fact a way to reverse the hash, producing a message from the hash that, when re-hashed, produces the same hash, that's a serious problem.

Second, because hashes have a limited digest size, sooner or later, two messages will produce the same hash. That means that an attacker doesn't have to find the message that you use to produce a certain hash; all he has to do is find a message that produces the same hash. The odds of this are slim, theoretically one chance out of however many possible hashes there are, but still better than one in infinity.

Lastly, while there are a lot of possible messages, there are a far smaller number of probable messages. The messages we typically give to hash functions usually have some structure (based on language, subject matter, electronic formatting, and purpose), which means that, given some part of the message, we can more accurately guess other parts of the message. This means, in information science terms, that messages which are converted to hashes often have lower entropy than the hash function itself; plainly put, a hash function that produces 256-bit digests can theoretically produce any combination of those bits, 2^256 possible values. However, if there are, say, only 10,000 possible messages that could ever be input into this hash function by a system being studied for attack, then only 10,000 of the 2^256 possible hash values will ever be seen, and more importantly, an attacker would, worst-case, only have to try all 10,000 possible inputs to find the one that produces the hash value he is looking for.
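A 4-digit PIN is a convenient stand-in for the "only 10,000 possible messages" case. The following sketch assumes unsalted SHA-256 and a hypothetical leaked hash; it simply tries every candidate until one matches.

```python
import hashlib
from typing import Optional


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def crack_pin(target_hash: str) -> Optional[str]:
    """Try all 10,000 four-digit PINs; return the one whose hash matches."""
    for n in range(10_000):
        candidate = f"{n:04d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None  # the input was not a four-digit PIN


leaked = sha256_hex("4821")  # stands in for a hash taken from a database
print(crack_pin(leaked))  # 4821
```

The 256-bit output offers no protection here, because the search runs over the small input space rather than the space of possible digests.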

And this.... is why I love IT security's stack exchange site thing.
– Griffin Nowak, May 12 '13 at 15:08

Also your explanation of #1 is exactly what I needed. However I do have a question. It seems that "hashes" are like number versions for a given thing (passwords in this case). So if I have a website and 100000 people sign up, and 50% use the password "password", am I able to save a ton of space by just storing the hashed value of "password" once instead of storing the password a ton of times?
– Griffin Nowak, May 12 '13 at 15:15

Well if you're using a secure hash (>=256-bit digest size) then storing the hashed value of "password" is going to increase your storage size. In addition, if an attacker were ever to see that 50% of the user accounts had the same password hash, he'd know that all he'd have to do is crack one password and he has access to 50% of the user accounts. You should be "salting" your password hashes; there are a variety of methods, but the end result is that the same password hashed by the same algorithm produces a different digest, because of an additional unique salt value for each account.
– KeithS, May 13 '13 at 14:42