Um no, that’s exactly what people are pointing out. Read it. If you type á you may get [U+00E1] or you may get [U+0061, U+0301]. Or consider the difference between μ and µ (that’s U+03BC and U+00B5 respectively). You will not tell that difference on a keyboard. The AZERTY layout includes µ, want to guess which one that is?

Comparing strings for equality is collation.

Maybe it will work if passwords are always normalized to a certain normalisation form before hashing. Whether to use Compatibility or Canonical form will be a long discussion.

Agreed that normalization is the first problem: You certainly want to choose whether è is written as one character (as on Windows, NFC) or as (roughly) e followed by a combining accent, as on Mac (NFD). I’ll ignore Compatibility (NFC vs NFKC) since I get lost there.
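To make the normalization point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalize_password` and the choice of NFKC as the default are assumptions for illustration, not something the thread settles on.

```python
import unicodedata

def normalize_password(password: str, form: str = "NFKC") -> str:
    """Hypothetical helper: normalize before the password is hashed."""
    return unicodedata.normalize(form, password)

composed = "\u00e1"      # á as a single code point
decomposed = "a\u0301"   # a followed by a combining acute accent
micro = "\u00b5"         # µ MICRO SIGN
mu = "\u03bc"            # μ GREEK SMALL LETTER MU

# Without normalization the two spellings of á are different strings.
print(composed == decomposed)
# Canonical normalization (NFC) makes them equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
# NFC leaves the micro sign alone; only a compatibility form folds it into mu.
print(unicodedata.normalize("NFC", micro) == mu)
print(normalize_password(micro) == mu)
```

Whichever form is chosen, the same form has to be applied both when the password is set and on every later login, or the hashes will not match.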

Then, when we advocate for Unicode passwords, we’re talking of websites with crap password rules (hence bad developers, or good developers hamstrung by crappy processes or managers), and we’re somehow expecting them to handle Unicode correctly! And they need to do it across your clients (including broken ones), server-side software and databases (which are likely in PHP or Python 2, so they suck at UTF-8). You also want to normalize Unicode, and this can magnify hidden encoding bugs.

Or consider the difference between μ and µ (that’s U+03BC and U+00B5 respectively). You will not tell that difference on a keyboard. The AZERTY layout includes µ, want to guess which one that is?

I think one thing that’s not been explored much that I’m aware of is rather than a bunch of weird rules, including length, it might be of value to make sure that not too many people have the same password, which should be easily implemented with unsalted hashes. Maybe it’s the unsalted hashes that make that unpalatable?

No offense Jeff, but you are behind the times. You should do some research on the state of the art in password cracking. Password = “correct horse battery staple”. Cracked. Password = “Through 20 years of effort, we’ve successfully trained everyone to use passwords that are hard for humans to remember, but easy for computers to guess”. Cracked. Any sentence written in this Discourse discussion used for a password. Cracked.

See, some time ago, password crackers started polling all books in existence and many websites and forums for phrases including combinations of nonsensical words. Using a long password but one made of words will no longer protect you.

What is really needed is a second factor and/or a separate authentication store (such as your email provider, Google, etc.). Two-factor is why we can have four digit pins with Debit cards and still be relatively safe. If you add a second factor, the problems of password complexity go away. Delegate authentication to another provider (email, Google etc.), and again, (your) password complexity requirements go away.

As does often happen in software development, the best solution is to not build something new and instead rethink the problem.

See, some time ago, password crackers started polling all books in existence and many websites and forums for phrases including combinations of nonsensical words. Using a long password but one made of words will no longer protect you

He typically uses a dictionary file containing about 26 million words, combined with programming rules that greatly extend its effectiveness by adding numbers, punctuation, and other characters to each list entry. Depending on the job, he sometimes uses a 60 million-strong word list and something known as “rainbow tables,” which are described later in this article.
…
Gone were word lists compiled from Webster’s and other dictionaries that were then modified in hopes of mimicking the words people actually used to access their e-mail and other online services. In their place went a single collection of letters, numbers, and symbols—including everything from pet names to cartoon characters—that would seed future password attacks.

As big as the word lists that all three crackers in this article wielded—close to 1 billion strong in the case of Gosney and Steube
…
The specific type of hybrid attack that cracked that password is known as a combinator attack. It combines each word in a dictionary with every other word in the dictionary. Because these attacks are capable of generating a huge number of guesses—the square of the number of words in the dict—crackers often work with smaller word lists or simply terminate a run in progress once things start slowing down. Other times, they combine words from one big dictionary with words from a smaller one. Steube was able to crack “momof3g8kids” because he had “momof3g” in his 111 million dict and “8kids” in a smaller dict.
…
As Ars explained recently, the problem with password strength meters found on many websites is they use the total number of combinations required in a brute-force crack to gauge a password’s strength. What the meters fail to account for is that the patterns people employ to make their passwords memorable frequently lead to passcodes that are highly susceptible to much more efficient types of attacks.

oh… yeah… ok. I’ve actually done that myself. I had a client whose
DNS server crashed and I needed to change their DNS settings, but the lady
in charge of the domain settings wasn’t available, so we couldn’t log into
the domain provider to change the DNS. I ended up going into the company’s
old intranet system, pulling her account’s MD5 “encrypted” password and
putting it into a variety of MD5 decoders. And, surprise surprise, she
uses the same password. We got right in and changed it.

Since then, this client actually outsourced their system to some group of
independent foreign “programmers” (who seem to all be teenagers). (I told
the client of the danger and then cancelled our contract to avoid any legal
repercussions.)

See, some time ago, password crackers started polling all books in existence and many websites and forums for phrases including combinations of nonsensical words

And yet your “examples” say…

dictionary file containing about 26 million words, combined with programming rules that greatly extend its effectiveness by adding numbers, punctuation, and other characters

and

It combines each word in a dictionary with every other word in the dictionary

That’s a total of two words combined and some mutations, absolutely not

polling all books in existence and many websites and forums for phrases including combinations of nonsensical words

Combining two words and mutating seems reasonable, let’s say most people have 20k words they regularly use, 20k times 20k is 400 million, but a damn far cry from “every phrase in every book ever written”, like you said.
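For reference, a combinator attack as described in the quoted article can be sketched in a few lines. This is an illustration only: the word lists are tiny made-up stand-ins, and the function name `combinator` is an assumption rather than any cracking tool's API.

```python
from itertools import product
from typing import Iterable, Iterator

def combinator(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Yield every word from `left` joined to every word from `right`."""
    for first, second in product(left, right):
        yield first + second

# Hypothetical stand-ins for a big dictionary and a smaller one.
big_dict = ["momof3g", "correct", "horse"]
small_dict = ["8kids", "battery", "staple"]

candidates = list(combinator(big_dict, small_dict))
print(len(candidates))               # len(big_dict) * len(small_dict)
print("momof3g8kids" in candidates)  # the example from the article

# Size of the guess space for a 20k-word vocabulary combined with itself.
print(f"{20_000 ** 2:,}")
```

The guess count is the product of the two list sizes, which is why the article mentions pairing one big dictionary with a smaller one.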

Plucking long word groupings out of books and articles and turning them into working cracking dictionaries is no trivial undertaking. For one thing, it requires huge amounts of disk space. Dustin works around the challenge mostly by filling up his 1TB hard drive with a list, using it to generate guesses against his uncracked hashes, wiping the drive clean, and starting all over with a new list of phrases.

One of the highlighted Ars commenters at the bottom of that article answered your question, too:

So is “correct horse battery staple” still the right type of password to use? It doesn’t seem like just stringing together a few dictionary words is sufficient any more. Surely putting together random, but common, dictionary words is in the cracker’s arsenal as well.

There are ~750,000 words in the english language. Even without substitutions, capitalizations, or weird spacing, that represents about 10^23 combinations if you picked 4 at random. You could test a billion combinations a second and finish sometime in the next 4 million years. But you said common words…

Average adult vocabulary is 20,000-35,000 words. Let’s assume that people who voluntarily test their vocabulary are probably on the high end of the bell curve in terms of word usage, and cut that low number in half. That leaves us with 10,000 words, and 10^16 ways to combine them (again if we picked just 4 at random to make our random passphrase). Generating a million hashes per second (pretty damn fast), it would take our cracker about 120 days to go through the combinations, and consume 284PB if he decides to store it as a lookup table. And that’s just from choosing 4 random commonly used words. If you went to 5, or decided to capitalize the first and last letters, or the first letter of every word, or put a random space in there, or included a “word” made up from the first letters of all the other words (i.e., “correct horse battery staple chbs”)…well the numbers get astronomical very quickly.

The commenter was lowballing a hell of a lot on that hashes per second figure, though. Per the GRC haystack page:

Offline fast attack: 100 billion guesses per second

Nation state: 100 trillion guesses per second

One million guesses per second is pretty… quaint by today’s standards.

1,000,000 one million
1,000,000,000 one billion

So what he calls 120 days at that one million hashes/sec rate, let’s reduce by 1000 to 2.88 hours (120 days is 2,880 hours); that seems pretty realistic on today’s hardware.
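The arithmetic is easy to check with a short sketch. It assumes an exhaustive search over 4 words picked at random from a 10,000-word vocabulary (the commenter's figures); the function name `days_to_exhaust` is made up for illustration. Worth noting: by this calculation, roughly 120 days corresponds to about a billion guesses per second, not a million.

```python
import math

SECONDS_PER_DAY = 86_400

def days_to_exhaust(vocabulary: int, words: int, guesses_per_second: float) -> float:
    """Days needed to try every passphrase of `words` words from `vocabulary`."""
    return vocabulary ** words / guesses_per_second / SECONDS_PER_DAY

# Entropy of the passphrase in bits, if the words really are chosen at random.
print(f"{math.log2(10_000 ** 4):.1f} bits")

rates = [
    ("one million/s", 1e6),
    ("one billion/s", 1e9),
    ("100 billion/s (GRC offline fast attack)", 1e11),
]
for label, rate in rates:
    print(f"{label}: {days_to_exhaust(10_000, 4, rate):,.2f} days")
```

These are worst-case times to cover the whole space; on average an attacker finds a given passphrase after about half of it.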

Also, consider the weight of number of hashes (passwords) you have. It seems reasonable that you’d beat about 50% of them with short passwords alone (assuming 8 char average password, which is nothing these days), common wordlists, common mutations, and a little brute force.

But those are small lists. You would go from:

each password has a few billion possible words + mutations

to

each password has hundreds of trillions of words, phrases, and mutations

So even if you had that 1TB hard drive full of custom phrases derived from… books? magazines? movies? TV? what’s the target again? you have expanded the effort of work by many many orders of magnitude.

You are thinking like a person that wrote an algorithm instead of a code breaker. I would suggest reading https://www.amazon.com/Codebreakers-Comprehensive-History-Communication-Internet/dp/0684831309. It can be dry and it’s long and very detailed but at the end one of the many lessons it teaches is that history is replete with algorithms that failed because of exactly the hubris you displayed in your response. “They’ll never crack my fancy algorithm because the sheer number of combinations…”

The missing variable in your equation is human habit. You missed the part about machine learning, leveraging past password cracks and general language structure. Humans have tendencies that vastly reduce that address space. For example, our tendency would be to use a clever phrase or something from a movie or book or website (e.g. ‘correct horse battery staple’) or to use a sentence that has structure. E.g., in ‘correct horse battery staple’, I noticed it only contains words that are 5-7 letters of which 3/4 are nouns and the first is a verb. It wouldn’t surprise me if other people using this approach showed a similar habit.

20k times 20k is 400 million, but a damn far cry from “every phrase in every book ever written”, like you said.

Again, I’ll point out that those posts are from four to five years ago. The computing power and storage capabilities are much greater. In addition, in the second link, only one year after the quoted 26MM word file, you will notice they used one that contained one billion words.

IMHO, you are swimming upstream with human generated passwords. There is simply too much computing power, too much motivation and the machine learning tools are too good to think they cannot rip through your all-lowercase, all-English-word passwords in weeks if not days or hours.

However, how do we test this hypothesis? Much like new encryption algorithms, the only way to know for sure is to let a group of professional crackers that compete at black hat conferences have a go at your password algorithm. Only then will you know how secure your approach is.

It’s an interesting discussion about password length, complexity, entropy, etc., and there are real enthusiasts out there who crack passwords as a hobby, but I think an important point is being missed. At the end of the day, discussions around password complexity exist because of security. If I’m an attacker, I want your password for one of two main reasons: 1) I want to authenticate as you to some system. 2) I want to test for password reuse so that I can authenticate as you to some other system. I don’t particularly care how I do it. With Windows maybe I pass the hash. With web apps maybe I steal your session token, or backdoor the login page, or crack your password. As an attacker, as long as I can access the objective data under your security context, I have won. With this in mind, I would argue that it’s more realistic to strive to make password cracking so difficult that the attacker has to resort to other, “risky” attacks. By risky I mean more likely to be caught. In forcing the attacker to take risks, the hope is that you catch them, understand the scope of the attacker’s access, and remove that access. Afterwards you would reset passwords. So the hope, I would argue, is that passwords are resilient enough to not be cracked within the amount of time that entire process takes.

Secondly, the whole discussion around computing power is relevant with unsalted MD5, but with slower algorithms this completely changes the discussion. Consider Jeff’s new post. 1,600 hashes per second on the latest and greatest hardware is pretty darn slow.
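As a rough sketch of what "slow and salted" looks like in practice, here is one way to do it with PBKDF2 from Python's standard library. The iteration count, salt length, and function names are assumptions for illustration; they are not taken from Jeff's post.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor; tune to your own hardware

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key

def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Recompute the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)

salt, key = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, key))
print(verify_password("correct horse battery stapler", salt, key))
```

The per-password salt means identical passwords produce different stored values, and the iteration count is what pushes an attacker's guess rate down from billions per second.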

On top of using a slow & salted hash there is still your point about humans that choose bad passwords. Using a blacklist, even a small one, I think can be effective. All you really need is to discourage the top X% of weak passwords. Those would be the “low hanging fruit” passwords that an attacker would first crack. If an attacker spends 5 days cracking, with little to show for it, I am more than willing to bet they will switch to a different attack.
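A blacklist check along these lines can be very small. The sketch below assumes a tiny hard-coded list purely for illustration; a real deployment would load a much larger list of known common passwords, and the name `is_blacklisted` is hypothetical.

```python
# Hypothetical tiny sample; a real list would have thousands of entries.
COMMON_PASSWORDS = frozenset({
    "password", "123456", "qwerty", "letmein", "iloveyou",
})

def is_blacklisted(password: str) -> bool:
    """Reject a password if it matches a known-common one, ignoring case."""
    return password.casefold() in COMMON_PASSWORDS

print(is_blacklisted("Password"))
print(is_blacklisted("correct horse battery staple"))
```

The check runs on the plaintext at signup or password change, before hashing, so it works the same no matter how slow the hash is.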

Well, I just want to share my strategy, cause I didn’t see it mentioned.
I have 3 levels of passwords.

unsecure - can be distributed to friends without harm, like the password to a coupon site… No one will buy me coupons, even if they hack it. Yet my friends can benefit from a good deal without me exposing my stronger passwords. This is something quick to type and tell to others.

medium - something that requires some caution, like online gambling sites, utilities, etc. I generally don’t care if these got hacked, as credit cards are not stored here, but could hurt me a little if abused.

hard - email, dropbox, banking, paypal. all with 2 factor auth

This way I only have to remember 1 hard password, and even if that’s stolen, I’ve got good old 2 factor auth there. It’s not optimal, but human capable. So please, site hosters, don’t tell me how strong my password should be. I don’t care if my Discourse account is hacked. So what? Someone will shitpost in my name. Boom. I want quick access, and to NOT be forced to use my hard password, cause I will never remember 23423423 different passwords for the 2342345 different sites I use. Let me choose my security level.

I had a bit of a sad when I realized that we were perfectly fine with users selecting a 10 character password that was literally “aaaaaaaaaa”. In my opinion, the simplest way to do this is to ensure that there are at least (x) unique characters out of (y) total characters.

It’s worth it to consider how the expected entropy will change by adding this rule. I would imagine it would remove a lot of search space right off the bat, esp. if your magic number were known. Consider: ‘aardvarkafrikaansbazaar’.
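For what it's worth, the unique-character rule itself is a one-liner. In this sketch the threshold of 6 is an arbitrary assumption standing in for (x), and the function names are made up for illustration.

```python
MIN_UNIQUE_CHARS = 6  # assumed value for (x); pick your own

def unique_char_count(password: str) -> int:
    """Number of distinct characters in the password."""
    return len(set(password))

def has_enough_unique_chars(password: str, minimum: int = MIN_UNIQUE_CHARS) -> bool:
    return unique_char_count(password) >= minimum

for candidate in ["aaaaaaaaaa", "aardvarkafrikaansbazaar"]:
    print(candidate, len(candidate), unique_char_count(candidate),
          has_enough_unique_chars(candidate))
```

Printing the length next to the distinct-character count shows how far apart the two can be for a long password built from a-heavy words.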