Imagine I create a passphrase using random dictionary words but a common grammatical structure (e.g. article, noun, verb, adjective, noun). Given a pretty small dictionary of 5000 words, how vulnerable are such phrases compared to:

Are there known ways to attack such passphrases beyond checking for common words first in a brute force attack? Can you think of a better way than this?

How effective would Markov chains or n-grams be, based on large word corpora (e.g. Google Web or Brown)? My gut feel is the 'nonsense' nature of the phrases would mitigate many common chains of words, but is there any research or off-the-shelf cracker I could use to verify that?
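For concreteness, this is the rough kind of word-level scoring I imagine such an attack doing. The bigram counts below are invented purely for illustration; a real attack would use counts from a corpus such as Google Web or Brown.

```python
# Sketch of a word-level bigram (Markov) scorer; counts are made up for illustration.
import math

BIGRAM_COUNTS = {            # hypothetical corpus counts
    ("the", "cat"): 50000,
    ("cat", "sat"): 12000,
    ("the", "harpoon"): 40,
    ("harpoon", "might"): 1,
}
TOTAL_BIGRAMS = 1_000_000    # hypothetical corpus size

def log_likelihood(phrase):
    """Sum of log2-probabilities of adjacent word pairs, with add-one smoothing."""
    words = phrase.lower().split()
    score = 0.0
    for pair in zip(words, words[1:]):
        score += math.log2((BIGRAM_COUNTS.get(pair, 0) + 1) / TOTAL_BIGRAMS)
    return score

print(log_likelihood("the cat sat"))        # common chain: comparatively high score
print(log_likelihood("the harpoon might"))  # nonsense chain: much lower score
```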

Full disclosure: I'm developing a passphrase generator which uses this grammatical technique to assist memorisation, but still keeping the words used random. I'm interested in some critique of my algorithm, while still fitting within the Q&A style of Stack Overflow.
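For reference, a minimal sketch of the generation technique. The word lists and template below are placeholder assumptions; the real dictionary would hold roughly 5000 words split by part of speech.

```python
# Minimal sketch of a grammatical-template passphrase generator.
import secrets

ARTICLES = ["a", "the", "my", "our"]                     # hypothetical sub-dictionaries
NOUNS = ["harpoon", "ratio", "scissors", "staff"]
VERBS = ["carry", "follow", "paint", "hold"]
ADJECTIVES = ["watery", "expert", "random", "quiet"]

# article, noun, verb, adjective, noun
TEMPLATE = [ARTICLES, NOUNS, VERBS, ADJECTIVES, NOUNS]

def generate_passphrase():
    """Pick one word uniformly at random from each slot of the template."""
    return " ".join(secrets.choice(word_list) for word_list in TEMPLATE)

print(generate_passphrase())  # e.g. "the harpoon carry watery ratio"
```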

1 Answer

The major finding of the paper you linked is that grammatical sentences do not have the entropy you'd expect from sentences of that length. A reasonable response to that would be simply not to count the words you add to make the sentence 'correct'; taking one of your examples, 'a harpoon might expertly staff our ratios' should be considered about as strong as 'harpoon expertly staff ratios'.
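As a rough way to quantify that, here is a minimal sketch that credits only the independently chosen content words, assuming each one is a uniform pick from a 5000-word dictionary:

```python
# Count entropy only for the content words; filler words contribute nothing.
import math

def content_word_entropy(num_content_words, dictionary_size=5000):
    """Entropy in bits if each content word is a uniform, independent pick."""
    return num_content_words * math.log2(dictionary_size)

# "a harpoon might expertly staff our ratios" -> credit only
# harpoon / expertly / staff / ratios (4 random picks):
print(content_word_entropy(4))   # ~49 bits
```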

If you're looking for maximum password strength, you should also avoid word orderings that follow grammatical rules. From the paper:

Grammatical structures, or tag-rules, split the password search space unevenly; the size of the search space of individual tag-rules are different e.g. the size of "Noun Noun" is greater than the size of "Adjective Noun".

This implies that you should avoid 'adjective noun' if you want to maximise your entropy. That holds even if your algorithm generates the pairs randomly, because the structure effectively compresses the password. It's also worth noting that the paper's attack didn't limit itself to pairs that make sense together ('watery scissors' would still have been tried).
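To illustrate the uneven split, a small back-of-the-envelope sketch; the part-of-speech counts are invented purely for illustration.

```python
# Why tag-rules split the search space unevenly: the template that draws from
# larger sub-dictionaries covers more combinations.
import math

NUM_NOUNS = 3000        # hypothetical share of a 5000-word dictionary
NUM_ADJECTIVES = 1000   # hypothetical

noun_noun = NUM_NOUNS * NUM_NOUNS
adjective_noun = NUM_ADJECTIVES * NUM_NOUNS

print(f"Noun Noun:      {noun_noun:,} combinations (~{math.log2(noun_noun):.1f} bits)")
print(f"Adjective Noun: {adjective_noun:,} combinations (~{math.log2(adjective_noun):.1f} bits)")
```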

The paper's approach of trying common words first shouldn't hurt randomly generated (but approximately grammatical) sentences much; you can always design your dictionary so that words are selected uniformly, regardless of their frequency in natural language. The attacker's real gain is not so much trying common words first, but trying groups of words that are often seen together, together.

There has also been some research on the Markov model you suggest; it may be worth running that sort of analysis over the passphrases you generate before presenting them to the user.
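A minimal sketch of that last suggestion, assuming you have some generate() function (such as a template generator) and a score() function that returns higher values for word sequences that look more natural; the threshold is an arbitrary placeholder.

```python
# Reject candidates that look like common word sequences under a Markov/n-gram model.
def generate_uncommon_passphrase(generate, score, threshold=-12.0, max_tries=100):
    """Regenerate until the candidate scores below the 'too natural' threshold."""
    for _ in range(max_tries):
        candidate = generate()
        if score(candidate) < threshold:
            return candidate
    raise RuntimeError("could not generate an uncommon-looking passphrase")
```

Note that rejecting natural-looking candidates shrinks the output space slightly, so it trades a little entropy for resistance to this particular style of attack.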

Thanks for the comments, Bob. I'm looking to quantify the loss of entropy with grammatical structures, but I'll make a separate question so I have enough space to express my logic / math for that. I'll also look into Google's NGrams as a sample Markov style attack, which may also mitigate against words commonly seen together.
– ligos Feb 14 '13 at 6:30