Limbic~Region has asked for the
wisdom of the Perl Monks concerning the following question:

All,
A friend recently shared a puzzle that I thought you might enjoy.

You and 7 friends are going to a costume party dressed as letters of the alphabet. Which 8 letters should you choose in order to be able to form the most words?

I asked a few clarifying questions:

Does capitalization/punctuation matter? - No

Can more than one person be the same letter? - Yes

Are words with fewer than 8 letters allowed? - Yes

Are words with more than 8 letters allowed (marquee style)? - No

Is there an official dictionary? - No

Your challenge, if you choose to accept it, is to solve the puzzle (provide code, a link to the dictionary you used, the 8-letter solution, and the number of words that can be made). If you don't have a dictionary, I tend to use the 2of12inf from the Official 12Dicts Package.

First you get the words each letter appears in (regardless of how many times it appears). Then, you consider the top 8 letters based on how many words contain the letter but only allow each letter to appear once.
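
For illustration, that heuristic might look something like this in Perl - a sketch only, with the filename as a placeholder:

    use strict;
    use warnings;

    # Count how many words each letter appears in (once per word),
    # then take the 8 letters with the highest counts.
    my %words_containing;
    open my $fh, '<', '2of12inf.txt' or die $!;
    while ( my $word = <$fh> ) {
        chomp $word;
        $word = lc $word;
        next unless $word =~ /\A[a-z]{1,8}\z/;
        my %seen;
        $words_containing{$_}++ for grep { !$seen{$_}++ } split //, $word;
    }
    close $fh;

    my @top8 = ( sort { $words_containing{$b} <=> $words_containing{$a} }
                 keys %words_containing )[ 0 .. 7 ];
    print "Top 8 letters by word coverage: @top8\n";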

Your solution would probably be a lot faster if you also threw out any word from the dictionary that had a repeated letter, since I can't see how you would ever consider them.
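
If you did want that filter, a backreference spots any repeated letter - a sketch, assuming @words holds the word list:

    # keep only words in which no letter occurs twice
    my @no_repeats = grep { !/(.).*\1/ } @words;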

While I'm looking at the solutions presented here, I tried to understand your solution, knowing that you often have a different and interesting way of looking at a problem. I made a little test-case dictionary:

OK, this uses some fairly naive heuristics to narrow down the search field. I'm sure somebody can do better, though I'd be fairly surprised if anybody came up with an answer that differed from mine by more than 2 letters.

The dictionary used was the standard /usr/share/dict/words supplied with Ubuntu 13.04.

The best set of letters it found was "aeinprst" which could make 426 words. (The list appears to have some duplicates because some words appear in /usr/share/dict/words more than once with different cases.)

In the right typeface, if "a", "n" and "p" can do handstands, you have (another) "e", "u" and "d" available, so it's really more than 426. Let "a" double for "e", and add "h", which can also be "y", and this could get weird.

Edit: if "p" makes the costume right, they can also be "b", "d" or "q". As far as I can tell, this doesn't violate the rules above.

All,
I have an idea for an exhaustive solution. I haven't convinced myself it will work yet, but it will require multiple phases, and at least one of the phases will need to be written in C. I will update when I get a chance to play with it, even if it turns out not to be viable.

Update: OK, I finally have something running that appears to work, though I won't know for some time if it is correct. I ended up using 2 pieces of code: a short Perl script and a C program that I will share despite the fact that it is probably horrible.

Perl script: used to filter the dictionary as well as to determine the maximum number of times each letter is repeated in a word.
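
The posted script isn't reproduced here, but a filter along those lines might look roughly like this (a sketch; filenames are placeholders):

    use strict;
    use warnings;

    # Keep only 1-8 letter lowercase words, and track the maximum
    # number of times each letter repeats within any single word.
    my %max_repeat;
    open my $in,  '<', '2of12inf.txt' or die $!;
    open my $out, '>', 'filtered.txt' or die $!;
    while ( my $word = <$in> ) {
        chomp $word;
        $word = lc $word;
        next unless $word =~ /\A[a-z]{1,8}\z/;
        print {$out} "$word\n";
        my %count;
        $count{$_}++ for split //, $word;
        while ( my ( $letter, $n ) = each %count ) {
            $max_repeat{$letter} = $n if $n > ( $max_repeat{$letter} // 0 );
        }
    }
    close $_ for $in, $out;
    print "$_ repeats at most $max_repeat{$_} times in a word\n"
        for sort keys %max_repeat;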

I am sure there is a much better way of doing this. I will need to sustain over 5 million loops per second to meet my goal of finishing within 24 hours. I kicked it off at 10:28 PM EST. I will update again when it is completed.

Update 2: It only took 12 minutes to run, which worries me, but here is the output (winner at the bottom):

aabdellu can make 55 words
abellrsu can make 118 words
aeikprrs can make 131 words
ahopstuw can make 136 words
...
acehprst can make 288 words
adeinrst can make 313 words
adeoprst can make 316 words
aeilprst can make 344 words
aeinprst can make 346 words
Finished

Re-corrected the off-by-one error, which was due to an empty line in the small test dictionaries I used.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

You are right: because I check against all 8-character words, 'aaaaaa' has only six letters and is therefore not considered. To be absolutely certain, one would have to check all combinations of 8 characters, i.e. check the whole dictionary against each of 26^8 = 208,827,064,576 combinations.
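
For anyone following along, the core test such an exhaustive check needs - can a word be spelled from a given multiset of letters? - might be sketched like this:

    # true if $word can be spelled from the letters in $set,
    # e.g. fits('pain', 'aeinprst') but not fits('apple', 'aeinprst')
    sub fits {
        my ( $word, $set ) = @_;
        my %pool;
        $pool{$_}++ for split //, $set;
        for my $c ( split //, $word ) {
            return 0 unless $pool{$c};    # none of that letter left
            $pool{$c}--;
        }
        return 1;
    }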

So my solution answers the question with one extra condition: that the 8 characters together form a word too.

And I seem to have an "off-by-one error" somewhere.

CountZero


CountZero,
What was the "off-by-one" error? Using the same dictionary file, I get aeinprst = 346, which I have manually verified. This isn't brute force (as you have indicated elsewhere) - it is heuristic, but it appears to come up with the correct result. While I like it, I needed to know for certain, which is why I wrote an exhaustive solution which amazingly finished in 12 minutes.

Well, the off-by-one error I assumed was in my code was actually in the small test dictionaries I ran, because of a spurious empty line at the end.

It is quite funny that my brute force code finds the right solution even without going through all possible combinations. I guess I was just lucky.

CountZero


I tried coming at this from another direction. Instead of looping through the possible letter combinations, I started with the word list and an empty array of "stacks" of words that could fit into the same set of 8 letters. For each word in the list, I checked it against each "stack," adding it to any stacks it could fit into, and starting a new stack of its own.

The first element in each stack is a list of the letters required to make all the words in that stack. So I'm able to compare against this one "word," rather than all the words in the stack each time. I could save some memory, and maybe some CPU, by not saving the words at all, just the number that fit in the stack, but I wanted to keep track of the words so I could verify that it's working right.

For each stack (sub-array), I do a quick check to see if the new word added to this stack would obviously require more than 8 letters, so it can bail out quickly if that's the case. Otherwise it does a more complex check to see what set of letters would be required to add this word to the stack, and does so if it fits.

In the end, it picks the largest stack. I ran it against the file 2of12.txt, skipping hyphenated words, lowercasing everything, and removing duplicates, and it took almost exactly one hour on my system to process the remaining 21,664 2-8 letter words. I'm rerunning it now against 2of12inf.txt, which I expect will take at least four times as long, maybe more. In the meantime, here's my code and results. I think there are probably some pretty big inefficiencies here, but I hope the method is interesting, at least.
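
The core of the stack idea might be condensed like this (a sketch with hypothetical names, not the posted code; assumes @words holds the filtered list):

    my @stacks;    # each stack: [ letter signature, words... ]
    for my $word (@words) {
        for my $stack (@stacks) {
            my $merged = merge_letters( $stack->[0], $word );
            next unless defined $merged;   # would need more than 8 letters
            $stack->[0] = $merged;
            push @$stack, $word;
        }
        # every word also seeds a stack of its own
        push @stacks, [ join( '', sort split //, $word ), $word ];
    }
    my ($best) = sort { $#$b <=> $#$a } @stacks;
    printf "%s fits %d words\n", $best->[0], $#$best;

    # Merge a stack's letter multiset with a word's letters;
    # returns undef if the combined requirement exceeds 8 letters.
    sub merge_letters {
        my ( $sig, $word ) = @_;
        my %need;
        $need{$_}++ for split //, $sig;
        my %have;
        $have{$_}++ for split //, $word;
        for my $c ( keys %have ) {
            $need{$c} = $have{$c} if $have{$c} > ( $need{$c} // 0 );
        }
        my $total = 0;
        $total += $_ for values %need;
        return if $total > 8;
        return join '', map { $_ x $need{$_} } sort keys %need;
    }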

aaron_baugher,
I considered a bucket approach as well - but not quite the way you did it. What I was going to do was bucket all words of the same length and then, at the end, roll up 2-letter words into 3-letter words, 3-letter words into 4-letter words, etc. I realized this won't guarantee the best solution because of what I have explained elsewhere in this thread (taking some words from one group and some from another can lead to the best solution).

I ran your code against 2of12inf.txt (actually, the file I had already reduced). On my machine it took 2 hours and 3 minutes. Here is the output:

Yes, I had it dump the complete array to a file the last time, and the winning set "painters" only gets 292 words from my script. I had a feeling it could miss some somehow, but I can't quite figure out how yet.

Update: I see the problem now. Let's say "painters" is the 100th word, and "at" is word #2 and "not" is word #3. "not" gets added to bucket #2 along with "at" because it can fit, but now "painters" can't go into that bucket. And since "at" has already been placed in bucket #2 and possibly bucket #1, it can't be placed in bucket #100 with "painters" or in any of the buckets in between that "painters" might be in.

So to make my approach work, I guess once all the words have been placed in at least one bucket, I would have to go back through them again, retrying them against each bucket past the one they started in. And even then I'm not sure you'd be guaranteed to get a bucket with the best possible combination.

I had a nagging feeling there was a problem like that, but I needed a night's sleep to see it.

Reads the dictionary and sorts the words in descending order of length, i.e. the 8-letter words first, then the 7-letter words, etc. It also sorts alphabetically, but that is not required.

It then goes through the list of words and builds classes, or sets, of words indexed by their common letters. More specifically, for each word it checks all classes created so far for one having all the letters the word needs.

If yes, it adds the word to every existing class where it fits. It is important to add it not to just one class but to all fitting classes.

If not, it creates a new one.

For this, it is important that long words are processed before short words.

Whether a word fits is determined by removing each letter of the new word from the letters of the class. If this is possible for every letter of the new word, then it fits and is added to the class.

After processing all words, it sorts the classes by number of members descending and prints them in that order.

It ran in 34 minutes. I'll be grateful for any hints on performance optimization.
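
A condensed sketch of that class-building loop (hypothetical names, not the posted code; assumes @words is already sorted longest first):

    my %class;    # letter signature => array ref of member words
    for my $word (@words) {
        my $found = 0;
        for my $sig ( keys %class ) {
            next unless fits( $word, $sig );
            push @{ $class{$sig} }, $word;
            $found = 1;
        }
        # no existing class can spell this word, so it starts a new one
        $class{ join '', sort split //, $word } = [$word] unless $found;
    }
    my ($best) = sort { @{ $class{$b} } <=> @{ $class{$a} } } keys %class;
    printf "%s: %d words\n", $best, scalar @{ $class{$best} };

    # the fit test: every letter of $word must be removable
    # from the letters of $sig
    sub fits {
        my ( $word, $sig ) = @_;
        my %pool;
        $pool{$_}++ for split //, $sig;
        for my $c ( split //, $word ) {
            return 0 unless $pool{$c};
            $pool{$c}--;
        }
        return 1;
    }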

Update: As I realized when reading Re^5: Challenge: 8 Letters, Most Words, this solution will not be able to put two four-letter words into one class, irrespective of the letters. So in some situations it will not find the best solution.

BrowserUk,
Can you provide a link to your dictionary or, better yet, run your program against the dictionary I provided in the root of this thread? I am planning on writing a truly brute force solution and would like to have something to base it off of.

Hm. L~R: I've sat on these results for an hour now (it takes about 2 hours to run) whilst I looked for errors. I haven't seen anything obvious.

But, you should probably take them with a pinch of salt. I see the possibility for something wrong in these results, but it is too late, and takes too long to verify them tonight. I'll attempt verification tomorrow.

However, the results are interesting enough that I felt you might like to see them. The following two sets of results are from my (179k) dictionary and the (81k) 2of12inf.txt. These results do allow for sets of 8 letters that include duplicates, though none are in the top 50.

What is intriguing (and worrying) is that the word counts are identical, but the charsets are not!

OK, I started with 2of12full.txt, removed anything with a capital letter (I'm assuming those are abbreviations or dupes) or non-alphabet characters (hyphenations, abbreviations, etc.), and removed all words of fewer than 2 characters or more than 8. I then ran this:

This should be 100% accurate with any given word list (of a-z only), though it does take a little while to run, and if anyone can think of a way to speed it up without compromising accuracy, by all means do so.

Capitalization may not matter, but dupes do, and there are often multiple versions of the same word. I suppose I could have just lowercased everything and then deduped. The word list is largely irrelevant, though - the important thing is the algorithm.
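
Lowercasing and deduping is a one-pass job - a sketch, assuming @raw holds the unprocessed list:

    my %seen;
    my @clean = grep { !$seen{$_}++ } map { lc } @raw;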

What do you mean by brute force, exactly? There are far too many possible letter combinations to just check each one.

The '2of12inf.txt' already has all these filtered out, but some words have a '%' added (I do not know why, but you will have to delete the '%').

CountZero


I get 670 with aeilnrst using /usr/share/dict/words and 346 with aeinprst using 2of12inf, but I'd like to expand on this. All the solutions presented assume that the letters should be able to form at least one 8-letter word. If, however, I assume that a 7-letter word with a random extra letter can cover more of the shorter words, the number of possibilities explodes.
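
Enumerating those candidates is cheap even if scoring them is not - a sketch, assuming @words holds the trimmed dictionary:

    # every 7-letter word's letters plus each of the 26 possible extras
    my %candidate;
    for my $word ( grep { length($_) == 7 } @words ) {
        my $sig = join '', sort split //, $word;
        $candidate{ join '', sort split //, $sig . $_ }++ for 'a' .. 'z';
    }
    printf "%d candidate 8-letter sets to score\n", scalar keys %candidate;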

Grab all words of eight characters or less from the dictionary, as that's all we care about.

Determine the letter frequency from the trimmed dictionary.

Split the list of words into those with eight characters, and those with fewer.

Create buckets for each unique combination of letters that represents eight-letter words. In cases where the same combination repeats, just increment the counter for that bucket.

Insert all of the buckets into a database. Each row is a bucket. Each column is a count of how many times a letter is used in that bucket. The final column is a counter for how many times the bucket is used.

Now run through all the words of length less than eight. Select from the database every bucket that a given word can fit into, and increment the counters for each of those buckets.

Find the maximum count; i.e., the counter for the bucket where the most words fit.

Find the row this occurs in, and grab the letters that the bucket represents.

In the database, the letter columns are in reverse-frequency order, or frequency-ascending order. That way, 'q' is the first letter column, and so on, finishing at 'e'. The queries also use this reverse-frequency order, so that the "AND"s within the "WHERE" clause can narrow down the list of records to include as early as possible. ...at least that's my theory. ;)

The code is embarrassingly dirty, and there are several significant opportunities to further trim cycles. But I wanted to put the ideas out there in case someone else wanted to consider the methodology.
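
To make the methodology concrete, here is a sketch with the database layer swapped for an in-memory hash - not the original code, and it skips the column-ordering trick entirely (assumes @words holds the trimmed dictionary):

    # buckets keyed by the sorted letters of each 8-letter word;
    # the 8-letter word itself counts as the bucket's first hit
    my %bucket;
    for my $word ( grep { length($_) == 8 } @words ) {
        $bucket{ join '', sort split //, $word }++;
    }

    # every shorter word bumps the counter of each bucket it fits into
    for my $word ( grep { length($_) < 8 } @words ) {
      SIG: for my $sig ( keys %bucket ) {
            my %pool;
            $pool{$_}++ for split //, $sig;
            for my $c ( split //, $word ) {
                next SIG unless $pool{$c};
                $pool{$c}--;
            }
            $bucket{$sig}++;
        }
    }
    my ($best) = sort { $bucket{$b} <=> $bucket{$a} } keys %bucket;
    print "$best: $bucket{$best} words\n";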

...where one solution is eehnortw, covering four words, but your script says [1] aceesttx : exactest.

Update: After further study, it looks to me that you are checking all 8-letter classes derived from the 8-letter words in the dictionary and seeing how many words are captured. That is probably giving you the correct solution for many real-world dictionaries. This way, you avoid the combinatorial explosion that makes this challenge difficult...

A simple greedy algorithm takes only 20-45 seconds to find some of the top contenders. It does better on my large wordlist than on the smaller 2of12inf.txt. I didn't bother to extend it to examine more branches or to perform an exhaustive search on the last 2 or 3 letters, since it did so well.

My strategy is to aggregate the word counts upward, from candidates with the fewest letters to those with the most. (This approach is not thorough, but it is fast; see Update #2.) The trick to making it fast is to realize that there is only a one-letter difference between candidate tiers, and to cycle through the 26 letters for subset matching.
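
For comparison, a plain one-letter-at-a-time greedy pass - not the tiered aggregation described above - might look like this (assumes @words holds the word list):

    # grow the letter multiset one letter at a time, each round keeping
    # the letter that leaves the most words still spellable within the
    # 8-letter budget
    my $set = '';
    for my $round ( 1 .. 8 ) {
        my ( $best_letter, $best_count ) = ( '', -1 );
        for my $letter ( 'a' .. 'z' ) {
            my $trial = $set . $letter;
            my $count = grep { reachable( $_, $trial ) } @words;
            ( $best_letter, $best_count ) = ( $letter, $count )
                if $count > $best_count;
        }
        $set .= $best_letter;
    }
    print "greedy pick: ", join( '', sort split //, $set ), "\n";

    # true if $word's letters merged with the trial multiset still
    # need no more than 8 letters in total
    sub reachable {
        my ( $word, $trial ) = @_;
        my %need;
        $need{$_}++ for split //, $trial;
        my %have;
        $have{$_}++ for split //, $word;
        for my $c ( keys %have ) {
            $need{$c} = $have{$c} if $have{$c} > ( $need{$c} // 0 );
        }
        my $total = 0;
        $total += $_ for values %need;
        return $total <= 8;
    }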

My winner is aeilprst, which can make at least 343 words (using all 8 letters, or a subset of them). This may not be the best answer because of the simplistic aggregation, but it points to the likely champs.

Looking at the other replies makes me question myself. What am I doing wrong?

Other results have winners that create only hundreds of words. Seems strange, given that 40k+ words were processed. Is that for words using all 8 letters only (and not a subset of the letters)?

Update: I see what's wrong now. The aggregation has to keep track of contributing counts of specific candidates. Back to drawing board.

Update #2: I have updated the script and results, which are now more in line with what others have gotten. The total word counts do come up short because of the aggregation strategy, but the tradeoff in speed is significant.