Java unscrambler

I have written an unscrambler in Java. It takes a scrambled word as input and returns the unscrambled word. It works by creating all permutations of the provided word (I borrowed the permutation-generating code from DaniWeb, I guess), comparing each one to an array formed by parsing a dictionary, and reporting all matches.
Please find attached the zip file (Unscrambler) containing all files. One problem is that it tries to dump all permutations into an array, which results in an out-of-memory error when the word is sufficiently large. Any suggestions to overcome this problem would be appreciated.

Firstly, I guess you need not dump all generated permutations into an array. You should only keep those permutations that are actual words, which you can figure out by comparing each generated word with the list of dictionary words that you have gathered.

Searching for each generated word in a long list of existing words (which might run to thousands) is itself a time-consuming task, but you can put the dictionary words into an ArrayList and then check each generated word using the contains() method of that class, so I guess this could be much faster than iterating over the words.

Another very important decision in this algorithm is to generate as few candidate words as possible. In that sense, you should try to discard a combination of letters that is not going to form a word as early as possible. For example, with the 'time' example you've given, you should be able to detect that there are no four-letter words starting with 'tm' and hence not go on to generate the permutations starting with that combination. The longer the word, the more an early cut-off saves. Suppose you have been given a scrambled combination of the letters of the word 'difficult' and you find that there are no nine-letter words starting with the two-letter combination 'df'. You can then skip every arrangement of the remaining seven letters, which saves you the generation time of 5040 (7!) such words, a huge advantage. So work along these lines and keep posting your enhancements; this looks to be a promising task.
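To make the early cut-off concrete, here is a rough sketch of how it might look. It is not the code from the attached zip: the class name PrunedUnscrambler and its methods are made up for illustration, and it assumes the dictionary is available as a List of lowercase words. It keeps only the words of the right length in a TreeSet and uses TreeSet.ceiling() to ask "does any word start with this prefix?" before extending the prefix any further.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: builds arrangements letter by letter and abandons a
// prefix as soon as no dictionary word of the right length starts with it.
class PrunedUnscrambler {

    static List<String> unscramble(String scrambled, List<String> dictionary) {
        List<String> found = new ArrayList<>();
        if (scrambled.isEmpty()) {
            return found;
        }
        // Only words of the same length as the input can be answers.
        TreeSet<String> sameLength = new TreeSet<>();
        for (String word : dictionary) {
            if (word.length() == scrambled.length()) {
                sameLength.add(word);
            }
        }
        char[] letters = scrambled.toCharArray();
        Arrays.sort(letters); // repeated letters become adjacent
        extend(new StringBuilder(), letters, new boolean[letters.length], sameLength, found);
        return found;
    }

    private static void extend(StringBuilder prefix, char[] letters, boolean[] used,
                               TreeSet<String> words, List<String> found) {
        if (prefix.length() > 0) {
            String p = prefix.toString();
            // ceiling(p) is the smallest word >= p; if it does not start
            // with p, no word does, so this whole branch is skipped.
            String next = words.ceiling(p);
            if (next == null || !next.startsWith(p)) {
                return;
            }
        }
        if (prefix.length() == letters.length) {
            found.add(prefix.toString()); // full length and present in the set
            return;
        }
        for (int i = 0; i < letters.length; i++) {
            if (used[i]) {
                continue;
            }
            // Skip a repeated letter in the same slot to avoid duplicate results.
            if (i > 0 && letters[i] == letters[i - 1] && !used[i - 1]) {
                continue;
            }
            used[i] = true;
            prefix.append(letters[i]);
            extend(prefix, letters, used, words, found);
            prefix.setLength(prefix.length() - 1);
            used[i] = false;
        }
    }
}
```

With a dictionary containing "emit", "item", "mite", "time" and "tome", the input "time" never gets past the prefix "tm", because the smallest word at or after "tm" is "tome".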

Alright, that advice comes from someone who has done this program first hand, whereas I haven't had such experience. The optimizations I suggested were some of several that came to my mind on an initial read of your problem statement, so you know who to bet your money on.

@OP ignore this, I am asking a question relevant to this thread rather than giving you advice here.

Don't you have a dictionary stored in alphabetical order? You don't need to compare anything to every possible word, just check alphabetically to see if it is there period. Right?

Comments

VernonDozier: Yes, the fact that it is an ordered list will significantly cut down the search time.

For sure, don't generate every permutation. I am wondering if a hash function could be useful for this. Don't generate any permutations. Get a good hash function for the scrambled words. A hash function that could certainly be improved upon would be this: a = 0, z = 25, add the values of the letters, so

tae = 19 + 0 + 4 = 23

eat, ate, and tea also have this hash value. So when you calculate the hash (23 in this case), have a map where 23 is the key and it maps to the set of all words that also hash to 23. Then see if any of them have the same letters.

A word like "lie" also maps to 23, so the hash function can definitely be improved. The point is that there won't be the whole dictionary to search, just a much smaller set. It'll take a significant time to set up the map, but it only needs to be done once and after that, searches will be extremely fast, so this method is best for a program where you aren't generating the entire map every time you search for a word. If you can figure out a hash function where ONLY words with the same letters map to the same value, that's perfect. There will be nothing left to check.
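A minimal sketch of the letter-sum idea, assuming lowercase a-z words only. The class name LetterSumIndex and its methods are invented for this example. The map is built once from the dictionary; each lookup then only has to check the words in one bucket, and the sameLetters() test weeds out collisions such as "lie".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index: buckets dictionary words by the sum of their letter values.
class LetterSumIndex {
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    LetterSumIndex(List<String> dictionary) {
        for (String word : dictionary) {
            buckets.computeIfAbsent(letterSum(word), k -> new ArrayList<>()).add(word);
        }
    }

    // a = 0 ... z = 25, summed. Assumes lowercase ASCII letters.
    static int letterSum(String word) {
        int sum = 0;
        for (char c : word.toCharArray()) {
            sum += c - 'a';
        }
        return sum;
    }

    // True when both words use exactly the same letters the same number of times.
    static boolean sameLetters(String a, String b) {
        char[] x = a.toCharArray();
        char[] y = b.toCharArray();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    List<String> unscramble(String scrambled) {
        List<String> matches = new ArrayList<>();
        List<String> bucket =
                buckets.getOrDefault(letterSum(scrambled), Collections.<String>emptyList());
        for (String candidate : bucket) {
            // The sum is a weak hash, so collisions still have to be filtered out.
            if (sameLetters(candidate, scrambled)) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```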

Regardless, I agree with BestJewSinceJC; take advantage of searching an ORDERED (i.e. alphabetized) list and cut your search time down from O(n) to O(log n).
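For reference, a small sketch of a sorted-list lookup using Collections.binarySearch from the standard library. The wrapper class SortedDictionary is only a name chosen for this example, and it assumes the words fit in memory as a List.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper around an alphabetized word list.
class SortedDictionary {
    private final List<String> words;

    SortedDictionary(List<String> unsorted) {
        words = new ArrayList<>(unsorted);
        Collections.sort(words); // binarySearch requires a sorted list
    }

    // O(log n) comparisons instead of scanning the whole list.
    boolean contains(String word) {
        // binarySearch returns a non-negative index only when the key is found.
        return Collections.binarySearch(words, word) >= 0;
    }
}
```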

Well, the dictionary words are in sorted order and I do a binary search for each word generated by the permutation, instead of a linear search. This has resulted in a significant improvement in the search times. The actual problem lies with the permutation generation algorithm, which craps out with an "Out Of Memory Error" if a large word, say one with 10 letters, is passed in. I had initially considered performing the comparison for each permutation as it is generated, but this would result in the overhead of either a function call or an object interaction for every generated permutation. I guess the best way could be to generate permutations and perform the comparisons in chunks. For example, generate 10000 permutations and let the comparison function consume them; once it is done, notify the permutation generator to generate the next 10000 permutations, and so on. Please let me know any other suggestions.
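As a point of comparison with the chunking idea, here is a sketch of the per-permutation variant, where each arrangement is handed to a callback as soon as it is produced and only the matches are kept. It is not the DaniWeb permutation code mentioned earlier: the class StreamingPermutations is made up for this example, and it assumes the dictionary is held in a Set. Memory use stays small, although the running time is still proportional to n! for an n-letter word.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

// Hypothetical generator that never stores the full list of permutations.
class StreamingPermutations {

    // Calls visitor once for every arrangement of chars; positions before k are fixed.
    static void permute(char[] chars, int k, Consumer<String> visitor) {
        if (k == chars.length) {
            visitor.accept(new String(chars));
            return;
        }
        for (int i = k; i < chars.length; i++) {
            swap(chars, k, i);
            permute(chars, k + 1, visitor);
            swap(chars, k, i); // restore the order before trying the next letter
        }
    }

    private static void swap(char[] chars, int a, int b) {
        char tmp = chars[a];
        chars[a] = chars[b];
        chars[b] = tmp;
    }

    // Keeps only arrangements found in the dictionary. Using a set for the
    // result also absorbs repeats caused by duplicate letters.
    static Set<String> unscramble(String scrambled, Set<String> dictionary) {
        Set<String> matches = new TreeSet<>();
        permute(scrambled.toCharArray(), 0, p -> {
            if (dictionary.contains(p)) {
                matches.add(p);
            }
        });
        return matches;
    }
}
```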

Vernon, your idea seems very good. Please correct me if I am wrong.
This would save me from generating any permutations at all.
I just need a hashmap with the simple hashing function, with all dictionary words that have the same letters and length placed in the same bucket. Then, when I get the scrambled input, I just need to compute the hash of the scrambled word, figure out which bucket it falls in, and print out all the words in that bucket. Have I understood you correctly?

Not sure how quick Vernon's method would be but I know mine to be pretty fast.

Basically, you sort your scrambled word into alphabetical order.

For example, "lorac" would be "aclor". We then search our dictionary for any five-letter words that also sort to the pattern "aclor". (I used a list of about 100,000 words and found four solutions: "calor", "carlo", "carol", and "coral".) This is essentially the same as using a letter frequency algo, like the one I posted, which you can test by using (if you didn't know already):

java -jar Yay.jar

You can shave off time if you group the dictionary by order of word length.

So, if you want to find a five-letter scrambled word, your program would just search the words that are five letters long and skip all the rest.

This algorithm has none of the memory problems of the permutation approach. It also takes essentially no extra time to unscramble a 9 letter word as opposed to a 3 letter word, since each dictionary word is checked once either way.
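A rough sketch of the sort-and-scan approach, assuming the dictionary is a List of lowercase words. The class SignatureScan and its method names are invented here and are not taken from Yay.jar. Words are grouped by length once, and a lookup only scans the group that matches the input's length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares sorted-letter patterns instead of permutations.
class SignatureScan {

    // "lorac" -> "aclor": the letters of the word in alphabetical order.
    static String signature(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Groups the dictionary by word length so other lengths are never scanned.
    static Map<Integer, List<String>> groupByLength(List<String> dictionary) {
        Map<Integer, List<String>> byLength = new HashMap<>();
        for (String word : dictionary) {
            byLength.computeIfAbsent(word.length(), k -> new ArrayList<>()).add(word);
        }
        return byLength;
    }

    static List<String> unscramble(String scrambled, Map<Integer, List<String>> byLength) {
        String target = signature(scrambled);
        List<String> matches = new ArrayList<>();
        List<String> sameLength =
                byLength.getOrDefault(scrambled.length(), Collections.<String>emptyList());
        for (String word : sameLength) {
            if (signature(word).equals(target)) {
                matches.add(word);
            }
        }
        return matches;
    }
}
```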

Don't you have a dictionary stored in alphabetical order? You don't need to compare anything to every possible word, just check alphabetically to see if it is there period. Right?

But even then you are comparing the generated word with words that share its starting letters, and that certainly takes time, more so for a common start. For example, with 'co' there might be many words in the dictionary (count, common, complex, complexity, compare, comparing, complexion and so on), so you are still making comparisons that are a waste of time. The contains method will be a sure way of being fast, with all left up to the implementation of the Collection class, which certainly is much faster.

I disagree. The fastest possible way to search an alphabetically sorted dictionary would still result in comparing to unwanted results (if you're using a List). And if you just used the default contains method without overriding it, it would just do a sequential (O(n)) search of the List, which would take much longer.

My idea has a significant time overhead in creating the map and the sets and stuff like that, but will pay off if you do a lot of searches since those will be a lot faster than searching through the entire dictionary. The key point is that setting up the map for the entire dictionary only needs to occur once. Your idea is very similar to mine and I'm starting to like it better than mine. Your idea of simply alphabetizing the letters and having THAT be the key makes it unnecessary to develop a good hash function, which might be a strong argument in favor of it.

"aclor" maps to {"calor", "carlo", "carol", "coral"}

The hash method requires, at the very least, that "aclor", "calor", "carlo", etc. all hash to the same value (let's say it's 12345). Might be best to just skip the hash function altogether and have "aclor" be the key.

So the user types in "lorac", you have a function that changes that to "aclor", and "aclor" maps the four words listed above.
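Something along these lines, as a sketch only: the class name AnagramIndex is made up, and it assumes the whole dictionary fits in memory as a List of lowercase words. The map is built once, with the alphabetized letters as the key, and each lookup is a single get.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index: alphabetized letters -> all dictionary words with those letters.
class AnagramIndex {
    private final Map<String, List<String>> bySignature = new HashMap<>();

    // One pass over the dictionary; this is the only expensive step.
    AnagramIndex(List<String> dictionary) {
        for (String word : dictionary) {
            bySignature.computeIfAbsent(signature(word), k -> new ArrayList<>()).add(word);
        }
    }

    // "lorac" -> "aclor"
    static String signature(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // No permutations and no collision check: equal keys mean equal letters.
    List<String> unscramble(String scrambled) {
        return bySignature.getOrDefault(signature(scrambled), Collections.<String>emptyList());
    }
}
```

Because the key is the sorted letters themselves, two words share a key only when they are anagrams of each other, so there is nothing left to verify after the lookup.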

See above. I'm pretty rusty with my implementation of Java Maps and HashMaps. I was thinking more of the overall broad concept. I was definitely thinking that you needed to specify your OWN hash function. Basically, as above, I was assuming that "locar" would hash to some number like 12345, and then that would map to (preferably only) all the words that shared the exact same letters. If you couldn't guarantee a perfect hash function in that regard, you'd have to do further tests on the set that map to 12345 since you may not be able to assume that you don't have "dog" or something else that maps to 12345 as well. You couldn't say that "locar" unscrambled to "dog". Having the key be "aclor" makes this last concern unnecessary.

But almost definitely I think that generating all of the permutations is not the way to go. But you say you are getting an out of memory error, which suggests that you are not only generating all of the permutations, but that you are also STORING all of them, including the ones that aren't in the dictionary, which seems counterproductive.

In Vernon's example, you'd likely have multiple sets of words hash to the same location, but it would still be a lot better than just using an ArrayList. You could use a chained hash table where each index has a linked list and any collisions just go into a linked list at that index.

Yeah, I've read stuff to do with Vernon's idea and creating hash tables etc. It probably has a better time complexity; however, my idea is really, really simple, and you've seen that it doesn't take long at all, not only to find the scrambled word but also to find every possible smaller word that can be made from it.

Time issues only start to rise when the dictionary file is impossibly huge.
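For the "every possible smaller word" part, a letter frequency check could look roughly like the sketch below. This is a guess at the general technique, not the code from Yay.jar: the class SubWordFinder is an invented name, and it assumes lowercase a-z words. A dictionary word qualifies when it needs no letter more often than the scrambled input supplies it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper based on per-letter counts.
class SubWordFinder {

    // Number of occurrences of each letter a..z. Assumes lowercase ASCII.
    static int[] counts(String s) {
        int[] c = new int[26];
        for (char ch : s.toCharArray()) {
            c[ch - 'a']++;
        }
        return c;
    }

    // True if word can be spelled using only the available letters.
    static boolean canBuild(String word, int[] available) {
        int[] need = counts(word);
        for (int i = 0; i < 26; i++) {
            if (need[i] > available[i]) {
                return false;
            }
        }
        return true;
    }

    // One pass over the dictionary, whatever the length of the input.
    static List<String> subWords(String letters, List<String> dictionary) {
        int[] available = counts(letters);
        List<String> result = new ArrayList<>();
        for (String word : dictionary) {
            if (canBuild(word, available)) {
                result.add(word);
            }
        }
        return result;
    }
}
```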
