I bet somebody has solved this before, but my searches have come up empty.

I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.

Example: doll dollhouse house

These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.
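To make the bookkeeping concrete, here is a minimal sketch of the packed buffer plus a (start, length) table, using 0-based offsets. The names `buffer`, `index`, and `lookup` are hypothetical and only for illustration.

```python
# Hypothetical illustration: one shared buffer plus a (start, length) entry per word.
buffer = "dollhouse"

# word -> (start, length), with 0-based start positions
index = {
    "doll": (0, 4),
    "dollhouse": (0, 9),
    "house": (4, 5),
}

def lookup(word):
    """Return the word's text by slicing the shared buffer."""
    start, length = index[word]
    return buffer[start:start + length]

# Every word is readable straight out of the buffer, with no unpacking step.
for word in index:
    assert lookup(word) == word
```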

What I've come up with so far is:

Sort the words longest to shortest: (dollhouse, house, doll)

Scan the buffer to see if the string already exists as a substring; if so, note the location.

If it doesn't already exist, add it to the end of the buffer.

Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
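For reference, the approach described above could be sketched roughly like this. The function name `pack_longest_first` is made up for the example, and this is only one reading of the steps, not a tuned implementation.

```python
def pack_longest_first(words):
    """Append each word to the buffer unless it already occurs in it."""
    buffer = ""
    index = {}
    # Longest words first, so shorter words have a chance to be found inside them.
    for word in sorted(words, key=len, reverse=True):
        pos = buffer.find(word)
        if pos == -1:
            # Not present yet: append at the end and remember where it starts.
            pos = len(buffer)
            buffer += word
        index[word] = (pos, len(word))
    return buffer, index

buffer, index = pack_longest_first(["doll", "dollhouse", "house", "ragdoll"])
print(buffer)  # dollhouseragdoll
```

The result is 16 characters, whereas ragdollhouse needs only 12, because this approach never overlaps the end of one word with the start of another.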

This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.

What you're describing is what all compression algorithms do, except you're adding the constraint of looking at plain text words as the elements being compressed rather than bits.
–
Richard Nichols May 10 '09 at 13:44

It's not quite the same as compression algorithms, because each word must maintain its "wordiness". Like I said in another comment, you can't combine "lawman" and "woman", but in compression, it'd be fine to compress "man" together because you don't need to maintain one consistent buffer.
–
Daniel Lew May 10 '09 at 13:46

Also, FWIW, the solution should be able to capitalize on multiple suffix and prefix matches. So if my wordlist had "lawman", "woman", "manage" and "mangle", it should be able to form "lawmanage" and "womangle".
–
Daniel Lew May 10 '09 at 13:47

Daniel Lew is on the right track. I'm looking for packing, not compression. Maybe I'll just use a genetic algorithm to find a decent packing.
–
Adrian McCarthy May 10 '09 at 16:47

8 Answers

This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
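As a rough illustration of this first step, here is a naive quadratic scan, not the generalised suffix tree mentioned above (which would be faster on large lists). The function name `drop_contained` and the shape of its return value are assumptions made for this sketch.

```python
def drop_contained(words):
    """Separate words that are substrings of other words.

    Returns (kept, contained), where contained maps each dropped word
    to (containing word, offset inside that word).
    """
    # Longest first: a word can only be contained in a strictly longer word.
    unique = sorted(set(words), key=lambda w: (-len(w), w))
    kept = []
    contained = {}
    for word in unique:
        for container in kept:
            offset = container.find(word)
            if offset != -1:
                contained[word] = (container, offset)
                break
        else:
            # No kept word contains this one, so it has to be packed itself.
            kept.append(word)
    return kept, contained

kept, contained = drop_contained(["doll", "dollhouse", "house", "ragdoll"])
print(kept)       # ['dollhouse', 'ragdoll']
print(contained)  # {'house': ('dollhouse', 4), 'doll': ('dollhouse', 0)}
```

Once the final buffer is built, a dropped word's position is the position of its container plus the stored offset.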

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
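A minimal sketch of the greedy merging step, assuming fully-contained strings have already been removed. It computes overlaps by brute force instead of using radix trees or a suffix tree, so it only shows the idea; `overlap` and `greedy_superstring` are names invented here.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(words):
    """Repeatedly merge the pair of strings with the longest overlap."""
    # Assumption: no word is a substring of another (filter those out first).
    strings = list(dict.fromkeys(words))  # drop duplicates, keep order
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""

print(greedy_superstring(["dollhouse", "ragdoll"]))  # ragdollhouse
```

Word positions can then be recovered with `str.find` on the result. For tens of thousands of words the brute-force pair scan in each round would likely be too slow, which is where a tree-based overlap lookup would come in.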

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

I believe that only works with strings that start with common substrings. Strings that end with common substrings will not be recognized. Correct me if I'm wrong.
–
Zifre May 10 '09 at 13:31

If strings end with a common substring, they wouldn't be matched up anyways based on this description. Doing so would cause the individual strings to become messed up.
–
Daniel Lew May 10 '09 at 13:41

To elaborate, if you had "woman" and "lawman", you can't combine them even if you wanted to. The only way combination works (as I understand the problem) is if a suffix of one word matches a prefix of another.
–
Daniel Lew May 10 '09 at 13:43

My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.

What you are suggesting sounds like it could be implemented with a double radix tree (one forward and one backward). This would work in most cases, but if the strings have common parts in the middle, but not on the edges, it won't work.
–
Zifre May 10 '09 at 13:34

For example, it wouldn't recognize that consuming contains sum.
–
Zifre May 10 '09 at 15:48

Could you just explain to us the link with the Knapsack Problem?
–
akappa May 10 '09 at 15:14

The Knapsack problem (optimally packing some goods in a bag) looked similar to me. In fact (see j_random_hacker's answer) this is an NP-complete problem, like the Knapsack one.
–
friol May 10 '09 at 15:27

Yes, but I still can't see the similarity of that problem with the KP. 3-SAT is NPC, but I certainly can't say that it is similar to that "string packing" problem.
–
akappa May 10 '09 at 15:34

The "bag" is the string with the shortest length (the "optimally packed" one). Packing the goods into the bag is similar to adjusting the substrings in the "main" one: in both cases you have constraints (substring constraint or total weight limitation).
–
friol May 10 '09 at 15:42

Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols

I'm after packing, not compression. At run-time, I want the full text of each word readily accessible. I could do that without any sort of packing, but I recognized that packing could give me a significant reduction in footprint and improved locality of reference.
–
Adrian McCarthy May 10 '09 at 16:42

How is your packing and unpacking different from any other compression and decompression algorithm?
–
martinus May 11 '09 at 11:42

With compression, you have to decompress. With packing as I've described, there's no unpacking required. I have the full text of the original words directly available.
–
Adrian McCarthy May 11 '09 at 17:34

Do you want a data structure that stores the strings in a memory-conscious manner while still allowing operations like search in a reasonable amount of time?

Do you just want an array of words, compressed?

In the first case, you can go for a patricia trie or a String B-Tree.

For the second case, you can just adopt some index compression technique, like this:

If you have something like:

aaa
aaab
aasd
abaco
abad

You can compress it like this:

0aaa
3b
2sd
1baco
3d

The number is the length of the longest common prefix with the preceding string.
You can tweak that scheme, e.g. by forcing a "restart" of the common prefix (a word stored in full) every K words, for fast reconstruction.
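The prefix scheme above (often called front coding) can be sketched as follows. The function names are hypothetical, and the periodic "restart" is left out to keep the example short. The count is computed as the longest shared prefix, so abad following abaco is stored as 3 plus d.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_encode(sorted_words):
    """Encode each word as (length of prefix shared with previous word, rest)."""
    encoded = []
    prev = ""
    for word in sorted_words:
        shared = shared_prefix_len(prev, word)
        encoded.append((shared, word[shared:]))
        prev = word
    return encoded

def front_decode(encoded):
    """Rebuild the word list; each word depends on the one before it."""
    words = []
    prev = ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        words.append(prev)
    return words

words = ["aaa", "aaab", "aasd", "abaco", "abad"]
encoded = front_encode(words)
print(encoded)  # [(0, 'aaa'), (3, 'b'), (2, 'sd'), (1, 'baco'), (3, 'd')]
assert front_decode(encoded) == words
```

Note that reading a single word requires decoding forward from the last word stored in full, which is the cost the "restart" every K words is meant to bound.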

Note that the last scheme should compress much more than the packing you've suggested. Of course, you can't then refer to a word with a single pointer; you need a tuple (pointer to the first word with 0 prefix, offset).
–
akappa May 10 '09 at 15:36

I'm not looking for a compression method. I need fast random-access to the full text of each word, so I don't want to decompress on the fly. Packing reduces the memory footprint and improves locality of reference.
–
Adrian McCarthy May 10 '09 at 16:44

Are you sure that it improves locality? Locality depends largely on the order in which you request words, not only on the memory footprint (except in edge cases, of course). And are you really sure that it greatly reduces the memory footprint? It seems to me that this optimization can be a good thing if you have a particular set of strings, but it's practically useless on, e.g., natural language words.
–
akappa May 10 '09 at 18:12