i have this assignment for school which ask us to write code to find the longest common Substring. I have done that, but it only works with text that are not so big and it is being asked to find the common substring for Moby Dick and War And Peace. If you could point me in the right direction of what i'm doing wrong, i would appreciate it. The compiler is complaining that the error is in the substring method of the MyString class when i call it to create the SuffixArray but idk why its saying its too big, giving me the outofmemory

4 Answers
4

You are trying to store, not only the entire long string, but the entire string - 1 letter, and the entire string - 2 letters, etc. All of these are stored separately.

If your String were only 10 letters, you would be storing a total of 55 characters worth in 10 different string.

At 1000 characters, you are storing 500500 characters total.

More generally, you are having to handle, length*(length+1)/2 characters.

Just for fun, I don't know how many characters are in War and Peace, but with a page count around 1250, a typical words/page estimate being 250, and the average word being about 5 characters long, comes to:

I count at least 3 copies of both texts in the first few lines of the code. Here's a few ideas

convert the spaces as you read each line in--not after they are huge strings. Don't forget the case of spaces at the front and end of lines.

build your MyString class using StringBuilder as the base instead of String. Do all the looking inside the StringBuilder with its native methods, if you can.

don't extract strings any more than you have to.

Look up the -Xmx java runtime option and set the heap space large than the default. You'll have to google this as I don't have it memorized. Just notice that -Xmx=1024M needs that M at the end. (Look at the file size to see how big the two books are.)

When you construct MyString, you call arr = str.toCharArray(); which makes a new copy of the string's character data. But in Java, a string is immutable - so why not store a reference to the string instead of a copy of its data?

You construct every suffix at once, but you only refer to one (well, two) at a time. If you recode your solution to only reference the suffixes it currently cares about, and construct them only when it needs them (and lose a reference to them afterwards), they can be garbage collected by Java. This will make running out of memory less likely. Compare the memory overhead of storing 2 strings to storing hundreds of thousands of strings :)