Longest common substring

OK, let's say I need a little push with an assignment. So far, I had to read relatively large files and sort them into a list according to certain criteria (not relevant). Done that.

Now I need to compare two of these lists and find the longest common substring which exists in both lists; in fact, I have to return the length of the longest common substring.

This is the code that I have so far. It works well on text of moderate length, but it surely doesn't do the trick with the files I'm trying to deal with. Could anyone please give a hint how I could improve the time complexity?

It might (and I stress, might) be marginally faster if you looped backwards: range(len(D),0,-1)

In fact it is looped backwards... I set it that way so that the first match found is actually the maximal match. My problem comes from the recurring loops; I'd appreciate a suggestion of any condition under which I could simply break the loop and skip to the next one...
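(The function itself wasn't preserved in this thread. As a stand-in, here is a minimal sketch of the kind of brute force being described, with lengths looped backwards so that the first match found is maximal; all names are my own.)

def lcs_length_brute(a, b):
    # Try lengths from longest to shortest so the first hit is maximal.
    for k in range(min(len(a), len(b)), 0, -1):
        for i in range(len(a) - k + 1):
            if a[i:i + k] in b:  # linear scan of b for each candidate
                return k
    return 0

For two strings of lengths n and m this is O(n^2 * m) in the worst case, which is exactly why it chokes on book-sized inputs.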

My test strings are so long that I didn't wait for either of our functions to finish. My inputs were text from public-domain PDF versions of
A_Christmas_Carol_NT.txt Anna_Karenina_NT.txt

The result so far, starting at character position 33, is the supremely disappointing (but expected; I've looked at these files before today):

This eBook is designed and published by Planet PDF. For more free
eBooks visit our Web site at http://www.planetpdf.com/.

Here's my loopstatus.py, which is useful for occasional reporting during long-running, otherwise silent loops.

A 33
14:22:52 INFO L25 character 63 of 162805
14:23:53 INFO L25 character 127 of 162805
14:25:57 INFO L25 character 255 of 162805
14:30:14 INFO L25 character 511 of 162805
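(loopstatus.py itself didn't survive in this thread. Here is a minimal sketch of a helper that would produce doubling-interval reports like the ones above; the function name and the logging format are my guesses, and %(lineno)d will report the line of the logging call itself.)

import logging

logging.basicConfig(format="%(asctime)s %(levelname)s L%(lineno)d %(message)s",
                    datefmt="%H:%M:%S", level=logging.INFO)

def loop_status(i, total, label="character"):
    # Report only when i is one less than a power of two (63, 127, 255, ...),
    # so a long, otherwise silent loop shows signs of life without
    # flooding the log.
    if i > 1 and i & (i + 1) == 0:
        logging.info("%s %d of %d", label, i, total)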

Lists are not efficient to search once the data is "large". Unfortunately "large" is not a concrete number. A Python list gives constant-time access by index; what costs you is searching it, because every membership test is a linear scan. Say you are checking whether the 100th candidate occurs in each list: you
scan list_1 from the beginning until the candidate is found or the end is reached
scan list_2 from the beginning in the same way
and if both contain it
scan list_1 from the beginning again for the 101st candidate
scan list_2 from the beginning again for the 101st candidate, etc.

You should try a set or dictionary instead, since they are hashed and membership tests take constant time on average. The example below uses a set. Also, you can eliminate most of the lengths used to generate the substrings with the divide-by-two method, i.e. a binary search over lengths. Divide the longest possible length by two and test that length for matching substrings. If none is found, divide the lower half by two and test at a quarter of the longest length; if one is found, divide the upper half instead. Either way you have eliminated half of the possible lengths. This works because if a common substring of length k exists, then common substrings of every shorter length exist too, so the test is monotone in k. Continue dividing by two until you get down to a manageable number of lengths, say 10 or 20, then start at the lowest and increment by one until a match is no longer found. If you want more help with this, post some test data that we can use, along with any problems.
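(A minimal sketch of that set idea combined with the halving, assuming two plain Python strings; the names are mine.)

def has_common(a, b, k):
    # Hash every length-k window of a into a set, then probe with b's windows.
    if k == 0:
        return True
    seen = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in seen for i in range(len(b) - k + 1))

def lcs_length(a, b):
    # Binary search over the answer; has_common is monotone in k.
    lo, hi = 0, min(len(a), len(b))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_common(a, b, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

The catch is memory: the set holds up to len(a) strings of k characters each, which may matter for million-character inputs.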

I tried dwblas's functions (without the demonstration print statements) on the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page-fault convulsions.

Still, hashing of some sort might produce a faster algorithm.

Extrapolating, my code would have finished the job in a little under 2 days. No estimates available for the other algorithms.

Yeah, I encountered the same problem as well. The given functions run a little faster than mine does, but still not fast enough. My assignment page says that the program must run in under 2 minutes to be accepted. I am starting to doubt myself; the text files are so immense I could swim in them. I just can't see how this can be achieved as requested...

You could compute a running hash; then comparing only 1 byte would tell you whether there's a chance the strings match, and only then would you need a full comparison (strcmp).

1 2 3 4 2

The running hash, say I use a sum, with windows of length 3:

1+2+3 starting at position 0
(6-1)+4 starting at position 1
(9-2)+2 starting at position 2

I'd use exclusive or instead of sum, but whatever. It might be part of a faster algorithm.
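(The same running sum in Python, matching the worked example above; a sketch, not anyone's posted code.)

def rolling_sums(seq, k):
    # Slide a length-k window: subtract the value leaving the window and
    # add the value entering it, so each step is O(1) regardless of k.
    h = sum(seq[:k])
    sums = [h]
    for i in range(k, len(seq)):
        h += seq[i] - seq[i - k]
        sums.append(h)
    return sums

print(rolling_sums([1, 2, 3, 4, 2], 3))  # [6, 9, 9]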

Of course one can build faster searches, but there's some overhead in constructing the tables.
find "abca" in "ababcax"
"a" matches "a"
"b" matches "b"
"c" does not match "a" but our precomputed table says that there is an "a" two characters before "c" in the pattern, so shift two characters to the right.
Et cetera.
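(The precomputed table being described is essentially the Knuth-Morris-Pratt failure function. A minimal sketch, offered as an illustration rather than anyone's posted code:)

def kmp_search(pattern, text):
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text, never moving backwards in it.
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1  # index where the match starts
    return -1

print(kmp_search("abca", "ababcax"))  # 2: the mismatch at "c" shifts by two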

Instead, maybe the trick is to remove that inner loop and replace it with built-in (C-level) code. I'll try this next.
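(One way to do that, as a sketch: let the in operator, which is implemented in C, do the inner scan, and only ever test windows longer than the best match so far.)

def lcs_length_cfind(a, b):
    best = 0
    for i in range(len(a)):
        # Only a window longer than the current best can improve it, and
        # the substring membership test below runs at C speed.
        while best < len(a) - i and a[i:i + best + 1] in b:
            best += 1
    return best

This never tests a length twice: best only grows, so the Python-level work is at most len(a) plus the answer's length iterations around the C-level searches.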

There is no way anyone can help without something more specific than "immense".

Now I need to compare two of these lists and find the longest common substring which exists in both lists

Are they lists or files? Do they contain one long string, and if so, how long? Or do they contain many records, each of which has to be checked against many other records, and if so, approximately how long is the longest record?

I had to read relatively large files and sort them into a list according to certain criteria (not relevant). Done that.

This implies that they will fit into existing memory, so a solution which holds the entire files/lists in memory is called for.

I tried dwblas's functions (without the demonstration print statements) on the aforementioned strings of 0.1 and 2.1 million characters, which immediately sent my computer into page-fault convulsions.

Obviously some assumptions are being made which don't apply since the lists have been sorted previously by the OP.


OK, it is time to be more specific.
The examined text is found inside a file containing a single piece of text of approximately 3 million characters without spaces. As a starting point I have to import the text into a LIST inside Python and sort the text so that each index of the list holds a substring roughly 800 characters long on average.

In total I need to create two of these lists (2 files imported). Basically I need to find the longest substring that is common to both lists, but I think that's clear. So I have no choice but to run over all the indexes in both lists and check for the longest common string...

My problem is that no matter how hard I try to make my algorithm faster, I always lose against my lecturer's challenge of finishing the computation in less than 2 minutes...

I am sure there's some sort of trick involving "memoization" and something about the Karp-Rabin algorithm of converting text to integers to compare... I can post my attempt at that if you want to see it, but I repeat, it is nowhere near efficient...
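(For what it's worth, a minimal sketch of that Karp-Rabin direction, combined with the halving of lengths suggested earlier in the thread. The base and modulus are arbitrary choices of mine, and a hash collision can in principle produce a false positive, so a rigorous version would confirm each hit with a direct string comparison.)

def window_hashes(s, k, base=256, mod=(1 << 61) - 1):
    # Rolling polynomial hash of every length-k window, O(1) per step.
    pk = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    out = set()
    for i, ch in enumerate(s):
        if i >= k:
            h = (h - ord(s[i - k]) * pk) % mod
        h = (h * base + ord(ch)) % mod
        if i >= k - 1:
            out.add(h)
    return out

def lcs_length_rk(a, b):
    # Binary search over the length, storing ints instead of substrings,
    # which keeps the memory footprint flat as k grows.
    lo, hi = 0, min(len(a), len(b))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if window_hashes(a, mid) & window_hashes(b, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo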

I did not thoroughly test the program I submitted. Does it give the correct answer?

Also, I'm pretty sure it's faster if the arguments are in the order (shorter_string, longer_string).
If so, you could sort the arguments by length inside the function.
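(Something like this at the top of the function, as a sketch, with a as the first argument and b as the second:)

if len(a) > len(b):
    a, b = b, a  # make a the shorter string, so it drives the outer loop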

Well, I am about to finish the project!
It is still quite inefficient on the 3-million-character string, but I am on my way to getting it done. Do you have any suggestions for a halting condition? Maybe some way to break the loop and prevent the program from comparing useless substrings?