I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.

I have a list of 20,000 unique strings. I want to know the offset for each string each time it appears in the file. Currently, my output looks like this:

netloader.cc found at offset: 46350917
netloader.cc found at offset: 48138591
netloader.cc found at offset: 50012089
netloader.cc found at offset: 51622874
netloader.cc found at offset: 52588949
...
360doc.com found at offset: 26411474
360doc.com found at offset: 26411508
360doc.com found at offset: 26483662
360doc.com found at offset: 26582000

I am loading the 20,000 strings into a std::set (to ensure uniqueness), then reading a 128 MB chunk from the file and using string::find to search the chunk for each of the strings, then moving on to the next 128 MB chunk. This works and completes in about 4 days. I'm not concerned about a read boundary potentially breaking a string I'm searching for; if it does, that's OK.
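For reference, the loop looks roughly like this (a simplified sketch of the approach, not my exact code; file names are placeholders):

#include <cstdio>
#include <fstream>
#include <set>
#include <string>
#include <vector>

int main() {
    std::set<std::string> patterns;
    // ... load the 20,000 strings into `patterns` ...

    std::ifstream in("huge.txt", std::ios::binary);
    std::vector<char> buf(128 * 1024 * 1024);
    long long base = 0;  // offset of the current chunk within the file
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        std::string chunk(buf.data(), static_cast<size_t>(in.gcount()));
        // One full scan of the chunk per pattern: 20,000 passes per chunk
        for (const std::string& p : patterns)
            for (size_t pos = chunk.find(p); pos != std::string::npos;
                 pos = chunk.find(p, pos + 1))
                std::printf("%s found at offset: %lld\n",
                            p.c_str(), base + static_cast<long long>(pos));
        base += in.gcount();
    }
}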

I'd like to make it faster. Completing the search in 1 day would be ideal, but any significant performance improvement would be nice. I prefer to use standard C++ with Boost (if necessary) while avoiding other libraries.

So I have two questions:

Does the 4-day time seem reasonable considering the task and the tools I'm using?

What's the best approach to make it faster?

Thanks.

Edit: Using the Trie solution, I was able to shorten the run-time to 27 hours. Not within one day, but certainly much faster now. Thanks for the advice.

@piokuc, You're right, but I think he's reading the 20 GB 20,000 times, which means reading through a total of about 390 TB. My suggestion, if the available RAM is known, would be to split the file up into sizable chunks, search each chunk for the strings, dump the chunk, and move on. The method he uses to check for the strings makes a big difference, though.
– SlxS May 3 '13 at 14:29

He says he's reading 128 MB chunks and doing the 20k searches in one chunk before moving to the next one; that's how I understood it.
– piokuc May 3 '13 at 14:31

3 Answers

Algorithmically, I think the best way to approach this problem is to use a tree (a trie) to store the strings you want to search for, one character at a time. For example, if you have the following patterns you would like to look for:

hand, has, have, foot, file

The resulting tree would look something like this (nodes marked with * are the ends of patterns):
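(root)
 ├─ h ─ a ─┬─ n ─ d*
 │         ├─ s*
 │         └─ v ─ e*
 └─ f ─┬─ i ─ l ─ e*
       └─ o ─ o ─ t*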

Generating the tree is worst case O(n) in the total length of the patterns, and it generally has a sub-linear memory footprint, since shared prefixes are stored only once.

Using this structure, you can begin processing your file by reading it one character at a time and walking the tree.

If you get to a leaf node (the ones marked with * above), you have found a match and can record it.

If there is no child node corresponding to the character you have read, you can discard the current line and begin checking the next line, starting again from the root of the tree.

This technique checks for matches in linear time, O(n), and scans the huge 20 GB file only once.

Edit

The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that no search term is a prefix of another (like go and gone). The following is pseudocode of the complete version of the algorithm:

tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Keeps track of where I currently am in the tree
nodes = []
for character in huge_file:
    next_nodes = []
    # Try to extend every partial match currently in progress
    for node in nodes:
        if node.has_child(character):
            child = node.get_child(character)
            if child.is_leaf():
                # You found a match!!
            next_nodes.append(child)
    # A new match may also start at this character
    if tree.has_child(character):
        child = tree.get_child(character)
        if child.is_leaf():
            # You found a match!!
        next_nodes.append(child)
    nodes = next_nodes

Note that the list of nodes that has to be checked at each step contains at most as many entries as the longest search term has characters, so it should not add much overhead.
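One possible C++ rendering of this multi-cursor walk (the TrieNode layout, the byte-at-a-time stream interface, and the offset bookkeeping are my own choices, not from the answer above):

#include <cstdio>
#include <istream>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct TrieNode {
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    bool is_terminal = false;  // a pattern ends here (a "leaf" above)
    std::string pattern;       // the pattern that ends here, for reporting
};

void insert(TrieNode& root, const std::string& pattern) {
    TrieNode* node = &root;
    for (char c : pattern) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_terminal = true;
    node->pattern = pattern;
}

void scan(std::istream& in, const TrieNode& root) {
    std::vector<const TrieNode*> active;  // partial matches in progress
    long long offset = -1;
    char c;
    while (in.get(c)) {
        ++offset;
        active.push_back(&root);          // a new match may start here
        std::vector<const TrieNode*> next;
        for (const TrieNode* node : active) {
            auto it = node->children.find(c);
            if (it == node->children.end()) continue;  // dead branch, drop it
            const TrieNode* child = it->second.get();
            if (child->is_terminal)
                std::printf("%s found at offset: %lld\n",
                            child->pattern.c_str(),
                            offset - (long long)child->pattern.size() + 1);
            if (!child->children.empty()) next.push_back(child);
        }
        active.swap(next);
    }
}

Insert all 20,000 strings, then pass an std::ifstream opened in binary mode to scan; the file is read only once no matter how many patterns there are. Feeding characters from large chunk reads instead of in.get() would keep the I/O pattern you already have.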

+1 this would probably be simpler to implement than Nico's suggestion (Aho-Corasick) and still yield an enormous speed improvement over the current approach. Nice explanation BTW.
– syam May 3 '13 at 14:44

Of course you can still read in chunks; you just check one character of the chunk at a time, which happens sequentially in RAM, so this shouldn't be heavy on I/O.
– decden May 3 '13 at 15:05

The problem you describe looks more like a problem with the selected algorithm than with the technology of choice. 20,000 full scans of 20 GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20 GB file plus a single pass over the 20,000 words.

Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.
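Aho–Corasick builds all the patterns into one trie with failure links, so a single pass over the file finds every occurrence of every pattern. A compact sketch, under my own assumptions about node layout and output format (not taken from this answer), might look like:

#include <cstdio>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

struct Node {
    std::unordered_map<unsigned char, int> next;
    int fail = 0;
    std::vector<int> out;  // patterns ending at this node (or via its fail chain)
};

struct Matcher {
    std::vector<Node> nodes{1};  // node 0 is the root
    std::vector<std::string> patterns;

    void add(const std::string& p) {
        int v = 0;
        for (unsigned char c : p) {
            auto it = nodes[v].next.find(c);
            if (it == nodes[v].next.end()) {
                nodes.emplace_back();
                it = nodes[v].next.emplace(c, (int)nodes.size() - 1).first;
            }
            v = it->second;
        }
        nodes[v].out.push_back((int)patterns.size());
        patterns.push_back(p);
    }

    void build() {  // breadth-first pass fills in the failure links
        std::queue<int> q;
        for (const auto& kv : nodes[0].next) q.push(kv.second);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (const auto& kv : nodes[v].next) {
                unsigned char c = kv.first;
                int u = kv.second;
                int f = nodes[v].fail;
                while (f && !nodes[f].next.count(c)) f = nodes[f].fail;
                auto it = nodes[f].next.find(c);
                nodes[u].fail = (it != nodes[f].next.end() && it->second != u)
                                    ? it->second : 0;
                const auto& fo = nodes[nodes[u].fail].out;  // inherit matches
                nodes[u].out.insert(nodes[u].out.end(), fo.begin(), fo.end());
                q.push(u);
            }
        }
    }

    // Feed one byte; `state` persists across calls, `offset` is the byte's position.
    void feed(unsigned char c, long long offset, int& state) const {
        while (state && !nodes[state].next.count(c)) state = nodes[state].fail;
        auto it = nodes[state].next.find(c);
        state = (it != nodes[state].next.end()) ? it->second : 0;
        for (int pid : nodes[state].out)
            std::printf("%s found at offset: %lld\n", patterns[pid].c_str(),
                        offset - (long long)patterns[pid].size() + 1);
    }
};

Add all 20,000 strings with add(), call build() once, then stream the file through feed() one byte at a time from large buffered reads, carrying the int state (initialized to 0) across chunk boundaries; matches that straddle chunks are then found for free.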

Rather than scanning the file 20,000 times, once per string, you can tokenize the input and look each token up in your std::set of strings to be found; that will be much faster. This assumes your strings are simple identifiers, but something similar can be implemented for strings that are sentences. In that case you would keep a set of the first word of each sentence, and after a successful match verify that it is really the beginning of the whole sentence with string::find.
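A minimal sketch of the tokenize-and-look-up idea (the whitespace-delimited-token assumption and the file names are mine, not the answer's; an unordered_set is used for O(1) average lookups instead of std::set's O(log n)):

#include <cctype>
#include <cstdio>
#include <fstream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> wanted;
    std::ifstream plist("patterns.txt");  // hypothetical list of search strings
    for (std::string p; std::getline(plist, p); )
        wanted.insert(p);

    std::ifstream in("huge.txt", std::ios::binary);
    std::string token;
    long long offset = 0, token_start = 0;
    char c;
    while (in.get(c)) {
        if (std::isspace((unsigned char)c)) {
            if (!token.empty() && wanted.count(token))
                std::printf("%s found at offset: %lld\n",
                            token.c_str(), token_start);
            token.clear();
        } else {
            if (token.empty()) token_start = offset;  // token begins here
            token += c;
        }
        ++offset;
    }
    if (!token.empty() && wanted.count(token))  // flush the final token
        std::printf("%s found at offset: %lld\n", token.c_str(), token_start);
    return 0;
}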