I have a text file with about 100,000 lines (5 MB), which is updated once a day. It grows at a rate of about 30 lines a day. The lines are not sorted in any way. Each line is 50 hex characters long and looks like this:

ABCDE9DAF1F66C10C02F25A1685821F8428422F5870F39A3FE

Given one of these strings, I need to figure out if it exists in this file. I am working with C# (.NET CF 2.0) on a handheld device, so memory is limited. I have the ability to process the file beforehand on a Windows server. What is the fastest way to do this? Some of my initial ideas include: sorting the file and doing a line-by-line string compare, creating a binary file to search, or using SQLite.

From OP's comments (an important one, which was left out from the question initially):

The file is read-only. No changes will ever be made by my programs. I get a new version of the file each day with more strings appended to the end.

You could halve the size of the file by treating the string as an integer. Since you are looking for 'best', is there any distribution the strings follow? For instance, does a string starting with "A" occur more frequently than a string starting with "E" etc...
– Aryabhatta, Feb 28 '11 at 20:49

The strings are 100 bits long, how can I treat them as integers? I have no idea what strings could be added to the file in the future, so I don't want to design a clever search technique that depends on a certain distribution when that could change in the next few months.
– Ben, Feb 28 '11 at 20:58

ABCD takes 4 bytes as a string but only 2 bytes as an integer, though it seems like you are already storing them that way, since you talk about 100 bits instead of 200.
– Aryabhatta, Feb 28 '11 at 21:07
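The space saving the comments describe is simple to demonstrate: packing each 50-character hex line into 25 raw bytes halves the on-disk size relative to ASCII text. A quick sketch (in Python purely for illustration; in C# you'd do the equivalent with a byte array):

```python
# One line from the file, as 50 ASCII hex characters (50 bytes on disk).
line = "ABCDE9DAF1F66C10C02F25A1685821F8428422F5870F39A3FE"

# Packing the hex digits into raw bytes halves the storage: 25 bytes.
packed = bytes.fromhex(line)
assert len(packed) == 25

# Unpacking restores the original string for display or comparison.
assert packed.hex().upper() == line
```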

5 Answers

The optimal way to do this would probably be to pre-sort the file on the server and use memory-mapped files to do a binary search over it. That said, .NET CF 2.0 doesn't support memory-mapped files.

You're probably best off just pre-sorting the file and using stream access to perform a binary search on it. That's not ideal, since the reads aren't sequential, but seeing as you're on CF there's a good chance the data store on the device is flash-based, so the random access a binary search needs probably won't be too bad...

Another idea might be to keep the main file in unsorted order and use a secondary file to check for duplicates.

Have a format that uses a small hash value and a fixed number of offset slots. The hash value is an offset into the secondary file; at that offset lives an array of offsets into the primary file. When any hash bucket fills up, you'd need to rebuild with a bigger hash value and a bigger secondary file. Or you could use some trick like cuckoo hashing.
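A minimal in-memory sketch of that index, with hypothetical parameters (a Python illustration; the bucket function and sizes are assumptions, not anything from the answer). On disk, each bucket would be a fixed-size slot so its position is computable from the hash alone:

```python
NBUCKETS = 1024   # number of hash buckets (assumed; tune to the data)
SLOTS = 8         # offsets per bucket; rebuild with a bigger table when full

def bucket_of(rec):
    # Cheap example hash: first 4 hex chars as an integer, mod bucket count.
    return int(rec[:4], 16) % NBUCKETS

def build_index(records):
    """Map each record's bucket to the list of its offsets in the primary file."""
    index = [[] for _ in range(NBUCKETS)]
    for offset, rec in enumerate(records):
        bucket = index[bucket_of(rec)]
        if len(bucket) >= SLOTS:
            raise OverflowError("bucket full: rebuild with a bigger table")
        bucket.append(offset)
    return index

def lookup(records, index, rec):
    # Hash once, then check only the few primary-file offsets in that bucket.
    return any(records[off] == rec for off in index[bucket_of(rec)])
```

The appeal here is that the daily 30 appended lines only touch a handful of buckets, so the main file never needs re-sorting.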

Keep the file sorted on the server ((c) LorenVS), but do the binary search directly on the file by using the record length (50 hex characters + 2 for CR/LF = 52 bytes) to seek the file pointer to the mid positions and read the strings to compare against. That should minimize the memory needed on the device.

If your application is running and has to prevent appending a duplicate string to the existing file, you could keep the whole file's contents in memory in a Hashtable or a SortedList. When your application starts, you could load this collection on another thread to keep your UI responsive.
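The in-memory idea boils down to a one-time load into a hash set, after which every membership test is an O(1) lookup (sketched in Python; in C# on CF 2.0 this would be a `Hashtable` populated on a background thread):

```python
def load_set(path):
    """Read every non-empty line of the file into a set, once, at startup."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}

# Afterwards, each check is a constant-time hash lookup:
#     if candidate in known_strings: ...
```

Note the trade-off: 100,000 lines of 50 characters is roughly 5 MB of string data before any per-object overhead, which may or may not fit comfortably in a handheld's memory budget.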

Consider that even with SQLite or SQL CE you take on the footprint of the embedded database, and I think 5 MB doesn't scare anybody anymore nowadays.