I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).

I would like to know if there is an established way to do this, or data structures that are particularly well suited to the job. I need to be able to perform (probably amortized) constant-time access. I'm working in C, but am willing to implement higher-level data structures if it is worth it.

My first idea was to go with a linked list of large buffers that would hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses corresponding to the beginning of each line. This array would be reallocated as needed.

Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.

To clarify:

The files I'm concerned with are source files, so their size should be manageable; they should not be modified, and the lines have variable length (though hopefully capped at some maximum).

The problem I'm working on mostly needs a read-only file representation, but I'm very interested in digging around the problem.

If you just need read-only access, just get the file size, read the file into memory with fread(), then maintain a dynamic array that maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems like a good idea in many cases.
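For illustration, a minimal sketch of that read-only approach (the file name, identifiers, and doubling-growth policy are all my own choices, and most error handling is elided):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *f = fopen("input.c", "rb");     /* hypothetical file */
        if (!f) return 1;

        /* Get the file size, then read the whole file into one buffer. */
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *buf = malloc(size + 1);
        if (!buf || fread(buf, 1, size, f) != (size_t)size) return 1;
        buf[size] = '\0';
        fclose(f);

        /* Dynamic array: line number (index) -> pointer to first char. */
        size_t cap = 128, nlines = 0;
        char **lines = malloc(cap * sizeof *lines);
        lines[nlines++] = buf;
        for (long i = 0; i + 1 < size; i++) {
            if (buf[i] == '\n') {
                if (nlines == cap)
                    lines = realloc(lines, (cap *= 2) * sizeof *lines);
                lines[nlines++] = buf + i + 1;
            }
        }

        /* Line 3, column 15 is then lines[2] + 14 (0-based), assuming
         * the line is long enough. */
        printf("%zu lines indexed\n", nlines);
        free(lines);
        free(buf);
        return 0;
    }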

6 Answers

I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.

I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
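A rough sketch of the memory-mapping approach on a POSIX system (identifiers and the file name are mine; the pass below only counts newlines, but the same loop would append each offset to the index):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("input.c", O_RDONLY);   /* hypothetical file */
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) return 1;

        /* Map the file read-only; the OS pages it in on demand. */
        const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) return 1;

        /* The unavoidable linear pass over the bytes to find line endings. */
        long nlines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (base[i] == '\n')
                nlines++;

        printf("%ld newlines in %lld bytes\n", nlines, (long long)st.st_size);
        munmap((void *)base, st.st_size);
        close(fd);
        return 0;
    }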

NOTE: this mechanism will index about 50 MB/s on a desktop PC, which is basically the disk bandwidth.
– wildplasser, Sep 2 '11 at 10:25

Yes, for a read-only (or not-so-often-modified) data structure. And the file has to be small enough to fit into memory whole.
– Rok Kralj, Sep 2 '11 at 10:58

I have never seen a 100 MB source file. And in that case, mmap() (if available) could do the trick. The next step would be buffer-wise: scan the buffer for '\n', rinse, repeat.
– wildplasser, Sep 2 '11 at 11:50

It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.

Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
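For what it's worth, a rope is one classic example of such a balanced tree (the answer doesn't name one, so treat that as my assumption). A bare-bones sketch of O(log n) character lookup, with balancing and editing omitted and all names invented:

    #include <stddef.h>

    /* Internal nodes have both children; leaves carry the text. */
    typedef struct RopeNode {
        struct RopeNode *left, *right;  /* NULL in a leaf */
        size_t weight;                  /* bytes in the left subtree */
        char *text;                     /* leaf payload, NULL otherwise */
    } RopeNode;

    /* Character at absolute offset i: O(log n) if the tree is balanced. */
    char rope_index(const RopeNode *n, size_t i)
    {
        while (n->left) {               /* descend until we hit a leaf */
            if (i < n->weight) {
                n = n->left;
            } else {
                i -= n->weight;
                n = n->right;
            }
        }
        return n->text[i];
    }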

solution: If lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your lookup O(1).
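Something like this sketch, where LINE_LEN is an assumed fixed width and lines are 0-based:

    #define LINE_LEN 80   /* assumption: every padded line is this long */

    /* Offset of (line, col); the +1 accounts for the newline byte. */
    long offset_of(long line, long col)
    {
        return line * (LINE_LEN + 1) + col;
    }

    /* e.g. fseek(f, offset_of(2, 14), SEEK_SET) -> line 3, column 15 */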

If lines are sorted, then you can perform a binary search, making your search O(log(nLines)).

If neither, you can store the indexes of line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to calculate which line it lands in, and then add X to the offsets of all following lines. Similarly with deletion. You essentially get O(nLines) per edit. And the code gets ugly.
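A sketch of that bookkeeping cost (the offsets array and all names are illustrative):

    #include <stddef.h>

    /* After inserting x bytes inside line k, every later line start
     * moves by x -- an O(nLines) fix-up per edit. */
    void shift_offsets(long *offsets, size_t nlines, size_t k, long x)
    {
        for (size_t i = k + 1; i < nlines; i++)
            offsets[i] += x;
    }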

If you want to store the whole file in memory, just create an array of lines, char *lines[]. You then get a line with the first dereference and a character with the second dereference.
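For example, assuming lines was filled in beforehand:

    char *line = lines[2];       /* first dereference: line 3            */
    char  c    = lines[2][14];   /* second dereference: line 3, col 15   */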

As an alternative suggestion (although I do not fully understand the question), you might want to consider a struct-based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.

Build suitable helper functions for allocation and deletion, and make it so the structs can be freed either automatically or manually. The above combination won't get you O(1) reads (which isn't possible unless the file has a static format), but it will get you good times.
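A minimal sketch of such a list of dynamic strings, with the struct layout and helper names invented for illustration:

    #include <stdlib.h>
    #include <string.h>

    typedef struct LineNode {
        char *text;              /* heap-allocated line contents */
        size_t len;
        struct LineNode *next;
    } LineNode;

    /* Allocation helper: copy len bytes of src into a fresh node. */
    LineNode *line_new(const char *src, size_t len)
    {
        LineNode *n = malloc(sizeof *n);
        n->text = malloc(len + 1);
        memcpy(n->text, src, len);
        n->text[len] = '\0';
        n->len = len;
        n->next = NULL;
        return n;
    }

    /* Deletion helper: the manual counterpart mentioned above. */
    void line_free(LineNode *n)
    {
        free(n->text);
        free(n);
    }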

If you know the file has a static line length, i.e. no line is bigger than 256 chars, then all you need is the DynamicListOfArrays: write directly to the current array (preset to 256), create a new one, repeat. The downside is that it wastes memory.

Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.

If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.

Based on actual use by programmers in the real world, what is the average size of a source file? It's a statistical question, much like the current average age of a human being.
– Norswap, Sep 3 '11 at 23:24

Well, that's what I was talking about: in a production environment, for example, you have some source files containing just a few hundred lines, whereas others could contain thousands of lines.
– Mansuro, Sep 4 '11 at 6:21

For more concrete examples, you can check the Linux source code.
– Mansuro, Sep 4 '11 at 8:18

Imagine you could access all source code files in the world that are being directly edited by human beings. What I seek is an estimation of the average file size in this corpus.
– Norswap, Sep 4 '11 at 14:38