I need to parse a large CSV file in real-time, while it's being modified (appended) by a different process. By large I mean ~20 GB at this point, and slowly growing. The application only needs to detect and report certain anomalies in the data stream, for which it only needs to store small state info (O(1) space).

I was thinking about polling the file's attributes (size) every couple of seconds, opening a read-only stream, seeking to the previous position, and then continuing to parse from where I last stopped. But since this is a text (CSV) file, I obviously need to somehow keep track of newline characters when continuing, to ensure I always parse whole lines.
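To make the idea concrete, here is the poll-seek-parse loop sketched in Python (function and parameter names are mine; the important detail is holding back any trailing partial line until the writer finishes it):

```python
import os
import time

def follow_csv(path, handle_line, poll_seconds=2.0, stop=lambda: False):
    """Poll a growing file and feed only complete lines to handle_line.

    Illustrative sketch: bytes after the last newline are kept in a
    buffer until the writer completes that line.
    """
    pos = 0   # byte offset of the first unprocessed byte
    buf = b""
    while not stop():
        size = os.path.getsize(path)
        if size > pos:
            with open(path, "rb") as f:   # read-only, shared with the writer
                f.seek(pos)
                buf += f.read(size - pos)
                pos = size
            # split off complete lines; keep the trailing partial line
            *lines, buf = buf.split(b"\n")
            for line in lines:
                handle_line(line.decode("ascii").rstrip("\r"))
        time.sleep(poll_seconds)
```

The O(1)-space anomaly detection would live inside `handle_line`; only `pos` and the small tail buffer persist between polls.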

If I am not mistaken, this shouldn't be too hard to implement, but I wanted to know if there is a common approach, or a library, which already solves some of these problems?

Note: I don't need a CSV parser. I need info about a library which simplifies reading lines from a file which is being modified on the fly.

Is it possible to stop the CSV processing? If so, I'd suggest transferring it to an RDBMS.
– Oybek Apr 27 '12 at 11:44

@Oybek: can you clarify that a bit? The process which is appending to the file is constantly running, and I need to analyze the data line by line constantly (with several seconds delay).
– Groo Apr 27 '12 at 11:46

I assume you have no control of the process emitting the file?
– Dave Bish Apr 27 '12 at 11:49

I mean, if it is possible to take the CSV processing offline, and if you can spend a little time on development, then you could change your persistent storage from a CSV file to a database. The latter has all kinds of tools (triggers, stored procedures, jobs) that can notify you about any changes, with greater consistency and concurrency support.
– Oybek Apr 27 '12 at 11:49

I just want to note that CSV is not designed as a concurrent data store; it is rather a lightweight data-transfer format, just like JSON or XML.
– Oybek Apr 27 '12 at 11:51

3 Answers

First thought: keep it open. If both the producer and the analyzer open the file in non-exclusive (shared) mode, it should be possible to ReadLine-until-null, pause, ReadLine-until-null again, and so on.
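The same keep-it-open loop, sketched in Python (names are illustrative; on a binary file object, a read that comes back without a trailing newline means the writer hasn't finished that line yet, so we rewind and retry):

```python
import time

def tail_complete_lines(f, handle_line, poll_seconds=2.0, stop=lambda: False):
    """Read complete lines from an already-open binary file object.

    The file stays open while the producer appends. On EOF (empty
    read) or a partial line, rewind to the start of that line and
    wait for more data.
    """
    while not stop():
        pos = f.tell()
        line = f.readline()
        if line.endswith(b"\n"):
            handle_line(line.decode("ascii").rstrip("\r\n"))
        else:
            f.seek(pos)           # incomplete line: retry after the pause
            time.sleep(poll_seconds)
```

The C# equivalent would open the `FileStream` with `FileShare.ReadWrite` so the appending process isn't blocked.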

it should be 7-bit ASCII, just some Guids and numbers

That makes it feasible to track the file position (pos += line.Length + 2, assuming CRLF line endings). Do make sure you open it with Encoding.ASCII. You can then re-open it as a plain binary Stream, Seek to the last saved position, and only then attach a StreamReader to that stream.
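The same bookkeeping sketched in Python (illustrative names): on a binary stream, `tell()` reports the exact byte offset, so there is no need to compute `line.Length + 2` by hand, and LF-only line endings won't silently corrupt the saved position.

```python
def read_from(path, saved_pos):
    """Resume reading complete lines at a saved byte offset.

    Returns (lines, new_pos); new_pos points just past the last
    complete line, so a trailing partial line is re-read next time.
    """
    lines = []
    with open(path, "rb") as f:       # plain binary stream, then seek
        f.seek(saved_pos)
        while True:
            pos = f.tell()
            raw = f.readline()
            if not raw.endswith(b"\n"):
                return lines, pos     # EOF or partial line: stop here
            lines.append(raw.decode("ascii").rstrip("\r\n"))
```

Persist `new_pos` between polls (or across restarts) and pass it back in as `saved_pos`.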

I have not tested it, but I think you can use a FileSystemWatcher to detect when a different process modifies your file. In the Changed event, you can seek to the position you saved earlier and read the additional content.

Why don't you just spin off a separate process/thread each time you start parsing? That way, you move the concurrent (on-the-fly) part away from the data source and towards your data sink, so now you just have to figure out how to collect the results from all your threads...

This will mean doing a reread of the whole file for each thread you spin up, though...

You could run a diff program on the two versions and pick up from there; it depends on how well-formed the CSV data source is. Does it modify records already written, or does it only append new records? If it only appends, you can just split off the new stuff (last position to current EOF) into a new file, and process those at leisure in a background thread:

- the polling thread remembers the last file size
- when the file gets bigger: seek from the last position to the end, save to a temp file
- a background thread processes any temp files still left, in order of creation/modification
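The split-off step above might look like this in Python (a sketch; all names are illustrative, and a C# version would do the same with a FileStream):

```python
import os

def snapshot_new_data(path, last_pos, temp_dir, seq):
    """Copy the newly appended bytes [last_pos, size) into a numbered
    temp file for a background thread to parse later.

    Returns (temp_path_or_None, new_pos).
    """
    size = os.path.getsize(path)
    if size <= last_pos:
        return None, last_pos         # nothing new since the last poll
    temp_path = os.path.join(temp_dir, "chunk-%06d.csv" % seq)
    with open(path, "rb") as src, open(temp_path, "wb") as dst:
        src.seek(last_pos)
        dst.write(src.read(size - last_pos))   # only the appended tail
    return temp_path, size
```

The sequence number keeps the temp files in order so the background thread can process them oldest-first; partial-line handling would still be needed when parsing each chunk.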

Well, the size of the data appended every second is relatively small compared to the entire file size, which is why I'd like to avoid re-reading the file every time (it may easily be 50 GB after a week of measurements). And since data is only appended and the files are very large, diff is not practical. I also don't understand the threading part: since this is a disk operation, reading will not benefit from multiple threads; it may only run slower, IMO. And the step where I write the partial file to disk and then open it again also seems redundant (if I'm copying it, I may as well parse it).
– Groo Apr 27 '12 at 12:23