How to parse files that cannot fit entirely in RAM

I have created a framework to parse text files of reasonable size that can fit in RAM, and for now things are going well. I have no complaints, but what if I encounter a situation where I have to deal with large files, say, greater than 8 GB (which is the size of mine)?
What would be an efficient approach to deal with such large files?

Chances are you are parsing the file line by line. So read in a large block (4 KB or 16 KB) and parse all the complete lines in it. Copy the small remainder to the beginning of the buffer and read into the rest of the buffer. Rinse and repeat.
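A minimal sketch of that block-and-remainder loop (assuming newline-terminated lines shorter than the buffer; handle_line() is a hypothetical stand-in for your parser):

```cpp
#include <cstdio>
#include <cstring>

static size_t line_count = 0;  // demo state so the sketch does something visible

// Hypothetical parser callback -- replace with your own line handling.
static void handle_line(const char *line, size_t len) {
    (void)line; (void)len;
    ++line_count;
}

// Read the file in 16 KB blocks, parse every complete line in the block,
// then move the partial last line to the front and refill behind it.
static void parse_in_blocks(FILE *fp) {
    char buf[16 * 1024];
    size_t have = 0;  // bytes currently held in buf
    size_t got;
    while ((got = fread(buf + have, 1, sizeof buf - have, fp)) > 0) {
        have += got;
        size_t start = 0;
        for (size_t i = 0; i < have; ++i) {
            if (buf[i] == '\n') {          // a complete line ends here
                handle_line(buf + start, i - start);
                start = i + 1;
            }
        }
        memmove(buf, buf + start, have - start);  // keep the remainder
        have -= start;
    }
    if (have > 0)  // final line with no trailing newline
        handle_line(buf, have);
}
```

Memory use stays at one block regardless of file size; only lines longer than the buffer would need extra handling.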

For JSON or XML you will need an event-based parser that can accept multiple blocks of input.

First of all, I wouldn't suggest holding such big files in RAM; use streams instead. Buffering is usually done by the library as well as by the kernel.

If you are accessing the file sequentially, which seems to be the case, then you probably know that all modern systems implement read-ahead algorithms, so just reading the whole file ahead of time into RAM will in most cases just waste time.

You didn't specify the use cases you have to cover, so I'm going to have to assume that using streams like

std::ifstream

and doing the parsing on the fly will suit your needs. As a side note, also make sure your operations on files that are expected to be large are done in separate threads.
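For example, a sketch of on-the-fly parsing with std::ifstream (count_matches() and the word-counting task are made up for illustration; memory use stays constant no matter how large the file is):

```cpp
#include <fstream>
#include <string>

// Stream the file line by line; the library and kernel handle buffering,
// so only one line is ever held in memory.
long count_matches(const std::string &path, const std::string &word) {
    std::ifstream in(path);
    long hits = 0;
    for (std::string line; std::getline(in, line); )
        if (line.find(word) != std::string::npos)
            ++hits;
    return hits;
}
```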

The concepts of maximum and available memory are not so evident: technically, you are not limited by the RAM size, but by the quantity of memory your environment will let you allocate and use for your program. This depends on various factors:

What ABI you compile for: the maximum memory size accessible to your program is limited to less than 4 GB if you compile for 32-bit code, even if your system has more RAM than that.

What quota the system is configured to let your program use. This may be less than available memory.

What strategy the system uses when more memory is requested than is physically available: most modern systems use virtual memory and share physical memory between processes and system tasks (such as the disk cache) using very advanced algorithms that cannot be described in a few lines. It is possible on some systems for your program to allocate and use more memory than is physically installed on the motherboard, swapping memory pages to disk as more memory is accessed, at a huge cost in lag time.

There are further issues in your code:

The type long might be too small to hold the size of the file: on Windows systems, long is 32-bit even on 64-bit versions, where memory can be allocated in chunks larger than 2 GB. You must use a different API to request the file size from the system.
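One portable option, assuming C++17 is available, is std::filesystem::file_size, which returns std::uintmax_t rather than long (on POSIX, fstat() or ftello() with a 64-bit off_t works too):

```cpp
#include <cstdint>
#include <filesystem>

// Returns the file size in a type wide enough for files over 2 GB
// on every platform, unlike long on 64-bit Windows.
std::uintmax_t size_of(const std::filesystem::path &p) {
    return std::filesystem::file_size(p);  // throws filesystem_error on failure
}
```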

You read the file with a series of calls to fgets(). This is inefficient: a single call to fread() would suffice. Furthermore, if the file contains embedded null bytes ('\0' characters), chunks of the file will be missing in memory, and you cannot handle embedded null bytes at all if you use string functions such as strstr() and strcpy() for your string-deletion task.
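A sketch of that fix, keeping the original whole-file-in-memory approach: one fread(), then length-based searching with std::search instead of strstr(), so embedded null bytes survive (read_and_filter() and its parameters are made up for illustration; it assumes a non-empty pattern):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Read the whole file with a single fread() and remove every occurrence
// of pattern, comparing by length so '\0' bytes are ordinary data.
std::vector<char> read_and_filter(FILE *fp, size_t file_size,
                                  const char *pattern, size_t plen) {
    std::vector<char> buf(file_size);
    buf.resize(fread(buf.data(), 1, buf.size(), fp));  // one bulk read

    std::vector<char> out;
    out.reserve(buf.size());
    auto it = buf.begin();
    for (;;) {
        auto hit = std::search(it, buf.end(), pattern, pattern + plen);
        out.insert(out.end(), it, hit);  // keep everything before the match
        if (hit == buf.end())
            break;
        it = hit + plen;                 // skip the matched pattern
    }
    return out;
}
```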

The condition in while (ptr = strstr(ptr, pattern)) is an assignment. While not strictly incorrect, it is poor style: it confuses readers of your code and prevents life-saving compiler warnings where such assignment-conditions are coding errors. You might think that could never happen, but anyone can make a typo, and a missing = in a test is difficult to spot and has dire consequences.

Your shorthand use of the ternary operator in place of an if statement is quite confusing too: outputfile ? fp = fopen(outputfile, "w") : fp = fopen(filename, "w");

Rewriting the input file in place is risky too: if anything goes wrong, the input file will be lost.

Note that you can implement the filtering on the fly, without a buffer, albeit inefficiently:
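One way to sketch it (filter_stream() is hypothetical; it holds back at most strlen(pattern) characters in a sliding window, so memory use is O(pattern length), and the per-character memmove() is what makes it inefficient):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

// Copy in to out, deleting every occurrence of pattern on the fly.
// Holds only a pattern-sized window; assumes pattern is non-empty.
void filter_stream(FILE *in, FILE *out, const char *pattern) {
    size_t plen = strlen(pattern);
    std::vector<char> win(plen);
    size_t fill = 0;
    int c;
    while ((c = getc(in)) != EOF) {
        win[fill++] = (char)c;
        if (fill == plen) {
            if (memcmp(win.data(), pattern, plen) == 0) {
                fill = 0;                              // match: drop the window
            } else {
                putc(win[0], out);                     // emit the oldest byte
                memmove(win.data(), win.data() + 1, plen - 1);
                fill = plen - 1;
            }
        }
    }
    fwrite(win.data(), 1, fill, out);                  // flush the partial tail
}
```

Emitting only the oldest byte on a mismatch, rather than the whole window, is what keeps overlapping near-matches (such as "aaab" against pattern "aab") from being missed.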

An alternative solution: if you're on a Linux system and you have a decent amount of swap space, just open the whole bad boy up. It will consume your RAM and also consume hard-drive space (swap). Thus you can have the entire thing open at once, just not all of it will be in RAM.

Pros

If an unexpected shutdown occurs, the memory in the swap space is recoverable.

RAM is expensive, HDDs are cheap, so the application would put less strain on your expensive equipment.

Viruses could not harm your computer because there would be no room in RAM for them to run.

You'll be taking full advantage of the Linux operating system by using the swap space. Normally the swap space module is not used and all it does is clog up precious RAM.

The additional energy needed to utilize the entirety of the RAM can warm the immediate area. Useful during wintertime.

You can add "Complex and Special Memory Allocation Engineering" to your resume.

Code can use an array of line indexes. This index array can be kept in memory at a fraction of the size of the large file. Access to any line is then accomplished quickly via a lookup, a seek with fsetpos(), and an fread()/fgets(). As lines are edited, the new lines can be saved, in any order, in a temporary text file. Saving the file reads both the original file and the temp one in sequence to form and write the new file.

Additionally, with enormous files, the array line_index[] itself can be kept on disk too, since its location for any line is easily computed. In an extreme sense, only one line of the file needs to be in memory at any time.
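A sketch of the index-then-seek scheme described above (assuming lines fit in the caller's buffer; for files over 2 GB substitute fseeko()/_fseeki64() for the plain fseek() shown here):

```cpp
#include <cstdio>
#include <vector>

// One sequential pass records the byte offset where each line starts.
std::vector<long> build_line_index(FILE *fp) {
    std::vector<long> index{0};
    long pos = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        ++pos;
        if (c == '\n') index.push_back(pos);  // next line starts here
    }
    if (index.back() == pos)                  // trailing newline: no line there
        index.pop_back();
    return index;
}

// Random access to line n: seek to its recorded offset, read one line.
bool get_line(FILE *fp, const std::vector<long> &index, size_t n,
              char *buf, int buflen) {
    if (n >= index.size()) return false;
    fseek(fp, index[n], SEEK_SET);
    return fgets(buf, buflen, fp) != nullptr;
}
```

The index costs one long per line, so a 10-million-line file needs about 80 MB of index even if the file itself is many gigabytes.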
