Tuesday, November 22, 2011

How to read in a file in C++

So here's a simple question: what is the correct way to read an entire file in C++?

Different people favor different solutions: some use the C API, some the C++ API, and some a variation of tricks with iterators and algorithms. Wondering which method is the fastest, I put the various options to the test, and the results were surprising.

First, let me propose an API for the function. We'll pass it a C string (const char *) holding a filename, and we'll get back a C++ string (std::string) holding the file contents. If the file cannot be opened, we'll throw an error indicating why. Of course you're welcome to change these functions to receive and return whatever format you prefer, but this is the prototype we'll be operating on:

std::string get_file_contents(const char *filename);

Our first technique to consider is using the C API to read directly into a string. We'll open a file using fopen(), calculate the file size by seeking to the end, and then size a string appropriately. We'll read the contents into the string, and then return it.
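As a rough sketch of the C API technique just described (not necessarily the exact code that was benchmarked), the function might look like the following. Reporting the failure through strerror(errno) wrapped in std::runtime_error is an assumption about how to "throw an error", and the check on the number of bytes actually read is an added safety measure.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Sketch of the C API approach: find the size, size the string, fread() into it.
std::string get_file_contents(const char *filename)
{
  std::FILE *fp = std::fopen(filename, "rb");
  if (!fp)
  {
    // Assumption: errno describes why fopen() failed (true on POSIX systems).
    throw std::runtime_error(std::strerror(errno));
  }

  // Seek to the end to learn the file size.
  long size = -1;
  if (std::fseek(fp, 0, SEEK_END) == 0)
  {
    size = std::ftell(fp);
  }
  if (size < 0)
  {
    std::fclose(fp);
    throw std::runtime_error("could not determine file size");
  }
  std::rewind(fp);

  std::string contents;
  contents.resize(static_cast<std::string::size_type>(size));
  if (!contents.empty())
  {
    // Read straight into the string's buffer and keep only what was read.
    const std::size_t got = std::fread(&contents[0], 1, contents.size(), fp);
    contents.resize(got);
  }
  std::fclose(fp);
  return contents;
}
```

A caller would simply write std::string text = get_file_contents("input.txt"); and catch std::runtime_error if the file may be missing.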

The next technique constructs the string directly from a pair of istreambuf_iterators. This method is liked by many because of how little code it takes to implement, and because you can read a file directly into all sorts of containers, not just strings. The method was also popularized by the book Effective STL. I'm dubbing the technique "method iterator".
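A minimal sketch of "method iterator", assuming it means constructing the string from a pair of istreambuf_iterators (the approach named later in the results discussion). The error message text is a placeholder.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Sketch of "method iterator": build the string from stream buffer iterators.
std::string get_file_contents(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }
  // A default-constructed istreambuf_iterator acts as the end-of-stream marker.
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

The same two iterators can initialize other containers, such as std::vector<char>, which is much of the technique's appeal.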

Some have looked at the last technique and felt it could be optimized further: if the string knows in advance how big it needs to be, it will reallocate less. So the idea is to reserve the size of the string, then pull the data in.

I will call this technique "method assign", since it uses the string's assign function.
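A sketch of "method assign" under the same assumptions as before: the size is found by seeking to the end, reserve() is called with it, and assign() then pulls the data in through istreambuf_iterators. The error message text is a placeholder.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Sketch of "method assign": reserve the file size, then assign() from iterators.
std::string get_file_contents(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }

  std::string contents;
  in.seekg(0, std::ios::end);
  const std::streamoff size = in.tellg();
  if (size > 0)
  {
    // Hint the final size so the string can allocate once.
    contents.reserve(static_cast<std::string::size_type>(size));
  }
  in.seekg(0, std::ios::beg);
  contents.assign(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>());
  return contents;
}
```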

Some have questioned the previous function, as assign() in some implementations may very well replace the internal buffer, and therefore not benefit from reserving. Better to call push_back() instead, via std::copy() with a back inserter, which will keep the existing buffer if no reallocation is needed.
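A sketch of this variant, listed as "Copy" in the results: reserve first, then let std::copy() append through std::back_inserter, which calls push_back() for each character. The error message text is a placeholder.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Sketch of "method copy": reserve, then push_back() each character via std::copy().
std::string get_file_contents(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }

  std::string contents;
  in.seekg(0, std::ios::end);
  const std::streamoff size = in.tellg();
  if (size > 0)
  {
    contents.reserve(static_cast<std::string::size_type>(size));
  }
  in.seekg(0, std::ios::beg);
  // back_inserter appends with push_back(), so the reserved buffer is reused.
  std::copy(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::back_inserter(contents));
  return contents;
}
```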

Lastly, some want to try another approach entirely. C++ streams have some very fast copying to another stream via operator<< on their internal buffers. Therefore, we can copy directly into a string stream, and then return the string that string stream uses.
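A sketch of the stream-to-stream approach, listed as "Rdbuf" in the results: the file's stream buffer is inserted into a std::ostringstream with operator<<, and the accumulated string is returned. The error message text is a placeholder.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Sketch of "method rdbuf": copy the file's stream buffer into a string stream.
std::string get_file_contents(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }
  std::ostringstream contents;
  // operator<< on a streambuf pointer copies everything up to end of file.
  contents << in.rdbuf();
  return contents.str();
}
```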

Now which is the fastest method to use if all you actually want to do is read the file into a string and return it? The exact speeds in relation to each other may vary from one implementation to another, but the overall margins between the various techniques should be similar.

I conducted my tests with libstdc++ and GCC 4.6; what you see may vary from this.

I tested with multiple megabyte files, reading them in one after another, repeated the tests a dozen times, and averaged the results.
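The actual test harness is not shown, so the following is only a hypothetical sketch of how such an average could be taken. The names reader_fn and average_seconds are made up for illustration, and std::clock() measures processor time rather than wall-clock time, which may or may not match what was measured here.

```cpp
#include <ctime>
#include <string>

// Hypothetical signature shared by all the readers compared in this post.
typedef std::string (*reader_fn)(const char *);

// Run one reader `runs` times on a file and return the mean duration in seconds.
double average_seconds(reader_fn reader, const char *filename, int runs)
{
  if (runs <= 0)
  {
    return 0.0;  // nothing to average
  }
  double total = 0.0;
  for (int i = 0; i < runs; ++i)
  {
    const std::clock_t start = std::clock();
    const std::string contents = reader(filename);
    const std::clock_t stop = std::clock();
    (void)contents;  // only the timing matters here
    total += static_cast<double>(stop - start) / CLOCKS_PER_SEC;
  }
  return total / runs;
}
```

Calling average_seconds(get_file_contents, "big.dat", 12) would mirror the dozen repetitions described above; the filename is a placeholder.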

Method      Duration
C           24.5
C++         24.5
Iterator    64.5
Assign      68
Copy        62.5
Rdbuf       32.5

Ordered by speed:

Method      Duration
C/C++       24.5
Rdbuf       32.5
Copy        62.5
Iterator    64.5
Assign      68

These results are rather interesting. There was no speed difference at all between the C and C++ APIs for reading a file. This should be obvious to us all, yet many people strangely think that the C API has less overhead. The straightforward vanilla methods were also faster than anything involving iterators.

C++ stream to stream copying is really fast. It probably only took a bit longer than the vanilla method due to some reallocations needed. If you're doing disk file to disk file though, you probably want to consider this option, and go directly from in stream to out stream.

The istreambuf_iterator methods, while popular and concise, are actually rather slow. Sure, they're faster than istream_iterators (with skipping turned off), but they can't compete with more direct methods.

A C++ string's internal assign() function, at least in libstdc++, seems to throw away the existing buffer (at the time of this writing), so reserving and then assigning is rather useless. On the other hand, reading directly into a string, or a different container for that matter, isn't necessarily the optimal solution where iterators are concerned. Using the external std::copy() function with a back inserter after reserving is faster than straight-up initialization. You might want to consider this method for inserting into some other containers. In fact, I found std::copy() of istreambuf_iterators with a back inserter into a std::deque to be faster than straight-up initialization (81 vs 88.5), despite a deque not being able to reserve room in advance (nor would that make sense for a deque).

I also found this to be a cute way to get a file into a container backwards, despite a deque being rather useless for working with file contents.
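A sketch of both deque variants follows. The function names are made up, and reading "backwards" as std::front_inserter is an assumption about what was meant, since a front inserter pushes each new character ahead of the previous ones.

```cpp
#include <algorithm>
#include <deque>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Sketch: std::copy() into a deque with a back inserter (file order preserved).
std::deque<char> get_file_deque(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }
  std::deque<char> contents;
  std::copy(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::back_inserter(contents));
  return contents;
}

// Sketch: the same copy with a front inserter, which reverses the contents.
std::deque<char> get_file_deque_reversed(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }
  std::deque<char> contents;
  std::copy(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::front_inserter(contents));
  return contents;
}
```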

Interesting; I am surprised by the poor performance of the 'heavier' solutions although the naff VS performance /may/ be due to checked iterators. (It has been a long time since I've used VS but IIRC it used to enable checked iterators by default.)

Might be worth comparing the performance to a mmap + memcpy type solution (I think boost provides a memory mapped file wrapper somewhere). Or to see how EKOPath performs which uses the Apache STL.

Tests were done from a RAM drive on an isolated machine with no cron. The tests were also run in random orders. But that's beside the point. The numbers are not to indicate exact speeds of anything. The numbers are to indicate which method averages better performance relative to others.

"Averages better" is meaningless statistically. The question to ask is "Are these numbers significantly different at 95% confidence?". In order to get that answer you need more data than just the average (i.e., the standard deviation).

Understood. I was just commenting on "The numbers are to indicate which method averages better performance relative to others." In general your methodology is better than most people's that I've seen :)
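To make the point concrete, here is a hypothetical helper that reports the sample standard deviation alongside the mean for one method's timings. The names TimingSummary and summarize_timings are made up, and no timing data from the post is assumed.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Mean and sample standard deviation of a set of timings.
struct TimingSummary
{
  double mean;
  double stddev;
};

TimingSummary summarize_timings(const std::vector<double> &samples)
{
  if (samples.size() < 2)
  {
    throw std::invalid_argument("need at least two samples");
  }
  double sum = 0.0;
  for (std::size_t i = 0; i < samples.size(); ++i)
  {
    sum += samples[i];
  }
  const double mean = sum / samples.size();

  double squared = 0.0;
  for (std::size_t i = 0; i < samples.size(); ++i)
  {
    squared += (samples[i] - mean) * (samples[i] - mean);
  }

  TimingSummary result;
  result.mean = mean;
  // Divide by n - 1 for the sample (not population) standard deviation.
  result.stddev = std::sqrt(squared / (samples.size() - 1));
  return result;
}
```

With the mean, standard deviation, and run count for each method, a two-sample test can then say whether two averages differ significantly.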

Hello, thank you for this article. I was leaning toward the assign method, but I'm going with the C++ method instead now. Have you tested your results with GCC 4.8? I would love to see a follow-up where you do the same tests with today's implementations.

In summary: This article discusses the incorrect assumption by some programmers that you can write to the pointer returned by c_str() AND it also discusses how one should not treat the address of a character reference (string[]) as a location that can be written to. It provides the reasons why - those being that the internal format of the string may not be contiguous memory, even though c_str() returns a pointer to contiguous memory.

However it doesn't discuss using &str[0] as a writable buffer pointer when you've resized an empty string. In other words, it would be difficult to conceive of an implementation that would resize a null string to a set of non-contiguous buffers. I presume this is why you feel it's safe to use str.resize() followed by a write to &str[0].
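For reference, a sketch of the resize() followed by a write to &str[0] pattern, which is what the "C++" method in the results appears to be, might look like this. C++11 requires std::string storage to be contiguous, so the write is well defined there; the empty-string guard and the error messages are added assumptions.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Sketch: size the string with resize(), then read() directly into &contents[0].
std::string get_file_contents(const char *filename)
{
  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in)
  {
    throw std::runtime_error(std::string("cannot open ") + filename);
  }

  in.seekg(0, std::ios::end);
  const std::streamoff size = in.tellg();
  if (size < 0)
  {
    throw std::runtime_error("could not determine file size");
  }
  in.seekg(0, std::ios::beg);

  std::string contents;
  contents.resize(static_cast<std::string::size_type>(size));
  if (!contents.empty())
  {
    // Write into the string's own buffer, then keep only what was read.
    in.read(&contents[0], static_cast<std::streamsize>(contents.size()));
    contents.resize(static_cast<std::string::size_type>(in.gcount()));
  }
  return contents;
}
```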

Interesting. I have a similar problem: I am trying to read the content of a text file continuously in VS C++, while a Python program changes the content of that file at random time intervals. VS C++ only reads the text file the first time and never picks up the changes. Can anyone please tell me how to do it?

Wouldn't it have been more efficient to pass the target string via the call to avoid making a copy of the string during the return? At the time you wrote this there weren't any move semantics available in C++, and even now I'm not sure move semantics would kick in.

Or is there something about std::string assignment operators that makes copy operations not produce intermediate duplicates?

This solved my problem. I was trying to read a full file while ignoring EOF, and this blog solved my issue in no time. Hey Insane Blogger, can you explain, or give a reference for, the exact operations you are performing in your last method, C++?

I'm not sure what the point in your link was, but whoever wrote that code should be drawn and quartered.

A) Because they're using gets(). See: http://insanecoding.blogspot.com/2007/03/what-to-do-about-gets.html
B) Because they're using an int to store the return value from strlen(). See: http://insanecoding.blogspot.co.il/2010/03/does-anyone-understand-types-and.html
C) Because they're using the non-standard iostream.h instead of iostream.