Setting Things Up:

I wrote a C# console application to test many different techniques for reading a text file and processing the lines contained therein. This isn't an exhaustive list, but I believe it covers how it's done most of the time.

The code was written in Visual Studio 2012 targeting .NET Framework 4.5 (x64). The source code is available at the end of this blog post so you can benchmark it on your own system if you wish.

In a nutshell, the code does the following:

Generates a GUID

Creates a string object with that GUID repeated either 5, 10, or 25 times

Writes the string object to a local text file either 429,496 or 214,748 times.
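The three setup steps above might look something like the following sketch (the file name, repeat count, and line count here are placeholders, not the author's actual harness, which is in the linked source):

```csharp
using System;
using System.IO;
using System.Text;

class Generator
{
    static void Main()
    {
        // Step 1: generate a GUID.
        string g = Guid.NewGuid().ToString();

        // Step 2: build one line by repeating the GUID (5, 10, or 25 times in the tests).
        var sb = new StringBuilder();
        for (int i = 0; i < 5; i++)
        {
            sb.Append(g);
        }
        string line = sb.ToString();

        // Step 3: write that line to a local text file many times
        // (429,496 or 214,748 in the tests; a small count here for illustration).
        string fileName = "test.txt"; // placeholder path
        using (StreamWriter sw = File.CreateText(fileName))
        {
            for (int i = 0; i < 1000; i++)
            {
                sw.WriteLine(line);
            }
        }
    }
}
```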

It then reads the text file using various techniques, identified below, clearing all objects and forcing a garbage collection after each run to make sure each run starts with fresh resources:


T1

Reading the entire file into a single string using the StreamReader ReadToEnd() method, then process the entire string.

C#

using (StreamReader sr = File.OpenText(fileName))
{
    string s = sr.ReadToEnd();
    TestReadingAndProcessingLinesFromFile_DoStuff(s);
}

T2

Reading the entire file into a single StringBuilder object using the ReadToEnd() method, then process the entire string.

C#

using (StreamReader sr = File.OpenText(fileName))
{
    StringBuilder sb = new StringBuilder();
    sb.Append(sr.ReadToEnd());
    TestReadingAndProcessingLinesFromFile_DoStuff(sb.ToString());
}

T3

Reading each line into a string, and process line by line.

C#

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        TestReadingAndProcessingLinesFromFile_DoStuff(s);
    }
}

T4

Reading each line into a string using a BufferedStream, and process line by line.

C#

using (FileStream fs = File.Open(fileName, .....))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string s;
    while ((s = sr.ReadLine()) != null)
    {
        TestReadingAndProcessingLinesFromFile_DoStuff(s);
    }
}

T5

Reading each line into a string using a BufferedStream with a preset buffer size equal to the size of the biggest line, and process line by line.
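The code snippet for T5 appears to have been lost in this copy of the post. Based on T4, it was presumably the same loop with an explicit buffer size passed to the BufferedStream constructor. A sketch, where `maxLineSize` (the length of the biggest line) and the stub processing method are assumptions standing in for the author's code:

```csharp
using System.IO;

class T5Sketch
{
    // Stub standing in for the author's processing method; it just counts lines.
    public static int LinesProcessed = 0;
    static void TestReadingAndProcessingLinesFromFile_DoStuff(string s)
    {
        LinesProcessed++;
    }

    public static void Run(string fileName, int maxLineSize)
    {
        using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
        using (BufferedStream bs = new BufferedStream(fs, maxLineSize)) // preset buffer size
        using (StreamReader sr = new StreamReader(bs))
        {
            string s;
            while ((s = sr.ReadLine()) != null)
            {
                TestReadingAndProcessingLinesFromFile_DoStuff(s);
            }
        }
    }
}
```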

T6

Reading each line into a StringBuilder object, and process line by line.

C#

using (StreamReader sr = File.OpenText(fileName))
{
    StringBuilder sb = new StringBuilder();
    while (sb.Append(sr.ReadLine()).Length > 0)
    {
        TestReadingAndProcessingLinesFromFile_DoStuff(sb.ToString());
        sb.Clear();
    }
}

T7

Reading each line into a StringBuilder object with its size preset and equal to the size of the biggest line, and process line by line.

C#

using (StreamReader sr = File.OpenText(fileName))
{
    StringBuilder sb = new StringBuilder(g.Length); // g is the biggest line
    while (sb.Append(sr.ReadLine()).Length > 0)
    {
        TestReadingAndProcessingLinesFromFile_DoStuff(sb.ToString());
        sb.Clear();
    }
}

T8

Reading each line into a pre-allocated string array object, then run a Parallel.For loop to process all the lines in parallel.

C#

AllLines = new string[MAX]; // only allocate memory here
using (StreamReader sr = File.OpenText(fileName))
{
    int x = 0;
    while (!sr.EndOfStream)
    {
        AllLines[x] = sr.ReadLine();
        x += 1;
    }
} // CLOSE THE FILE because we are now DONE with it.
Parallel.For(0, AllLines.Length, x =>
{
    TestReadingAndProcessingLinesFromFile_DoStuff(AllLines[x]);
});

T9

Reading the entire file into a string array object using the .Net ReadAllLines() method, then run a Parallel.For loop to process all the lines in parallel.

C#

AllLines = new string[MAX]; // only allocate memory here
AllLines = File.ReadAllLines(fileName); // note: ReadAllLines returns a new array,
                                        // so the pre-allocation above is discarded
Parallel.For(0, AllLines.Length, x =>
{
    TestReadingAndProcessingLinesFromFile_DoStuff(AllLines[x]);
});

Each line in the file is processed by being split into a string array containing its individual GUIDs. Each string is then parsed character by character to determine whether it's a number; if so, a mathematical calculation is performed based on it.
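A sketch of what such a per-line processing function might look like (the real TestReadingAndProcessingLinesFromFile_DoStuff is in the linked source; the separator, the doubling calculation, and the accumulator here are assumptions):

```csharp
using System;

class Processor
{
    // Hypothetical sketch of the per-line work described above: split the line
    // into its individual GUIDs, then scan each character and do a small
    // calculation for every digit found.
    public static long DoStuff(string line)
    {
        long total = 0;
        string[] guids = line.Split(','); // assumes comma-separated GUIDs
        foreach (string g in guids)
        {
            foreach (char c in g)
            {
                if (char.IsDigit(c))
                {
                    total += (c - '0') * 2; // arbitrary stand-in math on each digit
                }
            }
        }
        return total;
    }
}
```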

The generated file is then deleted.

The test machine was a Windows 7 64-bit box with 16 GB of memory and a purely mechanical 7,200 RPM drive; I didn't want the caching effects a "hybrid" drive or mSATA card might have on the system to taint the results.

The trial was run once, five minutes after the machine was up and running from a cold start. This was to eliminate any background processes starting up which might skew the test. There was no reason to run the test multiple times because, as you'll see, there are clear winners and losers.

The Runs:

Before starting, my hypothesis was that the techniques that read the entire file into an array and then use a parallel for loop to process all the lines would win hands down.

Let's see what happened on my machine. Green cells indicate the winner(s) for that run; yellow cells the runners-up.

All times are indicated in minutes:seconds.milliseconds format. Lower numbers indicate faster performance.

Sha-Bam! Parallel Processing Dominates!

Looking at the results, there is no clear-cut winner among techniques T1–T7. T8 and T9, which implemented the parallel processing techniques, completely dominated: they always finished in less than a third of the time taken by any technique that processed line by line.

The surprise for me came when each line was 10 GUIDs long. From that point forward, the built-in File.ReadAllLines() method started performing more slowly. This wasn't as evident when just plain reading a file. It suggests that if you really want to micro-optimize your code for speed, you should pre-allocate the size of a string array whenever possible.

In Summary:

On my system, unless someone spots a flaw in my test code, reading an entire file into an array and then processing it line by line with a parallel loop proved significantly faster than reading a line, then processing that line. Unfortunately, I still see a lot of C# programmers and C# code running on .NET 4 (or above) using the age-old "read a line, process the line, repeat until end of file" technique instead of "read all the lines into memory, then process". The performance difference is so great that it even makes up for the time lost when just reading the file.

And this test code is only doing mathematical calculations. The difference in performance may be even greater if processing your data involves other work as well, such as running a database query.

Obviously you should test on your system before micro-optimizing this functionality for your .Net application.

Otherwise, thanks to .NET 4, parallel processing is now easily accomplished in C#. It's time to break out of old patterns and start taking advantage of the power available to us.