Reading and Writing CSV Files in C#

Introduction

A common requirement is to have applications share data with other programs. Although there are interfaces available to work with, for example, Microsoft Excel data files, this approach is generally complex, involves a fair amount of overhead, and requires that support libraries accompany your application.

Comma-Separated Values (CSV) Files

A much simpler way to have your application share data is by reading and writing Comma-Separated Values (CSV) files. CSV files can easily be read and written by many programs, including Microsoft Excel.

For the most part, reading and writing CSV files is trivial. As the name suggests, a CSV file is simply a plain text file that contains one or more values per line, separated by commas. Each value is a field (or column in a spreadsheet), and each line is a record (or row in a spreadsheet).

However, there is slightly more work involved. Double quotes are used to wrap values that contain commas so that the commas are not interpreted as value separators. The same is done for values that contain double quotes. In addition, two consecutive double quotes within a quoted value represent a literal double quote rather than the end of the value.
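To illustrate these rules, here is a minimal sketch of the encoding step. This is a hypothetical helper written for this discussion (the name `EncodeField` is my own), not the actual Listing 1 code:

```csharp
using System;

static class CsvEncodeExample
{
    // Quote a field if it contains a comma, a double quote, or a line
    // break, doubling any embedded double quotes; otherwise return it as-is.
    public static string EncodeField(string value)
    {
        if (value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
            return value;
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

For example, `a,b` becomes `"a,b"`, and `say "hi"` becomes `"say ""hi"""`.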

So this seems like a perfect task for a handy little C# class. Listing 1 shows my CsvFileWriter and CsvFileReader classes.

Because the .NET stream classes generally seem to be split into reading and writing, I decided to follow that pattern with my CSV class and split it into CsvFileWriter and CsvFileReader. This also simplifies the code because neither class needs to worry about which mode the file is in or protect against the user switching modes.

The writer class performs any encoding necessary, as I described above, and the reader class performs any necessary decoding.
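To make the decoding side concrete, here is a simplified sketch of parsing one line. Again, this is a hypothetical helper rather than the Listing 1 code; the optional delimiter parameter is my own addition, and quoted values that span multiple lines are not handled:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CsvDecodeExample
{
    // Parse a single CSV line into fields, honoring quoted fields and
    // doubled quotes. Quoted values spanning multiple lines are not handled.
    public static List<string> ParseLine(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        int pos = 0;
        while (pos <= line.Length)
        {
            string field;
            if (pos < line.Length && line[pos] == '"')
            {
                var sb = new StringBuilder();
                pos++;                                // skip opening quote
                while (pos < line.Length)
                {
                    if (line[pos] == '"')
                    {
                        if (pos + 1 < line.Length && line[pos + 1] == '"')
                        {
                            sb.Append('"');           // "" decodes to "
                            pos += 2;
                        }
                        else { pos++; break; }        // closing quote
                    }
                    else sb.Append(line[pos++]);
                }
                field = sb.ToString();
                while (pos < line.Length && line[pos] != delimiter)
                    pos++;                            // skip to delimiter
            }
            else
            {
                int next = line.IndexOf(delimiter, pos);
                if (next < 0) next = line.Length;
                field = line.Substring(pos, next - pos);
                pos = next;
            }
            fields.Add(field);
            pos++;                                    // step past delimiter
        }
        return fields;
    }
}
```

With the default delimiter, `ParseLine("a,\"b,c\",d")` produces the three fields `a`, `b,c`, and `d`.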

Using the code

Both classes take a CsvRow argument. The CsvRow class derives from List<string>, so it's basically just a list of strings.

When you call CsvFileWriter.WriteRow(), the row argument specifies the string values to write out. And when you call CsvFileReader.ReadRow(), the row argument returns the values that were read in.

CsvFileReader.ReadRow() also places the entire line into the CsvRow.LineText member, just in case the caller wants to inspect that.
Finally, CsvFileReader.ReadRow() returns a Boolean value that is false when no values could be read at the current line. Under normal circumstances, this would indicate the end of the file.
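Typical usage of the two classes might look like the sketch below. The stand-in classes here are deliberately simplified (no quote handling) so the example is self-contained; see Listing 1 for the real implementations, and note that the constructor signatures are my assumption:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Simplified stand-ins so the usage below is runnable; the real classes
// in Listing 1 also perform the quote encoding and decoding.
class CsvRow : List<string>
{
    public string LineText { get; set; }
}

class CsvFileWriter : StreamWriter
{
    public CsvFileWriter(string path) : base(path) { }
    public void WriteRow(CsvRow row) => WriteLine(string.Join(",", row));
}

class CsvFileReader : StreamReader
{
    public CsvFileReader(string path) : base(path) { }
    public bool ReadRow(CsvRow row)
    {
        string line = ReadLine();
        if (line == null) return false;   // end of file
        row.LineText = line;              // raw, undecoded line
        row.Clear();
        row.AddRange(line.Split(','));
        return true;
    }
}

class Program
{
    static void Main()
    {
        using (var writer = new CsvFileWriter("test.csv"))
            writer.WriteRow(new CsvRow { "one", "two", "three" });

        using (var reader = new CsvFileReader("test.csv"))
        {
            var row = new CsvRow();
            while (reader.ReadRow(row))   // false at end of file
                Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```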

Conclusion

That's about all there is to it. The classes are fairly simple so I didn't include a sample project. All the code for the classes is shown in Listing 1. Note that this code, as presented, does not handle quoted values that span multiple lines.

This code should be helpful for anyone wanting an easy way to share data with Microsoft Excel or any other program that can read or write CSV files.

About the Author

Jonathan Wood has been a software developer for more years than he cares to admit. His current focus is on using C# and ASP.NET MVC to develop website applications, and C++ and MFC to develop desktop applications. His consulting company, SoftCircuits, is known for producing various commercial and shareware products.

Having an entrepreneurial spirit, Jonathan also has a number of other online businesses that he built from scratch such as Black Belt Coder, SC Web Group and others. He also has expertise in designing brands (including graphics and logos) and search-engine optimization (SEO).

Jonathan is always willing to discuss consulting work or joint ventures with people looking to develop software or online businesses.

Comments and Discussions

I have a generic list of a point type with x and y coordinates, and I need to put those values in a CSV file. I think I need to change the main window class. How can I do that? Can I do this with this code? Any ideas?

The code is not robust for the general CSV format, e.g., as generated by MS-Excel:

a field may also contain new lines (in which case it must be a quoted field)

if the file comes from MS-Excel (or from PowerShell), then the current localization determines the field or list separator - it is not always a comma (e.g., in my de-CH localization, it is a semicolon)

In addition to that, I would have reservations about maintaining the parsing code with all the position tweaking... Why not use established parsing techniques by properly tokenizing the data and parsing the tokens?

First, I consider multi-line fields to be an extension to the CSV format - something that would be nice to implement but hardly needed to make the code useful. I've processed hundreds of CSV files generated by Excel without trouble.

Regarding your last point, I do, in effect, tokenize the data. You seem to be referring to a specific technique. If you could explain or provide a reference to what you mean specifically, I could consider what you are saying.

I consider Excel a possible source for CSV files - and what Excel can generate should be processable by a CSV parser (my claim).

In my work I am faced with localized files (I get files produced in en-UK, en-US, de-DE, de-CH, etc.), and I am faced with multi-line fields. I asked myself whether I could enhance your code to support these two features (localized separator and multi-line fields), and I came to the conclusion that I would rather rewrite the parser.

The crucial thing is the separation of the scanning from the parsing: the parsing implements 1:1 the grammar as stated above. The scanner provides the tokens and can be exchanged for any other implementation as long as the interface is satisfied (which eases testing anyway). E.g., if one needs a scanner that works on streams instead of preloading all the data at once, the scanner can be replaced by such an implementation.

Likewise, processing of the parsed record is easily replaced.
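A minimal sketch of this separation might look like the following. This is my own hypothetical illustration (the type and member names are invented), with quote handling omitted for brevity:

```csharp
using System;
using System.Collections.Generic;

enum TokenType { Field, EndOfRecord, EndOfInput }

struct Token
{
    public TokenType Type;
    public string Value;    // field text for TokenType.Field
}

interface ICsvScanner
{
    Token Next();
}

// One possible scanner: works over an in-memory string. A stream-based
// scanner could implement the same interface without touching the parser.
class StringScanner : ICsvScanner
{
    readonly string _text;
    int _pos;
    public StringScanner(string text) { _text = text; }

    public Token Next()
    {
        if (_pos >= _text.Length)
            return new Token { Type = TokenType.EndOfInput };
        if (_text[_pos] == '\n')
        {
            _pos++;
            return new Token { Type = TokenType.EndOfRecord };
        }
        int start = _pos;
        while (_pos < _text.Length && _text[_pos] != ',' && _text[_pos] != '\n')
            _pos++;
        var token = new Token
        {
            Type = TokenType.Field,
            Value = _text.Substring(start, _pos - start)
        };
        if (_pos < _text.Length && _text[_pos] == ',')
            _pos++;                       // consume delimiter
        return token;
    }
}

// The parser just folds tokens into records, independent of the scanner.
static class CsvParser
{
    public static List<List<string>> Parse(ICsvScanner scanner)
    {
        var records = new List<List<string>>();
        var record = new List<string>();
        for (Token t = scanner.Next(); t.Type != TokenType.EndOfInput; t = scanner.Next())
        {
            if (t.Type == TokenType.Field)
                record.Add(t.Value);
            else                          // EndOfRecord
            {
                records.Add(record);
                record = new List<string>();
            }
        }
        if (record.Count > 0)
            records.Add(record);
        return records;
    }
}
```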

So, to summarize, my main concern was the missing separation of scanner and parser. I hope my example code above illustrates what I meant by this.

Thanks for the detailed reply. The code you posted is certainly interesting.

I can see you are a big fan of RegEx, which I'm not so much. I find RegEx both cryptic and restrictive, and I don't consider RegEx to be an "established parsing technique". Certainly, I wouldn't expect products like language compilers to use RegEx. As a result, I prefer my "position tweaking" for processing text input.

Regarding your point about keeping the tokenizing separate from the parsing, I would tend to agree. I wrote a little language interpreter that worked that way. However, I didn't see much advantage to doing that when parsing something as simple as a CSV file. Some of the other things your code provides, such as the use of interfaces, could also provide advantages, although just not for any needs I've run into.

But, again, your code is interesting and I think it adds value to this article by offering an alternative approach.

In my experience, separating tokenizing from parsing pays off immediately in terms of maintainability and testing. I've seen a lot of code that started off "simple", like CSV parsing, and ended up ugly because some detail was overlooked and had to be hacked in somehow.

E.g. I've just noticed that I made a mistake in my grammar: the field definition is too restrictive regarding word:

So, why not take/learn that language if one is confronted with scanning/parsing (the same holds for specifying the grammar for the parser: take any variant of BNF or EBNF)?

Regex is cryptic, but C was cryptic to me too in the beginning...
Using Regex in C# admittedly has some limitations (speed in particular), but in my eyes they are outweighed by maintainability (assuming you know Regex well enough). My view is: if you intend to process huge amounts of data where speed matters (e.g., to write a C# compiler), C# is not the first choice anyway... (I would go for C++11).

Finally, for more complex languages, you should employ one of the above mentioned tools to generate a parser anyways...

By "cell lines", I meant multi-line cells. I don't see any problem with the way you're handling double quotes.
I just posted another version of your code which handles double quotes, multi-line fields, and a custom delimiter (in the French format, the delimiter is ';').

Please note that I have completely reworked my code, making it more robust and adding new functionality such as support for multi-line fields, custom delimiter and quote characters, and options for how empty lines in the input file are handled.

This regards your new code (linked above on your Black Belt Coder site). There are a number of issues I found:

1) Debug.Assert... this occurs 3 times in the code. It produces the compiler error "The name 'Debug' does not exist in the current context".

Also, I am not sure if that is the kind of code I would want to use. Can you please offer some insight into its inclusion, and/or how to get it to work properly?

2) Looking at your "Project" sample code requires creating an ID to log into the website - why? I am not creating yet another single-use login just to look at the project. Why would you go to all the trouble of creating such a lengthy, detailed page full of code, and then require a login to get the sample project... why, why, why...

3) If I do get Debug.Assert working, then I will see if this code behaves the same as the code on this CodeProject page, which actually fails when it encounters a quoted element within the line. I can show an example if you like, but I thought maybe you might have fixed it in your new version - so I want to try that first before raising an issue. Short version: the entire section between the double quotes gets dropped from the resultant text being read in.

That's it for now, I appreciate your time and code and do not intend this to sound negative in any way!

I got Debug.Assert to compile by adding "using System.Diagnostics;" to my top declarations; it might be nice to include that on the page... but it is in your sample project, so... either way. I am still not clear whether Debug.Assert should be used in working code; I don't fully understand its use here and would still appreciate your advice about it.

I did create yet another website login to get your sample code, but I still maintain that it should not be required... anyway... not a big deal, I guess.

Now that I have the new code actually compiling, I'll see if it reads the quoted elements properly, and post back here after testing!

The new code reads the quoted section with no problems...so I am not sure if you are interested in looking at the older code on this codeproject website - if you are, here is a sample data row that I am using in which it fails (but works with your new code):

You can see that each field is in quotes, and the second field contains an item embedded within quotes (ending in a double quote). The code here on the CodeProject site reads the line and returns everything but the embedded string from the second field, so the missing data is: FR_RM_4_LE10_REG_No_Ret_2.pdf

I fully understand that since the new code fixes the issue, you may not care, but this page still comes up in search results and maybe it might help someone in the future to know that there could be issues...

1. Debug is in the System.Diagnostics namespace. On newer versions of Visual Studio, all you need to do is put the caret on the symbol and press Ctrl+., but you can add the using statement manually if needed. I recommend using Debug.Assert liberally throughout your code to assert any assumptions the code makes. The assertions do not affect release builds.
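A quick illustration (my own example, not from the article's code; the method name is invented):

```csharp
using System.Diagnostics;   // Debug.Assert lives here

static class AssertExample
{
    public static int FindDelimiter(string line, int start)
    {
        // Document the method's assumptions; these checks are compiled
        // out of Release builds, so they cost nothing in production.
        Debug.Assert(line != null, "line must not be null");
        Debug.Assert(start >= 0 && start <= line.Length, "start out of range");
        return line.IndexOf(',', start);
    }
}
```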

2. The policy on Black Belt Coder is that a login is required to download any of the code samples. This is exactly the same policy as on Code Project. Without it, anyone could link to the content as a download from their own site. I'll probably post the code on Code Project some day. As for why I posted most of the code in the article: I thought it would be helpful for anyone trying to understand the article.

3. I'm not sure I know what you're referring to there. I did make a change so it handles quoted text within a field more like Excel does; however, this is invalid CSV format, as all such fields should themselves be enclosed in quotes.

I replied to my own message without refreshing my browser and had not seen your reply. For convenience, I will repeat my response here... sorry about that:

Hi Jonathan,

The new code reads the quoted section with no problems...so I am not sure if you are interested in looking at the older code on this codeproject website - if you are, here is a sample data row that I am using in which it fails (but works with your new code):

You can see that each field is in quotes, and the second field contains an item embedded within quotes (ending in a double quote). The code here on the CodeProject site reads the line and returns everything but the embedded string from the second field, so the missing data is: FR_RM_4_LE10_REG_No_Ret_2.pdf

I fully understand that since the new code fixes the issue, you may not care, but this page still comes up in search results and maybe it might help someone in the future to know that there could be issues...

Yes, as discussed I changed my code to handle this more like Excel does. But, again, I do not consider this to be valid CSV format, and if you wrote this data with my code, it would be formatted differently.

Good to know I addressed someone else's concerns, even though I did not know this was a concern for anyone.