C# Extracting CSV data into Linq and a Dictionary using Regular Expressions

Mar 182010

I had written a post a while back which detailed a regular expression pattern used by the .Net regex parser which parsed a Comma Separated Value file, or a CSV file for short. Upon looking at the pattern I came to realize that the pattern didn’t work for all situations. So I have created a new pattern which will extract all items from the CSV data into into a dynamic anonymous Linq entity. Following that example I will show one how to use the same Linq entity to put that CSV data into a dictionary, a hash table, where the key of entry is the first column’s data.

Regular Expression Pattern

Pattern needs Regex Options. Those options for this article are defined both in the pattern and the call to the regular expression parser; normalcy only its done only once.

Pattern commented so IgnorePatternWhitespace option is needed. Note that option does not affect the regex parsing of the data.

Multiline option needed so ^ matches the beginning of each line and $ matches the end, after the \r\n.

Regular Expression if condition is used to test if the indivudal column data is enclosed in quotes. If it finds a quote it consumes the quotes but does not pass them on to the final data processing.

Each line will correspond to one match

All data put into named match capture called Column; hence the match will have all line values in the capture collection named Column.

(?xm) # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
# and tell the compiler this is multiline (m),
# In Multiline the ^ matches each start line and $ is each EOL
# -Pattern Start-
^( # Start at the beginning of the line always
(?![\r\n]|$) # Stop the match if EOL or EOF found.
(?([\x27\x22]) # Regex If to check for single/double quotes
(?:[\x27\x22]) # \\x27\\x22 are single/double quotes
(?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
(?:[\x27\x22])
| # or (else) part of If when Not within quotes
(?<Column>[^,\r\n]*) # Not within quotes, but put it in the column
) # End of Pattern OR
(?:,?) # Either a comma or EOL/EOF
)+ # 1 or more columns of data.

Regex to Linq

Here is the code which will enumerate over each match and add the contents of the match capture collection into a dynamic linq entity. Notes:

tThe code below uses the regex pattern mentioned above but does not show it for brevity.

The regex options are set twice for example. One only needs to set them once.

This is great. However, there is a slight problem with it. If the last column in the CSV is empty then it doesn’t include that column though it does if the empty column is not the last column like in your current example. This causes a problem if the code is expecting a fixed number of columns. I’ve been beating my head against the wall trying to figure out a fix for it but I’m really not very good at regex. Any ideas?

That is the achilles’ heel of that pattern. What I recommend that you do is catch that scenario by checking the min and max lengths of the return. If you have a scenario where they don’t match…simply add the missing null to the list. See below to check for the condition. HTH