Regular Expressions Make CSV Parsing In ColdFusion So Much Easier (And Faster)

Today, I was working on some data parsing when it hit me like a bolt of lightening! Why not just use Regular Expressions to parse CSV (comma separated values) data files? Those of you who have seen my previous attempts at parsing CSV data files using ColdFusion will know that handling embedded field qualifiers and data delimiters made it a character-by-character parsing problem, since you can no longer treat the data as a list of lists (delimited by both commas and new lines). It was a huge amount of code doing a huge amount of work.

But then, as I was working today, I suddenly realized that handling embedded field qualifiers is exactly like handling embedded quotes when parsing HTML tag attribute values; it's just a pattern. So, what is the pattern of a CSV file? At its highest level, it's a field value followed by an optional delimiter. I say "optional" because the last data field in the file will not be succeeded by a delimiter. The field value itself is an optionally qualified value containing zero or more characters, and the characters allowed in the field are determined by whether the data is qualified.

The regular expression I came up with was so short it made me giggle:

("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?

This ColdFusion-compatible regular expression captures two groups; the first group is the field value (zero or more characters) and the second is the optional trailing delimiter. If this is totally unreadable (as most regular expressions are), I have converted it to a verbose regular expression (?x) with my typical amount of commenting so that you can really see how the pattern of the data field is matched:

<!--- Create verbose regular expression. --->
<cfsavecontent variable="strRegex">(?x)

	## Every CSV "Group" consists of a field value
	## followed by an optional delimiter. This delimiter,
	## if it exists, will be either a field delimiter or
	## a line delimiter. If no delimiter exists, then we
	## are at the end of the file.

	## Let's get the field value. We need to consider two
	## cases of field values: either the value is qualified
	## or it is not. According to the standard, values that
	## have an embedded field or line delimiter or qualifier
	## MUST BE contained in a qualified field.
	(
		## Because qualified fields are the exception case,
		## we want to check for those first. A qualified field
		## can contain non-quotes AND escaped quotes.
		"(?:[^"]|"")*"

		| ## OR

		## If we did not match the qualified field token, then
		## we need to check for the non-qualified field token,
		## which can be zero or more characters NOT consisting
		## of any qualifier or delimiter.
		[^",\r\n]*
	)

	## Now that we have captured the zero+ length field value,
	## we can get the token. Remember, if we are at the end
	## of the file, the token will not exist and, therefore,
	## this group must be optional (?).
	(
		## Comma.
		,

		## OR return with optional newline.
		|\r\n?

		## OR just a new line (I include this option for safer
		## handling of text values across operating systems,
		## but it might not be necessary).
		|\n
	)?
</cfsavecontent>
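Since ColdFusion delegates matching to Java's regex engine anyway, the same two-group pattern can be exercised directly in Java. Here's a minimal sketch (the class name and sample data are mine, not from the article) showing how group(1) carries the field value and group(2) the optional delimiter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvTokenDemo {
    // The same two-group pattern: group(1) = field value, group(2) = optional delimiter.
    static final Pattern TOKEN =
        Pattern.compile("(\"(?:[^\"]|\"\")*\"|[^\",\r\n]*)(,|\r\n?|\n)?");

    public static void main(String[] args) {
        String csv = "Kim,\"Kim \"\"Hot Legs\"\" Smith\",24";
        Matcher m = TOKEN.matcher(csv);
        List<String> fields = new ArrayList<>();
        while (m.find()) {
            fields.add(m.group(1));        // raw field value, qualifiers still attached
            if (m.group(2) == null) break; // no trailing delimiter: end of input
        }
        System.out.println(fields);
    }
}
```

Note that breaking when group(2) is null also avoids the zero-length match the optional-everything pattern would otherwise produce at the end of the input.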

So now that you see how simple the pattern is, let's use it to start parsing our CSV data in ColdFusion. Luckily, from within ColdFusion, we have access to the Java Pattern and the Java Pattern Matcher which make our lives almost too easy when it comes to iterating over a string. As we iterate over the string, we are going to store our CSV values in an array of arrays in which each sub array will represent a row from the CSV data file.

<!---
	Save CSV data values. Here, we are creating a CSV data
	value that has both qualified and non-qualified field
	values, populated and empty field values, and embedded
	field qualifiers and field/line delimiters.
--->
<cfsavecontent variable="strCSV">
"Name","Nick Name","Age","Hair Color"
Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
"Sarah Vivenz, II","Stubs",27,"Brunette"
"Kit Williams",Kitty,34,Blonde,,,
"Even
Values With
Embedded Line Breaks"
</cfsavecontent>

<!--- Trim data values. --->
<cfset strCSV = Trim( strCSV ) />

<!---
	Get the regular expression to match the tokens. I have
	put the field value on the first line and the delimiters
	on the second line for easier reading.
--->
<cfset strRegEx = (
	"(""(?:[^""]|"""")*""|[^"",\r\n]*)" &
	"(,|\r\n?|\n)?"
	) />

<!---
	Create a compiled Java regular expression pattern object
	based on the pattern above.
--->
<cfset objPattern = CreateObject(
	"java",
	"java.util.regex.Pattern"
	).Compile(
		JavaCast( "string", strRegEx )
		) />

<!---
	Get the pattern matcher for our target text (the CSV data).
	This will allow us to iterate over all the data fields.
--->
<cfset objMatcher = objPattern.Matcher(
	JavaCast( "string", strCSV )
	) />

<!---
	Create an array to hold the CSV data. We are going
	to create an array of arrays in which each nested
	array represents a row in the CSV data file.
--->
<cfset arrData = ArrayNew( 1 ) />

<!--- Start off with a new array for the first row of data. --->
<cfset ArrayAppend( arrData, ArrayNew( 1 ) ) />

<!---
	Here's where the magic takes place; we are going
	to use the Java pattern matcher to iterate over each
	of the CSV data fields using the regular expression
	we defined above.

	Each match will have at least the field value and
	possibly an optional trailing delimiter.
--->
<cfloop condition="objMatcher.Find()">

	<!--- Get the field token value. --->
	<cfset REQUEST.Value = objMatcher.Group(
		JavaCast( "int", 1 )
		) />

	<!--- Remove the field qualifiers (if any). --->
	<cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
		JavaCast( "string", "^""|""$" ),
		JavaCast( "string", "" )
		) />

	<!--- Unescape embedded qualifiers (if any). --->
	<cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
		JavaCast( "string", "(""){2}" ),
		JavaCast( "string", "$1" )
		) />

	<!--- Add the field value to the row array. --->
	<cfset ArrayAppend(
		arrData[ ArrayLen( arrData ) ],
		REQUEST.Value
		) />

	<!---
		Get the delimiter. If no delimiter group was matched,
		this will destroy the variable in the REQUEST scope.
	--->
	<cfset REQUEST.Delimiter = objMatcher.Group(
		JavaCast( "int", 2 )
		) />

	<!--- Check for the delimiter. --->
	<cfif StructKeyExists( REQUEST, "Delimiter" )>

		<!---
			Check to see if we need to start a new array to
			hold the next row of data. We need to do this if
			the delimiter we just found is NOT a field
			delimiter.
		--->
		<cfif (REQUEST.Delimiter NEQ ",")>

			<!--- Start a new row data array. --->
			<cfset ArrayAppend(
				arrData,
				ArrayNew( 1 )
				) />

		</cfif>

	<cfelse>

		<!---
			If there is no delimiter, then we are done parsing
			the CSV file data. Break out rather than just
			ending the loop to make sure we don't get any
			extra data.
		--->
		<cfbreak />

	</cfif>

</cfloop>

<!--- Dump out the CSV data array. --->
<cfdump
	var="#arrData#"
	label="CSV File Data"
	/>

There's a lot of commenting going on there (typical me!), but if you filter all that out, this is only a few lines of code. That's how easy regular expressions make our lives! Compare that with any of my previous ColdFusion CSV parsing solutions to see that this one is a small fraction of the size. And, the best part is, it works! Running the above code, the CFDump output shows the array of row arrays parsed exactly as expected.

It works on a small file, sure; that's a piece of cake. But what about performance? Sometimes we (ok, maybe just me) get so sucked into how sexy regular expressions are that we lose sight of the big picture and end up going down a path that is less performant in the long run. To test this, I am going to create a beefy CSV file with this code:
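The original CFML generator script isn't reproduced in this excerpt. As a rough stand-in (the file name, the `row()` helper, and the row contents below are all invented for illustration and won't reproduce the exact 2MB file), a Java equivalent might look like:

```java
import java.io.FileWriter;
import java.io.IOException;

public class MakeTestCsv {
    // Build one sample CSV row with a mix of qualified and non-qualified
    // values, including an escaped embedded qualifier. All values invented.
    static String row(int i) {
        return "\"Name " + i + "\",\"Nick \"\"Speedy\"\" " + i + "\","
            + (20 + i % 50) + ",Brunette\r\n";
    }

    public static void main(String[] args) throws IOException {
        // Write 10,000 rows to a test file.
        try (FileWriter out = new FileWriter("test.csv")) {
            for (int i = 1; i <= 10000; i++) {
                out.write(row(i));
            }
        }
    }
}
```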

This creates a 10,000 line CSV file with over 2,150,000 characters. This file comes out to be just over 2 megabytes in size. Now, to make sure I am not messing with the time trials, I am running this script separately from the parsing code.

Now for the test! I ran the current version, which employs regular expressions to do the parsing, against my previous version, which checks character by character.

Previous CSV Parsing Algorithm

Average parse time: 100,000 ms (100 seconds)

Regular Expression CSV Parsing Algorithm

Average parse time: 11,000 ms (11 seconds)

Holy Cow! The regular expression solution is about 10 times faster than the previous solution! And, this was no rinky-dinky example; this was a seriously large CSV file (at least in my world).

Just for fun, I regenerated the above test CSV file, except this time, I created a 50,000 record CSV file. This file has 10,750,000 characters in it and comes out at well over 10 megabytes. I didn't even want to bother running this in the old solution. However, I am very pleased to say that all 50,000 records were successfully parsed in just 57,000 ms (57 seconds)! That's pretty sweet-ass-sweet.

Let's stop for a second and just look at the performance scaling. The first run went 10,000 records in 11,000 ms. That comes out to be 909 records per second. The second run went 50,000 records in 57,000 ms. That comes out to be 877 records per second. As the data set grows, this thing is just gonna EAT UP MEMORY (creating this massive array in RAM), but as you can see, the efficiency of the algorithm is fairly constant. This is some awesome news.
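One way to keep that memory growth in check is to hand each completed row off for processing and discard it, rather than accumulating the whole array of arrays. Here's a sketch of that idea in Java (the `parse`/`onRow` callback shape is my own, not from the article; it still holds the input string, but the result no longer grows with the row count):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingCsv {
    static final Pattern TOKEN =
        Pattern.compile("(\"(?:[^\"]|\"\")*\"|[^\",\r\n]*)(,|\r\n?|\n)?");

    // Hand each completed row to a callback and discard it, so the parsed
    // data never has to sit in memory all at once.
    static void parse(String csv, Consumer<List<String>> onRow) {
        Matcher m = TOKEN.matcher(csv);
        List<String> row = new ArrayList<>();
        while (m.find()) {
            // Strip outer qualifiers, then unescape embedded ones.
            row.add(m.group(1).replaceAll("^\"|\"$", "").replace("\"\"", "\""));
            String delim = m.group(2);
            if (delim == null || !delim.equals(",")) {
                onRow.accept(row);       // row is complete; hand it off
                row = new ArrayList<>();
                if (delim == null) break; // end of input
            }
        }
    }

    public static void main(String[] args) {
        parse("a,b\r\nc,d", row -> System.out.println(row));
    }
}
```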

Reader Comments

Secondly, I have a question that will either mean I'm starting to understand Regular Expressions or... well, not.

It looks to me like you've got two groupings that create unused backreferences, namely ("(?:[^"]|"")*"|[^",\r\n]*) and (,|\r\n?|\n). If I understand this correctly, wouldn't (?:"(?:[^"]|"")*"|[^",\r\n]*)(?:,|\r\n?|\n) give better performance, because you're not creating unused backreferences?

True, the groups do create backreferences that can be used within the current regular expression. However, capturing makes it much more convenient to grab the matched text for each part of the pattern.

If none of the groups were capturing, I could still grab the whole match using:

objMatcher.Group()

However, it would not be as easy to figure out where the field value ends and the optional delimiter begins. By allowing the field and the delimiter to be captured via groups, it becomes much more convenient to use:

objMatcher.Group( 1 )

... to get the field value, and

objMatcher.Group( 2 )

... to get the optional delimiter.

So yes, it might be faster for the pattern not to capture, but overall, capturing the groups makes the algorithm easier to code and to understand.
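The trade-off is easy to see side by side. A minimal Java sketch (sample input is mine) comparing the capturing and non-capturing versions of the pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    public static void main(String[] args) {
        // Capturing version: field and delimiter fall out of group(1)/group(2).
        Pattern capturing =
            Pattern.compile("(\"(?:[^\"]|\"\")*\"|[^\",\r\n]*)(,|\r\n?|\n)?");
        Matcher m = capturing.matcher("age,24");
        m.find();
        System.out.println(m.group(1) + " | " + m.group(2)); // age | ,

        // Non-capturing version: group() returns field AND delimiter fused
        // together, so we would have to split them apart ourselves.
        Pattern nonCapturing =
            Pattern.compile("(?:\"(?:[^\"]|\"\")*\"|[^\",\r\n]*)(?:,|\r\n?|\n)?");
        Matcher n = nonCapturing.matcher("age,24");
        n.find();
        System.out.println(n.group()); // age,
    }
}
```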

Yeah, a real Java package is probably going to outperform anything that I write. But keep in mind that mine runs basic ColdFusion without any additions. I assume that for the CSVParser to work, I would have to add a Java package to the ColdFusion install. This might not be a problem for most people, but I do not have access to do that sort of stuff.

But regardless, thanks for testing it up against the "big boys". Good to see where I fall in line :)

I do a fair amount, well, in fact, I do a large amount of CSV data parsing through a bunch of web services and things like that, and this is a very interesting topic for me.

Until now, I've pretty much cut CF out of the loop entirely, as I don't really need to do any real 'manipulation' on the data; I just want to get it parsed into my database. So in the past I've always used a DTS package for SQL Server to do this for me, and more recently, now that I'm running 2k5, I've just opted for a stored proc that receives a string for the file path of the CSV; it then does a BULK LOAD on the file, which seems to work very nicely.

Am I still handling this in the best fashion? Or would one of these Java classes mixed up with an insert-type query be my best bet?

I'm still betting that SQL handles this stuff infinitely faster than CF.

While the above is no longer ColdFusion-compatible, I assume that's not a big deal since you're using Java's regex engine anyway. The somewhat more advanced features it uses include possessive quantifiers, an atomic group, Jeffrey Friedl's "unrolling the loop" pattern, and Java's "\G" (just as an extra assurance that the regex's bump-along mechanism always puts us where we want to be). I've also used two different sets of parentheses to avoid having to remove the outer quotes from quoted values in post-processing. Instead, we can just check if a value was captured to backreference 2, and if so, use it, otherwise use backreference 3.
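The commenter's full pattern isn't reproduced in this excerpt, but the "unrolling the loop" idea for the quoted-field portion can be sketched on its own (this illustration, including the possessive quantifiers, is mine and is the classic idiom from Friedl's book, not the commenter's exact regex):

```java
import java.util.regex.Pattern;

public class UnrolledDemo {
    public static void main(String[] args) {
        // Naive quoted-field pattern: an alternation tried on every character.
        Pattern naive = Pattern.compile("\"(?:[^\"]|\"\")*\"");

        // "Unrolled" equivalent: gulp runs of non-quotes possessively, and
        // only handle the escaped-quote case between those runs.
        Pattern unrolled = Pattern.compile("\"[^\"]*+(?:\"\"[^\"]*+)*+\"");

        String field = "\"Kim \"\"Hot Legs\"\" Smith\"";
        System.out.println(naive.matcher(field).matches());    // true
        System.out.println(unrolled.matcher(field).matches()); // true
    }
}
```

Both accept the same language; the unrolled, possessive form simply eliminates the backtracking states the naive alternation keeps around.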

One other thing I would recommend after a brief lookover is changing the replacement of (""){2} with $1 to simply replacing "" with ". Also, if you're going to use a regex to do that, compile the regex outside the loop.
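Both suggestions are easy to sketch in Java (class and method names here are mine): replace the literal `""` with `"` directly, using a pattern compiled once outside the loop:

```java
import java.util.regex.Pattern;

public class UnescapeDemo {
    // Compile once, outside the parsing loop, as the commenter suggests.
    static final Pattern ESCAPED_QUOTE = Pattern.compile("\"\"");

    // Turn each embedded "" back into a single " character.
    static String unescape(String field) {
        return ESCAPED_QUOTE.matcher(field).replaceAll("\"");
    }

    public static void main(String[] args) {
        System.out.println(unescape("Kim \"\"Hot Legs\"\" Smith"));
    }
}
```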

I needed a fast solution to import a CSV list of countries and found your UDF, but I did not manage to implement it.

Maybe it's very simple to call a UDF. I placed it in the same folder from where I called it, but no luck. I searched the Adobe Knowledge Base to find out exactly where to put it, but ran out of time, so I skipped it and instead used http://sourceforge.net/projects/csvtosql; then I also had a SQL dump file for the MySQL server.

This bit of code saved me a lot of work while battling with a 68000 line CSV file. I did modify it slightly as I was reading in the file line by line, so only needed the one dimensional array that would be replaced come the next line of the file. Certainly saved on memory usage.

That's an interesting question. CSV is a "standard"; if someone were to deviate from the standard, I am not sure what would happen. The regex might just not match, or it might find things in the incorrect rows. Really, this depends on valid data (just as your compiler depends on valid code).

I know there are some very powerful database-oriented ways to import the CSV data, but I don't have much experience with that. I use this kind of method and it seems to work well, but I do more exporting than importing.

You may want to have a look at my blog post on my battles with extremely large CSV files and memory leakage (all due to components, cfquery, and cfqueryparam usage, not Ben's CSV parsing routine): http://bit.ly/ZknIx

As far as validation is concerned, I have to take an all or nothing approach. Everything was wrapped in a cftransaction / cftry with a rollback if there were any errors. Any data which wasn't valid (e.g. date, barcode) would throw an error.

For minimum memory usage, use a stored procedure, read the CSV line-by-line (you can easily modify the CSV regex to do so), avoid CFQueryParam if possible, use CFQuery for calling the stored proc, also try to buffer your SQL into an array and execute it every 50 CSV rows (tweak this value for best performance).
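The buffer-and-flush idea can be sketched independently of any database driver. A minimal Java illustration (the class, field names, and batch size of 50 are mine, with the flush left as a stub so the sketch stays self-contained):

```java
import java.util.ArrayList;
import java.util.List;

public class SqlBatcher {
    static final int BATCH_SIZE = 50; // tweak per the commenter's advice

    final List<String> buffer = new ArrayList<>();
    int flushes = 0;

    // In a real app this would send the buffered statements in one CFQuery /
    // JDBC call; here it just counts flushes so the sketch is runnable.
    void flush() {
        if (!buffer.isEmpty()) {
            flushes++;
            buffer.clear();
        }
    }

    void add(String insertSql) {
        buffer.add(insertSql);
        if (buffer.size() >= BATCH_SIZE) flush();
    }

    public static void main(String[] args) {
        SqlBatcher b = new SqlBatcher();
        for (int i = 0; i < 120; i++) b.add("INSERT INTO t VALUES (" + i + ")");
        b.flush(); // don't forget the final partial batch
        System.out.println(b.flushes); // 3 batches: 50 + 50 + 20
    }
}
```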

As for best practices, I'm not sure of many but I'd say you're better off using the database to do the CSV importing if possible. Maybe handling just validation in CF and then passing it over to the Database for the actual import process.

How does a database-csv-import work? Do you basically set up a datasource to a given file name? And then have CF place the new CSV in that file... which is how I think using an Excel file as a datasource works (although I haven't done that in years).

Well for MS SQL Server 2000+ there is "BULK INSERT". To use it you'd have to validate the CSV file with ColdFusion first, copy it to a drive on the database server and then via CF just use the CFQuery tag to execute the bulk insert SQL.

The only problem I can see with this is that it doesn't comply with CSV standards and only works with a single delimiter: no support for double quotes and embedded commas. So it might need the CSV pre-processed to replace the comma delimiter with a control character and remove the double quotes where applicable.

Although I have seen something mentioned about a tool called bcp and providing it with file format information. But that would have to use a CFExecute :(

4,000 lines doesn't seem like too much. I guess string parsing takes up more memory than I assume it does. I am not sure what to tell you about that. I suppose you could try using a buffered input stream and parse the file a bit at a time. Definitely more complicated that way, though.

Thanks. Before I try that, this is my local computer, Railo on Tomcat - can anyone explain how to check and increase the memory for it? Maybe it's too small but I'm not familiar with managing either or tweaking any Java settings.

Glad the RAM increase worked; sorry that the algorithm was chewing up so much memory.

As far as the comments at the top of the copy-n-paste file go, the problem right now is that not all of the code files are actually ColdFusion (some are HTML, some JavaScript, some SQL, etc.). As such, I went with a "pseudo comment" since there is nothing in the code right now that differentiates between the types of code samples.

I have this working locally on Railo, but it doesn't work onsite with BD JX7 (7, not 7.01 or 7.1). The array it outputs is very odd: almost entirely blank, then one value in an array after 46 rows, and some other odd ones lower down.

(Btw, had to move the comments just after line 123 outside the function or BD says badly formatted.)

If it helps, it outputs an array for every row correctly, and the first 16 have the correct number of columns; then it drops to one column short and later is correct again, but with no discernible reason why, as columns in the CSV are missing values throughout, for example.

And throughout, they are almost entirely blank: "[empty string]". Even the first row of titles shows up as blanks. Then, oddly, there is a value in the 2nd column every 40-50+ rows or so.

Hmm, that sounds odd. Regex is something that should be pretty universal, at least amongst the major "flavors". Unfortunately, I know nothing about BlueDragon, so I can't offer any insight on why it would work in Railo and not in BD. Is it possible that BD uses a .NET regex library while Railo / CF use the Java regex library? I don't really know. Sorry.

That is interesting. I am not sure why some of the comments would be breaking BD. I used to put comments IN actual methods calls. I think that even killed ColdFusion *some* of the time. But, I stopped doing that.

That's awesome to hear. Sorry for the long response delay - some of these things just get lost. You ever go to BFusion / BFLEX? Not sure if I am gonna be able to make it this year (probably depends on how much Simon Free will hassle me ;)).

Great article, but I'm shocked at how many smart people don't know how to spell lightning. Nothing hit you like a bolt of lightening. Lightening is what you do when you want to make your teeth brighter... you lighten them. The electricity in the sky during a thunderstorm is lightning. No "E" in the word.

Hello. I would like to use ColdFusion to run the Java parsing routine on this web site:

http://www.amugofjava.me.uk/2010/02/parsing-sharepoint-metadata.html

How do I do this in ColdFusion? I tried plugging this code directly into cfscript, but no go. I'm looking for ideas on how to parse SharePoint ows_MetaInfo data into discrete name/value pairs. Here is an example of some metadata I need to parse:

As usual, I finally figured it out on my own, but this thread provided some key info. Anyway, for anyone who needs to parse through SharePoint's "ows_MetaInfo" field when you're calling getlistitems, here is the code that gave me the name/value pairs: