This is another attempt to parse CSV (comma-separated values) data according to the CSV standard file format. What makes parsing CSV data so difficult is that the field and record delimiters may or may not act as delimiters: if either one appears inside a qualified field, then that character is just a character literal and NOT a functional delimiter. Along the same lines, qualifiers can be escaped and used as character literals within a qualified field.

This all makes for quite a headache. This ColdFusion user defined function (UDF), CSVToArray(), can handle all of that. It can handle embedded field (ex. comma) and record (ex. new line) delimiters as both functional delimiters and embedded characters. It does this by splitting the entire string on both delimiters and then looping over the resulting tokens. This method has the advantage of not having to examine individual characters, which should be faster, but it has the downside of also having to keep track of the individual delimiters, which can be slower.

Keeping track of the delimiters in their own array was a new approach for me. I basically stripped out all the non-delimiter characters from the original string and then got the character array of that string. This put each delimiter into its own array index. This was tremendously helpful because after each token value retrieval, I could easily check to see which delimiter came just before or just after it. While this adds overhead, without this information, I would NEVER have been able to gracefully handle embedded record delimiters.
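To make the mechanics concrete, here is a rough sketch of that split-then-merge idea in Python (an illustrative translation of the technique, not the actual ColdFusion UDF; it assumes well-formed input, a single-character field delimiter, and a newline record delimiter):

```python
import re

def trailing_quals(s, qual):
    """Length of the run of qualifier characters at the end of s."""
    n = 0
    while n < len(s) and s[len(s) - 1 - n] == qual:
        n += 1
    return n

def csv_to_rows(text, delim=",", qual='"'):
    char_class = re.escape(delim) + "\n"
    # Split on BOTH delimiters to get the raw tokens...
    tokens = re.split("[" + char_class + "]", text)
    # ...and strip out every NON-delimiter character, which leaves
    # the delimiters themselves, in order, one per split point.
    delims = list(re.sub("[^" + char_class + "]", "", text))

    rows, row, i = [], [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith(qual):
            # Qualified field: keep absorbing tokens (and the
            # delimiters between them, as character literals) until
            # the accumulated value ends on an unescaped qualifier,
            # i.e. an odd-length run of trailing qualifiers.
            buf = tok[1:]
            while trailing_quals(buf, qual) % 2 == 0:
                buf += delims[i] + tokens[i + 1]
                i += 1
            # Drop the closing qualifier; unescape doubled ones.
            value = buf[:-1].replace(qual + qual, qual)
        else:
            value = tok
        row.append(value)
        # The delimiter AFTER the token decides whether the record
        # continues (field delimiter) or ends (record delimiter).
        if i >= len(delims) or delims[i] == "\n":
            rows.append(row)
            row = []
        i += 1
    return rows
```

Fed the sample data built below, this returns five variable-length records, with the embedded line break, comma, and doubled quotes all preserved as character literals.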

To demonstrate that this works with both simple values and with embedded delimiters (both field and record), let's build a CSV value that has it all:

<!---
Build the CSV string value. This value should contain
both functional and character-literal delimiters.
--->
<cfsavecontent variable="strCSV">
Name,Nickname,Best Asset
Sarah,"""Stubbs""",Butt
Ashley,"Value with
embedded line break",Smile
Heather
"Kat",",",
</cfsavecontent>

Things to notice in the above:

- Not all fields need to be qualified.
- "Stubbs" has embedded field qualifiers (quotes).
- The third record has an embedded record delimiter (line break).
- The fourth record only has one value.
- The fifth record has an embedded field delimiter (comma).

Now, let's take that CSV and pass it to the ColdFusion UDF, CSVToArray():

<!---
Send the CSV value to the UDF. Be sure to trim the value
(leading and trailing spaces). We are going to leave the
default delimiter (,) and qualifier (").
--->
<cfset arrResult = CSVToArray(
	CSV = strCSV.Trim()
	) />

<!--- Dump out the resultant array. --->
<cfdump
	var="#arrResult#"
	label="CSV Results Array"
	/>

This gives us the following CFDump output:

Notice that all the embedded values were properly handled! How sweet-ass-sweet is that? Also notice that the results can handle variable-length records (just like an actual Excel file can). Who says all of our records have to have the same number of values?

I am not sure how well this will scale. Once we get the array of delimiters and break the CSV value up into the raw token array, the time to process should increase linearly with the size of the CSV data. I do know that this has GOT to be more performant than dealing with individual character values. At the very least, this gives me something I can integrate with my POI Utility ColdFusion component. It can always be optimized later.

The only little "bug" that I will fix at some point is that if there are NO CSV values, the UDF still returns an array consisting of one empty array (which could be considered a single CSV value that is empty). Ideally, this should return an empty results array. Oh well (easy fix, but I don't have time to worry about that).
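The fix would be a guard clause at the top of the UDF, before any parsing happens. Sketched here in Python for brevity (the trailing naive comma/newline split just stands in for the real parsing logic, which handles qualified fields):

```python
def csv_to_array(text):
    # Guard clause: treat an empty or all-whitespace input as "no
    # CSV data" and return an empty result array, instead of one
    # record holding a single empty field.
    if not text.strip():
        return []
    # (the real split-and-merge parsing would run here; this naive
    # split is only a placeholder and ignores qualified fields)
    return [line.split(",") for line in text.split("\n")]
```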

Glad you like the solution, or at least the documentation. As far as what is "normal" for processing time, I can't really say. I have never really tested this on a large file. Certainly, to deal with the embedded qualifiers and delimiters, there is a lot more parsing going on, since it has to go through almost character by character.

Maybe someone else can speak to this? My solution might not be the best for large scale operations.

I just re-wrote the entire CSV parsing algorithm. It's a fraction of the code and lightning fast. I just parsed 50,000 records in 11 seconds. I just had a Eureka moment while working on an unrelated problem. I will try to post a blog entry on it tonight or tomorrow morning and will keep you updated!