Parsing CSV Data Using ColdFusion

As part of my exploration of writing, reading, and creating Microsoft Excel documents using ColdFusion, I have come across the need to parse comma-separated-value (CSV) data files. While this seems at first like a relatively simple task, I soon found out that it was ANYTHING but simple. It's one thing to worry about lists (for which ColdFusion is the bomb-diggity), but it's an entirely other thing to worry about lists that have field qualifiers, escaped qualifiers, escaped qualifiers that might be empty fields, and non-qualified field values all rolled into one.

I tried looking it up in Google but could not find any satisfactory algorithms (translates to: code that I could understand). Everything on CSV seems to be in Java and none of the stuff on CFLib.org seems to comply with the range of CSV values (especially qualified fields). So, in typical blood-and-guts fashion, I sat down and tried to write my own algorithm. This proved to be easy at first until I found out that my approach was highly flawed. Over the weekend, I went through about three different implementations of the algorithm before I came up with something that seemed to work satisfactorily.

It has to evaluate one character at a time, which probably won't scale or perform nicely. I would have liked to harness the power of CFHttp to convert CSV files to queries, but I could not get CFHttp to work on the LOCAL file system (i.e. a URL that begins with "file:"). If anyone knows of a great way to do this, please let me know. I suppose that I could have written a temporary file to a public folder, performed a CFHttp to it, and then deleted it, but that just felt a bit "hacky." However, in the end that might just prove to be the way to go.

So anyway, this is what I have come up with. It is a function that takes either a chunk of CSV data or a file path to a CSV data file (text file) and converts it to an array of arrays. It assumes that each record is separated by a return character followed optionally by a new line. Not sure if that is cross system compliant, but heck, this is my first attempt:

<cffunction
    name="CSVToArray"
    access="public"
    returntype="array"
    output="false"
    hint="Takes a delimited text data file or chunk of delimited data and converts it to an array of arrays.">

    <!--- Define the arguments. --->
    <cfargument
        name="CSVData"
        type="string"
        required="false"
        default=""
        hint="This is the raw CSV data. This can be used instead of a file path."
        />

    <cfargument
        name="CSVFilePath"
        type="string"
        required="false"
        default=""
        hint="This is the file path to a CSV data file. This can be used instead of a text data blob."
        />

    <cfargument
        name="Delimiter"
        type="string"
        required="false"
        default=","
        hint="The character that separates fields in the CSV."
        />

    <cfargument
        name="Qualifier"
        type="string"
        required="false"
        default=""""
        hint="The field qualifier used in conjunction with fields that contain delimiters as literal characters (ex: 1,344,343.00 where [,] is the delimiter)."
        />

    <!--- Define the local scope. --->
    <cfset var LOCAL = StructNew() />

    <!---
        Check to see if we are dealing with a file. If we are,
        then we will use the data from the file to overwrite
        any CSV data blob that was passed in.
    --->
    <cfif (
        Len( ARGUMENTS.CSVFilePath ) AND
        FileExists( ARGUMENTS.CSVFilePath )
        )>

        <!---
            Read the data file directly into the arguments scope
            where it can override the blob data.
        --->
        <cffile
            action="READ"
            file="#ARGUMENTS.CSVFilePath#"
            variable="ARGUMENTS.CSVData"
            />

    </cfif>

    <!---
        ASSERT: At this point, whether we got the CSV data
        passed in as a data blob or we read it in from a
        file on the server, we now have our raw CSV data in
        the ARGUMENTS.CSVData variable.
    --->

    <!---
        Make sure that we only have a one character delimiter.
        I am not going traditional ColdFusion style here and
        allowing multiple delimiters. I am trying to keep
        it simple.
    --->
    <cfif NOT Len( ARGUMENTS.Delimiter )>

        <!---
            Since no delimiter was passed in, use the default
            delimiter which is the comma.
        --->
        <cfset ARGUMENTS.Delimiter = "," />

    <cfelseif (Len( ARGUMENTS.Delimiter ) GT 1)>

        <!---
            Since a multicharacter delimiter was passed in, just
            grab the first character as the true delimiter.
        --->
        <cfset ARGUMENTS.Delimiter = Left(
            ARGUMENTS.Delimiter,
            1
            ) />

    </cfif>

    <!---
        Make sure that we only have a one character qualifier.
        I am not going traditional ColdFusion style here and
        allowing multiple qualifiers. I am trying to keep
        it simple.
    --->
    <cfif NOT Len( ARGUMENTS.Qualifier )>

        <!---
            Since no qualifier was passed in, use the default
            qualifier which is the quote. Note that this means
            an empty string cannot be used to turn qualifiers off.
        --->
        <cfset ARGUMENTS.Qualifier = """" />

    <cfelseif (Len( ARGUMENTS.Qualifier ) GT 1)>

        <!---
            Since a multicharacter qualifier was passed in, just
            grab the first character as the true qualifier.
        --->
        <cfset ARGUMENTS.Qualifier = Left(
            ARGUMENTS.Qualifier,
            1
            ) />

    </cfif>

    <!--- Create an array to hold the rows of data. --->
    <cfset LOCAL.Rows = ArrayNew( 1 ) />

    <!---
        Split the CSV data into rows of raw data. We are going
        to assume that each row ends with a return character,
        optionally followed by a new line character. A lone
        new line (with no return before it) will NOT split rows.
    --->
    <cfset LOCAL.RawRows = ARGUMENTS.CSVData.Split(
        "\r\n?"
        ) />

    <!--- Loop over the raw rows to parse out the data. --->
    <cfloop
        index="LOCAL.RowIndex"
        from="1"
        to="#ArrayLen( LOCAL.RawRows )#"
        step="1">

        <!--- Create a new array for this row of data. --->
        <cfset ArrayAppend( LOCAL.Rows, ArrayNew( 1 ) ) />

        <!--- Get the raw data for this row. --->
        <cfset LOCAL.RowData = LOCAL.RawRows[ LOCAL.RowIndex ] />

        <!---
            Replace out the double qualifiers. Two qualifiers in
            a row act as a qualifier literal (OR an empty
            field). Replace these with a single character to
            make them easier to deal with. This is risky, but I
            figure that Chr( 1000 ) is something that no one
            is going to use (or is it????).
        --->
        <cfset LOCAL.RowData = LOCAL.RowData.ReplaceAll(
            "[\#ARGUMENTS.Qualifier#]{2}",
            Chr( 1000 )
            ) />

        <!--- Create a new string buffer to hold the value. --->
        <cfset LOCAL.Value = CreateObject(
            "java",
            "java.lang.StringBuffer"
            ).Init()
            />

        <!---
            Set an initial flag to determine if we are in the
            middle of building a value that is contained within
            qualifiers. This will alter the way we handle
            delimiters - as delimiters or just as character
            literals.
        --->
        <cfset LOCAL.IsInField = false />

        <!--- Loop over all the characters in this row. --->
        <cfloop
            index="LOCAL.CharIndex"
            from="1"
            to="#LOCAL.RowData.Length()#"
            step="1">

            <!---
                Get the current character. Remember, since Java
                is zero-based, we have to subtract one from our
                index when getting the character at a
                given position.
            --->
            <cfset LOCAL.ThisChar = LOCAL.RowData.CharAt(
                JavaCast( "int", (LOCAL.CharIndex - 1))
                ) />

            <!---
                Check to see what character we are dealing with.
                We are interested in special characters. If we
                are not dealing with special characters, then we
                just want to add the char data to the ongoing
                value buffer.
            --->
            <cfif (LOCAL.ThisChar EQ ARGUMENTS.Delimiter)>

                <!---
                    Check to see if we are in the middle of
                    building a qualified value. If we are, then
                    this is a character literal, not an actual
                    delimiter. If we are NOT building a qualified
                    value, then this denotes the end of a value.
                --->
                <cfif LOCAL.IsInField>

                    <!--- Append char to current value. --->
                    <cfset LOCAL.Value.Append(
                        LOCAL.ThisChar.ToString()
                        ) />

                <!---
                    Check to see if we are dealing with an
                    empty field. We will know this if the value
                    in the field is equal to our "escaped"
                    double field qualifier (see above).
                --->
                <cfelseif (
                    (LOCAL.Value.Length() EQ 1) AND
                    (LOCAL.Value.ToString() EQ Chr( 1000 ))
                    )>

                    <!---
                        We are dealing with an empty field so
                        just append an empty string directly to
                        this row data.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        ""
                        ) />

                    <!---
                        Start new value buffer for the next
                        row value.
                    --->
                    <cfset LOCAL.Value = CreateObject(
                        "java",
                        "java.lang.StringBuffer"
                        ).Init()
                        />

                <cfelse>

                    <!---
                        Since we are not in the middle of
                        building a qualified value, we have
                        reached the end of the field. Add the
                        current value to the row array and start
                        a new value.

                        Be careful that when we add the new
                        value, we replace out any "escaped"
                        qualifiers with an actual qualifier
                        character.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        LOCAL.Value.ToString().ReplaceAll(
                            "#Chr( 1000 )#{1}",
                            ARGUMENTS.Qualifier
                            )
                        ) />

                    <!---
                        Start new value buffer for the next
                        row value.
                    --->
                    <cfset LOCAL.Value = CreateObject(
                        "java",
                        "java.lang.StringBuffer"
                        ).Init()
                        />

                </cfif>

            <!---
                Check to see if we are dealing with a field
                qualifier. Escaped (doubled) qualifiers and
                empty fields were already swapped for
                Chr( 1000 ) above, so any qualifier we see here
                opens or closes a qualified field.
            --->
            <cfelseif (LOCAL.ThisChar EQ ARGUMENTS.Qualifier)>

                <!---
                    Toggle the field flag. This will signal that
                    future characters are part of a single value
                    despite any delimiters that might show up.
                --->
                <cfset LOCAL.IsInField = (NOT LOCAL.IsInField) />

            <!---
                We just have a non-special character. Add it
                to the current value buffer.
            --->
            <cfelse>

                <cfset LOCAL.Value.Append(
                    LOCAL.ThisChar.ToString()
                    ) />

            </cfif>

            <!---
                If we have no more characters left then we can't
                ignore the current value. We need to add this
                value to the row array.
            --->
            <cfif (LOCAL.CharIndex EQ LOCAL.RowData.Length())>

                <!---
                    Check to see if the current value is equal
                    to the empty field. If so, then we just
                    want to add an empty string to the row.
                --->
                <cfif (
                    (LOCAL.Value.Length() EQ 1) AND
                    (LOCAL.Value.ToString() EQ Chr( 1000 ))
                    )>

                    <!---
                        We are dealing with an empty field.
                        Just add the empty string.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        ""
                        ) />

                <cfelse>

                    <!---
                        Nothing special about the value. Just
                        add it to the row data, swapping any
                        "escaped" qualifiers back to actual
                        qualifier characters.
                    --->
                    <cfset ArrayAppend(
                        LOCAL.Rows[ LOCAL.RowIndex ],
                        LOCAL.Value.ToString().ReplaceAll(
                            "#Chr( 1000 )#{1}",
                            ARGUMENTS.Qualifier
                            )
                        ) />

                </cfif>

            </cfif>

        </cfloop>

    </cfloop>

    <!--- Return the row data. --->
    <cfreturn LOCAL.Rows />

</cffunction>

I have chosen to convert the CSV to an array of arrays as I was not sure that you could depend on the constant number of fields per row. Plus, I figure that going from an array to a query (after this step) would be rather easy. Plus, since Excel is not perfectly square cols vs. rows, I figure this was more in-line with where I want to go with it (including it in my ColdFusion POI Utility component).
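Going from the array of arrays to a query could look something like the following. This is only a minimal sketch and is not part of the function above: the ArrayRowsToQuery name and the generic COL1, COL2, ... column names are hypothetical, and it assumes that short rows should simply leave their trailing cells empty.

```cfml
<cffunction
    name="ArrayRowsToQuery"
    access="public"
    returntype="query"
    output="false"
    hint="Hypothetical helper: converts an array of arrays into a query with generic column names.">

    <cfargument name="Rows" type="array" required="true" />

    <!--- Declare function-local variables up front. --->
    <cfset var Result = "" />
    <cfset var ColumnList = "" />
    <cfset var ColumnCount = 0 />
    <cfset var RowIndex = 0 />
    <cfset var ColIndex = 0 />

    <!--- Find the widest row so that every value gets a column. --->
    <cfloop index="RowIndex" from="1" to="#ArrayLen( ARGUMENTS.Rows )#">
        <cfset ColumnCount = Max(
            ColumnCount,
            ArrayLen( ARGUMENTS.Rows[ RowIndex ] )
            ) />
    </cfloop>

    <!--- Build the generic column names: COL1, COL2, ... --->
    <cfloop index="ColIndex" from="1" to="#ColumnCount#">
        <cfset ColumnList = ListAppend( ColumnList, "COL#ColIndex#" ) />
    </cfloop>

    <cfset Result = QueryNew( ColumnList ) />

    <!--- Copy each array row into a new query row. --->
    <cfloop index="RowIndex" from="1" to="#ArrayLen( ARGUMENTS.Rows )#">

        <cfset QueryAddRow( Result ) />

        <cfloop index="ColIndex" from="1" to="#ArrayLen( ARGUMENTS.Rows[ RowIndex ] )#">
            <!--- QuerySetCell() writes to the last row by default. --->
            <cfset QuerySetCell(
                Result,
                "COL#ColIndex#",
                ARGUMENTS.Rows[ RowIndex ][ ColIndex ]
                ) />
        </cfloop>

    </cfloop>

    <cfreturn Result />

</cffunction>
```

Sizing the query to the widest row keeps the mixed-length records intact rather than truncating them to the width of the first row.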

As you can see, the CSVToArray() ColdFusion function handles mixed length records, empty field values, and qualified fields. It even handles escaped qualifiers (ex. "" becomes ") but this was not demonstrated. While this is not perfect, at least it provides me with a CSV conversion interface that I can use in my POI Utility ColdFusion component. Further down the road, I will be able to swap this out for a better implementation.
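For reference, a call to CSVToArray() might look like the sketch below. The sample data and the CSVText, CRLF, and Rows variable names are made up for illustration. Because the function splits rows on a return character (optionally followed by a new line), the sketch normalizes the line breaks of the test data first.

```cfml
<!--- Hypothetical sample data: a qualified field, an escaped qualifier, an empty field, and a short row. --->
<cfsavecontent variable="CSVText">Name,Quote,Hair
"Smith, Ben","He said ""hi""",""
Jones,unquoted</cfsavecontent>

<!---
    Make sure every line break is a return + new line, since
    the function will not split rows on a lone new line.
--->
<cfset CRLF = Chr( 13 ) & Chr( 10 ) />
<cfset CSVText = Trim( CSVText ) />
<cfset CSVText = CSVText.ReplaceAll( "\r\n?|\n", CRLF ) />

<!--- Parse the CSV data using the default delimiter and qualifier. --->
<cfset Rows = CSVToArray( CSVData = CSVText ) />

<!---
    Tracing the function by hand, this should produce:
    Rows[ 1 ] = [ Name | Quote | Hair ]
    Rows[ 2 ] = [ Smith, Ben | He said "hi" | (empty string) ]
    Rows[ 3 ] = [ Jones | unquoted ]
--->
<cfdump var="#Rows#" label="CSVToArray() result" />
```

To parse a file instead, pass CSVFilePath rather than CSVData; if the file exists, its contents replace whatever was passed in as CSVData.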

When dealing with lists, use the GetToken() function. It won't ignore empty list elements. This will significantly speed up your function and replace the loop that you are doing. Also, Sammy hit the nail on the head with using RegEx to strip out the text between the qualifiers.

Another trick you can use to speed things up is to use GetToken() to populate the empty cells and then use ListToArray() for the conversion. It's a lot quicker than creating a Java object on each call.

I did think of regular expressions, 'cause they are cool, but I wasn't sure how to apply them. Plus I don't think my skills with them would be good enough to handle all the different options that come with CSV formatting. Take for example:

ben,was,here

That is three fields. But this:

"ben,was,here"

is one field. But this:

""ben,was,here""

is three fields; the first starts with a quote literal, and the last field ends with a quote literal. And then this:

""ben,"was,here"""

has two fields.... you get the point? It was just too much for me to wrap my head around. I am sure that regular expressions would rock somehow, I just can't figure it out.

Comma-separated data is a good idea with ColdFusion because it removes some of the difficult queries and the irregularities, while it is still easy to retrieve the information at the client end.

It is being used at www.compglobe.com, where you can compose a comment and the comment will be transferred to a CSV file at the server level. www.compglobe.com is also using the CSV format to upload phone numbers if you want to send information to the handset of the recipient to whom you want the material delivered. www.compglobe.com has various things like a message composer and an online radio too.

You can use CSV parsing to get those values; however, if those are the only values in the field, you can simply treat the data as if it were a comma-delimited list. Then, you can either split the list into an array with ListToArray(), or even use things like ListGetAt() and ListLen() to loop over the elements of the list and examine each individually.
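Both approaches can be sketched as follows. The Values, ValueArray, and i names are placeholders for illustration. One caveat to keep in mind: by default, ColdFusion's list functions skip empty list elements, so a value like "a,,b" is treated as two items, not three.

```cfml
<!--- Hypothetical field value holding a simple comma-delimited list. --->
<cfset Values = "red,green,blue" />

<!--- Option 1: walk the list in place with ListLen() and ListGetAt(). --->
<cfloop index="i" from="1" to="#ListLen( Values )#">
    <cfoutput>#i#: #ListGetAt( Values, i )#<br /></cfoutput>
</cfloop>

<!--- Option 2: split the list into an array once, then loop over the array. --->
<cfset ValueArray = ListToArray( Values ) />

<cfloop index="i" from="1" to="#ArrayLen( ValueArray )#">
    <cfoutput>#i#: #ValueArray[ i ]#<br /></cfoutput>
</cfloop>
```

The same index-based loop works for the array of arrays returned by CSVToArray(), which is one way to walk the rows when inserting them into a database.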

AWESOME JOB!!! I can't believe this was so difficult to find. You definitely saved HOURS of time and helped meet my deadline. This works great. People like you are what make the net an awesome place for research and learning. Thanks!!

I am typically not a fan of using header rows to do auto-name things (I don't usually trust clients to name things appropriately); but this seems to be something that people always are asking about. I will come up with something that makes this a bit easier to work with. I'll get back to you.

Do you have any idea how to accommodate for a multilingual csv? Most (but not all) of the languages/characters pass through fine. It seems that Chinese and Russian are having the most trouble being interpreted. I'm guessing this is an issue with the charset, but I am not positive (nor am I sure as to how I would go about fixing this issue).

THANK YOU! (I'm not yelling, just excited). I've been using ColdFusion longer than I like to think... I never received any training... I just read Ben Forta's book... and I was off. That said, if I hadn't found your code I would have been forced to hack together some nasty bit of code that would have caused me more trouble than good.

Question:

I need to import the data from the array into a database. I know that I can loop through a List but not an array.... any good suggestions how to easily import from an array?

I've been using your script for some time now, but I'm still having trouble with any field that has a ". My file is delimited by TAB with no quotes. I have Qualifier set to "" (nothing), but when the routine sees a ", it ignores any more tabs in that record (it concatenates all the rest of the fields for that one record into the field that had the " in it). Here are my parameters.

I ran into an issue where there are line breaks in the middle of the qualified text. I thought there would be an easy way to ignore or remove those via regex before running this function but am struggling. Any ideas? Thanks!

It doesn't seem to matter which code example I try from all the sources and examples provided in your blog post; they all seem to throw a "coldfusion.runtime.Struct cannot be cast to java.lang.String" error.

Got any suggestions of why? On a CF10 server, though it could be because of the code being built for a CF8 server...

I am the co-founder and lead engineer at InVision App, Inc, the world's leading prototyping, collaboration & workflow platform. I also rock out in JavaScript and ColdFusion 24x7 and I dream about promises resolving asynchronously.