CSV Parsing algorithms in Java

Hello, has anybody seen well-known/good practice CSV parsing algorithms
in Java? I've been googling about but can't see anything suitable so
far. I'm not interested in using library functions, rather implementing
the algorithm myself (or at least learning how to).

Thanks, I had a look. The reason I'm asking is because I had a graduate
role interview and they asked this as a question, as in to write one. I
didn't know how to anyway, but looking at Roedy's, just the get() method
is 200 lines, am I really expected to know this stuff off by
heart?

Jeffrey Spoon <> writes:
>Thanks, I had a look. The reason I'm asking is because I had a graduate
>role interview and they asked this as a question, as in to write one. I
>didn't know how to anyway, but looking at Roedy's, just the get() method
>is 200 lines, am I really expected to know this stuff off by
>heart?

The correct answer would have been:

»There are dozens of different formal languages, all
referred to by the name of "CSV". Some differ only by
minor details, but these are important, when one wants to
write a parser. So, I would like to invite you to join me
in a process to figure out the exact specifications of the
language you want me to parse or - if available - please
give me a language specification«.

Once all such questions had been cleared up, I would have
been able to write a parser from scratch, provided the interviewer
had the patience to wait for me to finish it. Having the Java
SE API documentation at hand might have helped with this.

In message <-berlin.de>, Stefan Ram
<-berlin.de> writes
>Jeffrey Spoon <> writes:
>>Thanks, I had a look. The reason I'm asking is because I had a graduate
>>role interview and they asked this as a question, as in to write one. I
>>didn't know how to anyway, but looking at Roedy's, just the get() method
>>is 200 lines, am I really expected to know this stuff off by
>>heart?
>
> The correct answer would have been:
>
> »There are dozens of different formal languages, all
> referred to by the name of "CSV". Some differ only by
> minor details, but these are important, when one wants to
> write a parser. So, I would like to invite you to join me
> in a process to figure out the exact specifications of the
> language you want me to parse or - if available - please
> give me a language specification«.
>
> Once all such questions had been cleared up, I would have
> been able to write a parser from scratch, provided the interviewer
> had the patience to wait for me to finish it. Having the Java
> SE API documentation at hand might have helped with this.
>

So that's a no then?

They did specify that some of the values may contain double quotes.
I had two other questions to do as well, in 30 minutes. One was a fairly
advanced SQL question (for me anyway) and the other was easy enough,
about client/server stuff. They left me to write the answers down with
no references other than the question sheet. Oh, and there were some
other multiple choice questions, but they were fairly straightforward.

in message <>, Jeffrey Spoon
('') wrote:
> In message <>, David Segall
> <> writes
>>Jeffrey Spoon <> wrote:
>>
>>>Hello, has anybody seen well-known/good practice CSV parsing algorithms
>>>in Java? I've been googling about but can't see anything suitable so
>>>far. I'm not interested in using library functions, rather implementing
>>>the algorithm myself (or at least learning how to).
>>>
>>>Any pointers appreciated, thanks.
>>Roedy Green has assembled some useful information on this topic.
>><http://mindprod.com/jgloss/csv.html>
>
> Thanks, I had a look. The reason I'm asking is because I had a graduate
> role interview and they asked this as a question, as in to write one. I
> didn't know how to anyway, but looking at Roedy's, just the get() method
> is 200 lines, am I really expected to know this stuff off by
> heart?
>
> Thanks to the others who suggested as well, I'll get around to them.

Heavens, writing a CSV parser is trivial. It's simply a case of a
StringTokenizer in a for loop:

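public ResultClass parse( InputStream in, String separatorChars)
    throws IOException
{
    ResultClass result = new ResultClass();
    BufferedReader buffy = new BufferedReader( new InputStreamReader( in));

    for ( String line = buffy.readLine(); line != null; line = buffy.readLine())
    {
        StringTokenizer tok = new StringTokenizer( line, separatorChars);
        while ( tok.hasMoreTokens())
        {
            // do something with result and tok.nextToken()
        }
    }
    /* consider (and document) whether it's your or the caller's
     * responsibility to close the stream; since you were passed the
     * stream I suggest it's the caller's */

    return result;
}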

As to what that ResultClass object should be, if the first line in your CSV
may be column headers and each value in the first row is distinct then
probably what you want is a vector of maps where the keys of the maps are
the corresponding values from the first line; otherwise I'd probably just
return a vector of vectors.
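
For illustration, the "vector of maps" idea could look roughly like the sketch below. It uses `List` and `Map` rather than `Vector`, and the class and method names (`HeaderRows`, `toMaps`) are made up for this example. It assumes the rows have already been split into fields and that the header values are distinct, as described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns already-split rows into one map per data row,
// keyed by the values of the first (header) row.
final class HeaderRows {
    static List<Map<String, String>> toMaps(List<List<String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            // LinkedHashMap keeps the columns in file order.
            Map<String, String> map = new LinkedHashMap<>();
            // Rows shorter than the header simply get fewer keys;
            // extra fields beyond the header are ignored.
            for (int i = 0; i < header.size() && i < row.size(); i++) {
                map.put(header.get(i), row.get(i));
            }
            result.add(map);
        }
        return result;
    }
}
```

If the header values are not distinct, later columns would overwrite earlier ones in the map, which is why the list-of-lists form is the safer fallback.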

Obviously you may not want to schlurp a whole CSV file into core memory at
one go; it may be better to produce a parser to which you can add
callbacks/listeners for the fields or patterns you are interested in. But
the general pattern is as given.
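
A rough sketch of the callback variant, so that only one line is held in memory at a time. The names `StreamingCsv` and `forEachRow` are invented for this example, and the field splitting is deliberately naive (a plain split on the separator, no quote handling); it only stands in for whatever field parser you end up writing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical streaming reader: hands each row to a callback instead of
// collecting the whole file in memory.
final class StreamingCsv {
    static void forEachRow(Reader in, String separator, Consumer<String[]> handler) {
        BufferedReader reader = new BufferedReader(in);
        // Pattern.quote makes the separator literal; limit -1 keeps empty fields.
        String literal = Pattern.quote(separator);
        try {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                handler.accept(line.split(literal, -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The reader is not closed here: the caller passed it in, so the
        // caller is left responsible for closing it.
    }
}
```

A caller might use it as `StreamingCsv.forEachRow(reader, ",", row -> System.out.println(row[0]));` to print the first column of every row.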

-- (Simon Brooke) http://www.jasmine.org.uk/~simon/
;; Let's have a moment of silence for all those Americans who are stuck
;; in traffic on their way to the gym to ride the stationary bicycle.
;; Rep. Earl Blumenauer (Dem, OR)

"Simon Brooke" <> wrote in message
news:...
> in message <>, Jeffrey Spoon
> ('') wrote:
>
>> In message <>, David Segall
>> <> writes
>>>Jeffrey Spoon <> wrote:
>>>
>>>>Hello, has anybody seen well-known/good practice CSV parsing algorithms
>>>>in Java? I've been googling about but can't see anything suitable so
>>>>far. I'm not interested in using library functions, rather implementing
>>>>the algorithm myself (or at least learning how to).
>>>>
>>>>Any pointers appreciated, thanks.
>>>Roedy Green has assembled some useful information on this topic.
>>><http://mindprod.com/jgloss/csv.html>
>>
>> Thanks, I had a look. The reason I'm asking is because I had a graduate
>> role interview and they asked this as a question, as in to write one. I
>> didn't know how to anyway, but looking at Roedy's, just the get() method
>> is 200 lines, am I really expected to know this stuff off by
>> heart?
>>
>> Thanks to the others who suggested as well, I'll get around to them.
>
> Heavens, writing a CSV parser is trivial. It's simply a case of a
> StringTokenizer in a for loop:
>
> public ResultClass parse( InputStream in, String separatorChars)
>     throws IOException
> {
>     ResultClass result = new ResultClass();
>     BufferedReader buffy =
>         new BufferedReader( new InputStreamReader( in));
>
>     for ( String line = buffy.readLine(); line != null;
>           line = buffy.readLine())
>     {
>         StringTokenizer tok =
>             new StringTokenizer( line, separatorChars);
>
>         while ( tok.hasMoreTokens())
>         {
>             // do something with result and tok.nextToken()
>         }
>     }
>     /* consider (and document) whether it's your or the caller's
>      * responsibility to close the stream; since you were passed the
>      * stream I suggest it's the caller's */
>
>     return result;
> }
>
> As to what that ResultClass object should be, if the first line in your
> CSV may be column headers and each value in the first row is distinct then
> probably what you want is a vector of maps where the keys of the maps are
> the corresponding values from the first line; otherwise I'd probably just
> return a vector of vectors.
>
> Obviously you may not want to schlurp a whole CSV file into core memory at
> one go; it may be better to produce a parser to which you can add
> callbacks/listeners for the fields or patterns you are interested in. But
> the general pattern is as given.
>
> --
> (Simon Brooke) http://www.jasmine.org.uk/~simon/
> ;; Let's have a moment of silence for all those Americans who are stuck
> ;; in traffic on their way to the gym to ride the stationary bicycle.
> ;; Rep. Earl Blumenauer (Dem, OR)

Simon Brooke wrote:
>
> Heavens, writing a CSV parser is trivial. It's simply a case of a
> StringTokenizer in a for loop:
> [...]

There is no one official "CSV format," but even the simple
version described at http://www.wotsit.org/ is not parseable by
a mere StringTokenizer (which the JavaDoc calls a "legacy class"
whose use in new code is "discouraged," by the way).
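
To make the problem concrete, here is a small sketch (the class name `TokenizerPitfalls` and the sample lines are made up) that splits a line the way the StringTokenizer loop would. Two things go wrong: adjacent separators are collapsed, so empty fields disappear, and quotes mean nothing to the tokenizer, so a quoted field containing a comma is cut in two.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative only: shows what a plain StringTokenizer does to CSV input.
final class TokenizerPitfalls {
    /** Splits a line on commas using StringTokenizer, as in the loop above. */
    static List<String> naiveSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line, ",");
        while (tok.hasMoreTokens()) {
            fields.add(tok.nextToken());
        }
        return fields;
    }
}
```

With this, `naiveSplit("a,,c")` returns two fields instead of three, and `naiveSplit("1,\"Smith, John\",42")` returns four fields instead of three.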

Parsing CSV -- even allowing for some variations beyond the
wotsit description -- is not difficult, but not trivial. My own
CSVReader class runs to 376 lines, including JavaDoc. (It could
probably be tightened a bit; I wrote it as an exercise when I was
new to Java and would likely do things differently nowadays.)
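
This is not the CSVReader class mentioned above, just a minimal sketch of the usual character-by-character approach for a single line. It assumes one particular dialect: comma as separator, double quotes around fields that need them, and a doubled quote inside a quoted field standing for one literal quote. The names `CsvLineParser` and `parseLine` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-line parser for one assumed CSV dialect
// (comma separator, "..." quoting, "" as an escaped quote).
final class CsvLineParser {
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote: one literal quote
                        i++;               // skip the second quote
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // separators are ordinary text here
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // the last field has no trailing comma
        return fields;
    }
}
```

The sketch does no error reporting: an unterminated quote is silently accepted, and it does not handle quoted fields that span several lines. Both are the kind of detail the dialect specification would have to settle.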

> Hello, has anybody seen well-known/good practice CSV parsing algorithms
> in Java? I've been googling about but can't see anything suitable so
> far. I'm not interested in using library functions, rather implementing
> the algorithm myself (or at least learning how to).
>
> Any pointers appreciated, thanks.

Nothing based on naive use of pattern matching can possibly parse CSV since
fields may contain separator tokens. Indeed a field may contain an entire
CSV-format sub-file (and so on recursively).
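
A sketch of what that nesting looks like in practice, assuming the common convention of double-quoted fields with doubled quotes as the escape (other dialects differ). The class `CsvRecords` and its methods are made up for the example. The parser works on the whole text rather than line by line, because a quoted field may itself contain line breaks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record parser for one assumed dialect; quoted fields may
// contain commas, quotes (doubled) and line breaks.
final class CsvRecords {
    /** Wraps a field in quotes, doubling any quotes inside it. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;  // inside a quoted field
        boolean quoteSeen = false; // previous char was a quote inside a quoted field
        boolean pending = false;   // current record has content not yet flushed

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (quoteSeen) {
                    quoteSeen = false;
                    if (c == '"') {        // doubled quote: literal quote
                        field.append('"');
                        continue;
                    }
                    inQuotes = false;      // the earlier quote closed the field
                } else {
                    if (c == '"') quoteSeen = true; else field.append(c);
                    continue;
                }
            }
            if (c == '\n') {               // end of record (outside quotes only)
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
                pending = false;
            } else if (c != '\r') {        // drop the CR of CRLF outside quotes
                pending = true;
                if (c == '"') inQuotes = true;
                else if (c == ',') { record.add(field.toString()); field.setLength(0); }
                else field.append(c);
            }
        }
        if (pending) {                     // last record without trailing newline
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

For example, parsing `"id,payload\n7," + CsvRecords.quote("x,y\n1,2\n") + "\n"` gives two records, and the second field of the second record is itself a CSV text that can be handed back to `parse` to get the inner rows.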

If /I/ had set this exercise then my (hidden) purpose would have been to filter
out candidates who don't realise that this is a reasonably complex parsing
task, and not solvable with simple-minded tools like regexps[*].

The probability (I think) is that the OP's interviewer was someone who would
have failed my test ;-)

Mind you, I wouldn't have set this task -- too challenging for the context.
Unless, perhaps, I were interviewing for very senior engineers and I was
expecting them to show that they could think realistically under pressure by
answering "that's too complicated to do here and now".

-- chris

([*] Using regexps is nearly always a sign that the program is broken -- there
are not many tasks for which they are (part of) the correct solution.)

In message <>, Simon
Brooke <> writes
>>
>> Thanks to the others who suggested as well, I'll get around to them.
>
>Heavens, writing a CSV parser is trivial. It's simply a case of a
>StringTokenizer in a for loop:
>

Except I wasn't allowed to use StringTokenizer; as I said in the
original post, "I'm not interested in using library functions".

in message <>, Jeffrey Spoon
('') wrote:
> In message <>, Simon
> Brooke <> writes
>
>>>
>>> Thanks to the others who suggested as well, I'll get around to them.
>>
>>Heavens, writing a CSV parser is trivial. It's simply a case of a
>>StringTokenizer in a for loop:
>
> Except I wasn't allowed to use StringTokenizer; as I said in the
> original post, "I'm not interested in using library functions".

Then write your own; it's a trivial thing to do. Here, in fact, is one I
wrote earlier:

/**
* MIDP does not provide a StringTokenizer. Because this has to be
* compatible with MIDP we'll provide our own. If you have access to a real
* StringTokenizer don't use this one - it is minimal and possibly
* inefficient.
*/
public class StringTokenizer
{
//~ Instance fields -----------------------------------------------
