This forum is now a read-only archive. All commenting, posting, and registration services have been turned off. Those needing community support and/or wanting to ask questions should refer to the Tag/Forum map, and to http://spring.io/questions for a curated list of stackoverflow tags that Pivotal engineers, and the community, monitor.

Wishlist / Coding Examples for the following...

Dec 26th, 2007, 06:14 PM

I've started evaluating Spring Batch and it looks promising. I've worked on several projects that do a fair amount of file batch processing. From my experience, there are some features that the framework needs to support before I can comfortably recommend that our group use it.

Specifically, does the framework support:

native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"? For us it's fine if this breaks the transaction boundary demarcations.

optional vs required fields (I assume you'd have to use something like the ValidatingItemProvider for this)

optional vs required record types (I assume you'd have to use something like the ValidatingItemProvider for this)

field padding (left vs right and padding char)

field masks (i.e. mask="MM/dd/yy", or masks similar to the java.text.Format)

field-by-field default values for empty/null fields (i.e. if field1 is empty or blank, default it to today)

On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n)? Perhaps the FieldSet class could contain an attribute for that info?

Record separators for files that don't use a CR or CR/LF for the end of the line (i.e. field1|field2|field3|field4!lastfieldinrecord~). Perhaps the solution is to use the RecordSeparatorPolicy and/or SuffixRecordSeparatorPolicy?

What if you want to do multi passes of the file - one to validate it (especially useful for files containing multiple record types), then one to process it?

I'm very familiar with Spring but Spring Batch is totally new to me so perhaps it does support what I'm asking and I just overlooked how to accomplish what I'd like.

Can someone please help point me at an example of how to do some of the things I have a question on?

There is a sample job for this (multi-line job). All you need is to define a LineTokenizer for each record type.
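The prefix-matching idea behind that sample can be sketched without the framework. The class and method names below are illustrative, not the Spring Batch API: each record type is identified by a leading tag and routed to its own tokenization rule (here simply a per-type delimiter).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of prefix-based record-type dispatch, the idea behind a
// prefix-matching composite LineTokenizer: the first matching prefix
// decides how the rest of the line is tokenized.
public class PrefixDispatch {
    private final Map<String, String> delimiterByPrefix = new LinkedHashMap<>();

    // Associate a record-type prefix with the delimiter its lines use.
    public void register(String prefix, String delimiter) {
        delimiterByPrefix.put(prefix, delimiter);
    }

    // Find the first registered prefix that matches, then split the line.
    public String[] tokenize(String line) {
        for (Map.Entry<String, String> e : delimiterByPrefix.entrySet()) {
            if (line.startsWith(e.getKey())) {
                return line.split(Pattern.quote(e.getValue()));
            }
        }
        throw new IllegalArgumentException("No tokenizer registered for line: " + line);
    }
}
```

A tokenizer written this way could also be made to track which prefixes it has seen, which is the hook for the "required record types" question further down.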

optional vs required fields (I assume you'd have to use something like the ValidatingItemProvider for this)

This is something we've discussed, and it is definitely possible with the FixedLengthTokenizer by not including a range within the column definition. However, it isn't possible with the DelimitedLineTokenizer. You could simply not map a field to a particular object, but with 'automapping' there would be issues. It should be added as an issue in Jira. Can you add one with an example business case where you use optional fields?

optional vs required record types (I assume you'd have to use something like the ValidatingItemProvider for this)

By default, if you use the PrefixMatchingCompositeLineTokenizer, every record type would be optional. However, you could easily write your own LineTokenizer that knows which record types are optional or required.

field padding (left vs right and padding char)

Padding should work for input (see BATCH-261). And there are setters for padding of fields in the FixedLengthAggregator. However, it should probably be more fine-grained than it is currently.
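The per-field control being asked for amounts to three settings: direction, width, and pad character. A minimal self-contained sketch (not the aggregator's actual setters):

```java
// Sketch of left/right field padding with a configurable pad character --
// the kind of per-field control a fine-grained aggregator would expose.
public class FieldPadder {
    // Pad on the left, e.g. numeric fields zero-filled to a fixed width.
    public static String padLeft(String value, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        for (int i = value.length(); i < width; i++) sb.append(pad);
        return sb.append(value).toString();
    }

    // Pad on the right, e.g. text fields space-filled to a fixed width.
    public static String padRight(String value, int width, char pad) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) sb.append(pad);
        return sb.toString();
    }
}
```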

field masks (i.e. mask="MM/dd/yy", or masks similar to the java.text.Format)

Supported
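For date masks specifically, a pattern like "MM/dd/yy" is exactly a java.text.SimpleDateFormat pattern; a FieldSet-style readDate(field, mask) call boils down to something like this sketch (the class below is illustrative, not framework code):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of mask-driven field conversion using java.text.SimpleDateFormat,
// the same pattern language as the mask="MM/dd/yy" example above.
public class MaskedField {
    public static Date readDate(String raw, String mask) {
        SimpleDateFormat fmt = new SimpleDateFormat(mask);
        fmt.setLenient(false); // reject impossible values like 13/45/07
        try {
            return fmt.parse(raw.trim());
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                "Cannot parse '" + raw + "' with mask " + mask, e);
        }
    }
}
```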

field-by-field default values for empty/null fields (i.e. if field1 is empty or blank, default it to today)

Tokenizers don't do this by default, although a FieldSetMapper that you write could easily do it.
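The defaulting logic such a hand-written mapper would carry is small; a self-contained sketch (the helper name is made up for illustration):

```java
// Sketch of field-by-field defaulting inside a custom FieldSetMapper:
// a null or blank raw value falls back to a supplied default, e.g.
// today's date for the "default field1 to today" example above.
public class DefaultingMapper {
    public static String orDefault(String raw, String defaultValue) {
        return (raw == null || raw.trim().isEmpty()) ? defaultValue : raw.trim();
    }
}
```

In a real mapper you would call this per field while building the domain object, passing a freshly computed default (such as the current date) for the fields that need one.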

On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n)? Perhaps the FieldSet class could contain an attribute for that info?

There is a setter for the delimiter in the DelimitedLineTokenizer; however, it will be used for every field in the file. I'm curious what the use case would be for having multiple delimiters per file?

Record separators for files that don't use a CR or CR/LF for the end of the line (i.e. field1|field2|field3|field4!lastfieldinrecord~). Perhaps the solution is to use the RecordSeparatorPolicy and/or SuffixRecordSeparatorPolicy?

There is a RecordSeparatorPolicy as part of the FlatFileItemReader.
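The behaviour such a policy encodes for the '~'-terminated example above can be sketched in isolation: records end at the suffix character rather than at CR/LF, and line breaks carry no meaning.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of suffix-based record separation (the idea behind a
// SuffixRecordSeparatorPolicy): a record ends at the terminator
// character, and CR/LF inside the stream is ignored.
public class SuffixSeparator {
    public static List<String> records(String input, char terminator) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == terminator) {
                out.add(current.toString());  // record complete
                current.setLength(0);
            } else if (c != '\r' && c != '\n') {
                current.append(c);            // line breaks are not meaningful
            }
        }
        return out;
    }
}
```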

What if you want to do multi passes of the file - one to validate it (especially useful for files containing multiple record types), then one to process it?

You could easily have multiple steps that correspond to these 'passes'.

Comment

Thanks for all the info - that's very helpful. I'll start to try out your suggestions a bit tonight and will try to open that Jira ticket for the optional/required fields in the next day or two.

As far as Lucas's question regarding:

There is a setter for the delimiter in the DelimitedLineTokenizer; however, it will be used for every field in the file. I'm curious what the use case would be for having multiple delimiters per file?

I guess a use case would be: we have a file that uses pipe delimiters, except for a few fields in the record that might themselves contain pipes. It's not a great example, since one could argue that we should pick a delimiter like x00 that we're guaranteed never to encounter in any of our fields, but unfortunately we're limited to the characters that the system outputting the data file can generate. I've also used the batch processing tool Ab Initio (look it up on Wikipedia if you're not familiar with it), and it's able to handle a different character for delimiting each field.

I'll let you know how I make out.

Comment

Interesting, I've seen a lot of projects pick pipe over comma delimited because of the likelihood of commas being part of the data, but usually pipes are relatively safe. It could be added to the Tokenizer, but it seems like a minority use case and probably out of scope for Release 1. However, please add it to JIRA, and if a lot of others need the feature, it could be moved up.

Also, if a reliable delimiter can't be chosen, is using a fixed-length format a possibility?

Comment


Then why not support escaping, like "\\" in Java (i.e. a single delimiter is a delimiter, a doubled delimiter is a literal occurrence of the delimiter character)? Normally it is no more complicated to double delimiters inside fields on output than it is to use different delimiters for different fields, and this solution is 100% safe.

Regards,
Oleksandr

Comment

N.B. The DelimitedLineTokenizer adopts the Microsoft-inspired convention that a field containing a delimiter (line or field delimiter) can be escaped by quoting it. Inside such a field a quote character is escaped by repeating it. This is what you get from Excel (for instance) when you do Save As... -> CSV, so it covers a large constituency already. The Javadocs mention this behaviour in the setter for the quote character (which defaults to ").
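That quoting convention can be parsed with a small state machine. A self-contained sketch (not the DelimitedLineTokenizer source): a quoted field may contain the delimiter, and a doubled quote inside it stands for a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Excel/Microsoft CSV convention described above:
// fields containing the delimiter are wrapped in quotes, and a
// literal quote inside a quoted field is written twice ("" -> ").
public class QuotedTokenizer {
    public static List<String> tokenize(String line, char delim, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote && i + 1 < line.length()
                        && line.charAt(i + 1) == quote) {
                    field.append(quote); i++;   // doubled quote -> literal quote
                } else if (c == quote) {
                    inQuotes = false;           // closing quote
                } else {
                    field.append(c);            // delimiter is data while quoted
                }
            } else if (c == quote) {
                inQuotes = true;                // opening quote
            } else if (c == delim) {
                fields.add(field.toString());   // field boundary
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());           // last field
        return fields;
    }
}
```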

Comment

It is good that this convention is supported, but it seems slightly overcomplicated: simple doubling of the delimiter is easier to produce and to parse, and should give (marginally) better performance, which is not irrelevant in batch applications. Also, processing a quoted string requires virtually unlimited look-ahead (especially if the file being processed is malformed), while simple delimiter doubling requires only single-character look-ahead and is much safer in that respect.

So it is quite reasonable to support such a strategy as well. In any case, it is not very likely (though still possible) that CSV files for batch processing would be created by Excel, as Excel is mostly an interactive tool.
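For comparison with the quoted-field convention, the doubled-delimiter strategy argued for here really does need only one character of look-ahead; a self-contained sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the doubled-delimiter escaping proposed above: a single
// delimiter splits fields, a doubled delimiter is a literal delimiter
// character. Parsing needs exactly one character of look-ahead.
public class DoubledDelimiterTokenizer {
    public static List<String> tokenize(String line, char delim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == delim) {
                if (i + 1 < line.length() && line.charAt(i + 1) == delim) {
                    field.append(delim); i++;   // doubled -> literal delimiter
                } else {
                    fields.add(field.toString());  // field boundary
                    field.setLength(0);
                }
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }
}
```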


Comment

One thing I would also like to point out is that delimiters are set for an entire file; that is, there is one setter for the delimiter that is used. Attempting to set a delimiter per field would require significantly more configuration than there is currently, for very little value. At a minimum, if this feature is needed, it would need to be a separate tokenizer altogether, so that the more common use case would be easier to configure. That being said, I still don't understand what setting a delimiter per field would add that couldn't more easily be accommodated by using fixed-length formatting.

Comment

To put it shortly: size reduction, sometimes very significant.
But anyway, a delimiter with the possibility to escape it is a much better solution than a delimiter per field, and quite often better than fixed length as well.


Comment

Hi, I need to implement an ItemReader for an Excel file. How do I do this? I've taken a look at this thread because it's the closest thing I can find that is related to creating an ItemReader for Excel. Anyway, I'm not sure, but I think I cannot use FlatFileItemReader for Excel files, especially since the Excel file that I need to parse contains multiple tabs.

Comment

You see, I'm a bit confused about how the read() method is called. I'm trying to parse an Excel file. I pass the Excel file name and sheet name to the constructor of the CustomItemReader I created (which extends FlatFileItemReader and implements ItemReader). In the constructor of my CustomItemReader, I used JExcelAPI to load the worksheet whose data I need to process; in the read() method, I'm supposed to return the contents of each row. However, for some reason, the read() method is not called at all. I'm at a loss as to why this happens. Please see below my job configuration and my classes:

Comment

I don't understand why you extend FlatFileItemReader. There isn't much point if you are not reading a flat file. Also, your implementation of ItemReader is not honouring the reset() and mark() contract (so rollbacks will not work). And it doesn't implement ItemStream with the index of your row list, so it isn't restartable.

Other than that I can't see any issues explaining why read() is not working. How did you launch the job?