This forum is now a read-only archive. All commenting, posting, and registration services have been turned off. Those needing community support and/or wanting to ask questions should refer to the Tag/Forum map, and to http://spring.io/questions for a curated list of Stack Overflow tags that Pivotal engineers, and the community, monitor.

Using in memory repository or in memory database

May 2nd, 2011, 11:54 AM

Hello,

We are planning a big new project, and so far we are leaning towards Spring Batch because it is remarkably robust and complete.

We have a functional requirement to process items one by one in some cases, and after running some tests we found that the overhead of using a database job repository while processing single items is far too high: all the inserts/updates against the BATCH_ tables increase the processing time considerably, which is definitely not something we want.

Our idea right now is to use a master step to partition our data and then process it with N threads. Our partitioner will assign just one record to each partition, so our slave step will process one record at a time. Is this the correct approach? Or is there another "best practice" we should follow here?

If we do not care about the job being aware of job/step state when we restart the JVM, is there any other drawback to using the in-memory job repository instead of the database one?

We were also wondering whether there is any benefit or difference between using the in-memory job repository and setting up an in-memory database and pointing the database job repository at it.

If there is indeed a benefit to using an in-memory database, is there one that is proven to work best with Spring Batch?

Our idea right now is to use a master step to partition our data and then process it with N threads. Our partitioner will assign just one record to each partition, so our slave step will process one record at a time. Is this the correct approach? Or is there another "best practice" we should follow here?

Usually the partitioner defines ranges of records; each range is then processed in a dedicated step execution.
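A sketch of that range-splitting idea, kept independent of Spring Batch's actual Partitioner interface (the class and method names here are illustrative, not framework API): the record ids are divided into contiguous [min, max] ranges, one per partition, and each range would feed one step execution.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative range splitter: divides record ids 1..totalRecords into
// gridSize contiguous inclusive ranges, one per partition/step execution.
public class RangeSplitter {
    public static List<long[]> split(long totalRecords, int gridSize) {
        List<long[]> ranges = new ArrayList<long[]>();
        long targetSize = (totalRecords + gridSize - 1) / gridSize; // ceiling division
        long start = 1;
        while (start <= totalRecords) {
            long end = Math.min(start + targetSize - 1, totalRecords);
            ranges.add(new long[] { start, end }); // inclusive [min, max]
            start = end + 1;
        }
        return ranges;
    }
}
```

In a real Partitioner implementation, each range would be stored as min/max keys in a per-partition ExecutionContext so the slave step's reader can query only its slice.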

Also we were thinking if there is any bennefit/difference between using the inMemory job repository and setting up an inmemory database and use the database job repository pointing to the inmemory database?

The in-memory job repository isn't meant to be used in production; it lacks concurrency support, for example. You should use the persistent job repository with an in-memory database such as H2 instead. It's safer and shouldn't add too much overhead.
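A minimal sketch of that setup in Spring XML config, assuming the `jdbc` and `batch` namespaces are declared (bean ids are illustrative):

```xml
<!-- Embedded in-memory H2 database, initialized with the Spring Batch schema -->
<jdbc:embedded-database id="dataSource" type="H2">
    <jdbc:script location="classpath:org/springframework/batch/core/schema-h2.sql"/>
</jdbc:embedded-database>

<!-- Persistent job repository pointing at the in-memory database -->
<batch:job-repository id="jobRepository"
    data-source="dataSource"
    transaction-manager="transactionManager"/>

<bean id="transactionManager"
    class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <property name="dataSource" ref="dataSource"/>
</bean>
```

This keeps the repository's transactional/concurrency behavior while the BATCH_ tables live only in memory, so the metadata is still lost on JVM restart, as discussed above.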


The lack of concurrency support is enough to make us take the in-memory database route; I will test that for a while and see how it goes.

We will partition our data into bigger chunks, but our worst-case scenario is when we need to partition it into chunks of size 1, so we are analyzing this based on that. And you are right, we will have a dedicated slave step to process each chunk.

One more question: is there a way to pass parameters between steps, or between the job and the steps? So far we think you can only pass String, Long, Date, and Double parameters, but what about a custom object?

So with that in mind, we will probably end up writing the custom object to a file and then passing the line number as an argument to the next step. Is this OK, or is there any Spring Batch support to make this easier/better?


We will partition our data into bigger chunks, but our worst-case scenario is when we need to partition it into chunks of size 1, so we are analyzing this based on that. And you are right, we will have a dedicated slave step to process each chunk.

There's no relation between the partition size and the chunk size. A partition can contain 10K items while the chunk size for the step is 1; it doesn't matter.
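To illustrate the independence of the two settings, a hedged XML sketch of a worker step (bean references are illustrative): the partitioner may hand this step thousands of items, while the `commit-interval` below makes it read/write them one at a time.

```xml
<batch:step id="workerStep">
    <batch:tasklet>
        <!-- commit-interval is the chunk size; it is set per step
             and has nothing to do with how many items the partition holds -->
        <batch:chunk reader="itemReader" writer="itemWriter" commit-interval="1"/>
    </batch:tasklet>
</batch:step>
```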

One more question: is there a way to pass parameters between steps, or between the job and the steps? So far we think you can only pass String, Long, Date, and Double parameters, but what about a custom object?

You can indeed use the execution context. You can also use dependency injection: define a holder class, define a bean, and inject it into the batch artifacts (readers, writers, etc.) that need the information. One step sets the value and a downstream step reads it.
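A minimal sketch of the holder-bean idea (class and method names are illustrative): a plain object registered as a Spring singleton and injected into both the step that produces the value and the step that consumes it.

```java
// Illustrative holder bean: one step's artifact sets the custom object,
// a downstream step's artifact reads it. Register it as a singleton bean
// and inject it into both.
public class CustomObjectHolder {
    private Object payload;

    public synchronized void set(Object payload) {
        this.payload = payload;
    }

    public synchronized Object get() {
        return payload;
    }
}
```

Note that because the bean is a singleton, this only works cleanly when a single instance of the job runs at a time; the concurrency issue is discussed further down the thread.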


Setting parameters in the JobExecutionContext works great for passing data between steps, but when you try to use parallel processing with a partitioner and a partitionHandler you have a problem, because you can't access the context from the partitioner (or at least I don't know how to access it).

Now, if I use dependency injection to pass the data to the partitioner, it works great as long as I only have one instance of the same job running. If I have more than one instance, all of them will share the same partitioner object (due to Spring singletons), and the partitioner parameters will be overwritten and mixed between job instances.

I thought about injecting a map with the parameters, using the jobName and jobExecutionId as the map key; that way I can control which parameters belong to each instance of my job. But again this brings me back to the same dead end of the partitioner not being able to access the executionContext.
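For what it's worth, the keyed-map idea itself can be sketched without touching the execution context at all (names here are illustrative, not framework API): a shared singleton holder that namespaces values by jobName and jobExecutionId, so concurrent instances of the same job do not overwrite each other.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-execution holder: values are keyed by jobName + executionId,
// so concurrent runs of the same job keep their parameters separate.
public class KeyedParameterHolder {
    private final Map<String, Object> values = new ConcurrentHashMap<String, Object>();

    private static String key(String jobName, long jobExecutionId, String name) {
        return jobName + "#" + jobExecutionId + "#" + name;
    }

    public void put(String jobName, long jobExecutionId, String name, Object value) {
        values.put(key(jobName, jobExecutionId, name), value);
    }

    public Object get(String jobName, long jobExecutionId, String name) {
        return values.get(key(jobName, jobExecutionId, name));
    }

    // Remove all entries for a finished execution to avoid leaking memory.
    public void clear(String jobName, long jobExecutionId) {
        String prefix = jobName + "#" + jobExecutionId + "#";
        for (Iterator<String> it = values.keySet().iterator(); it.hasNext();) {
            if (it.next().startsWith(prefix)) {
                it.remove();
            }
        }
    }
}
```

The open question of how the partitioner learns its own jobExecutionId remains; this sketch only shows how the shared singleton could keep concurrent instances apart once it does.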

Any ideas?

Right now we are thinking about having a preliminary step that sets up all the data the partitioner requires to split the job and saves it in a database table; the partitioner then reads the data from the DB and partitions the job.