Appendix C. Batch Processing and Transactions

C.1 Simple Batching with No Retry

Consider the following simple example of a nested batch with no
retries. This is a very common scenario for batch processing, where
an input source is processed until exhausted, but we commit
periodically at the end of a "chunk" of processing.

The input operation (3.1) could be a message-based receive
(e.g. JMS), or a file-based read, but to recover and continue
processing with a chance of completing the whole job, it must be
transactional. The same applies to the operation at (3.2) - it must
be either transactional or idempotent.

If the chunk at REPEAT(3) fails because of a database exception at
(3.2), then TX(2) will roll back the whole chunk.

C.2 Simple Stateless Retry

It is also useful to use a retry for an operation which is not
transactional, like a call to a web-service or other remote
resource. For example:

This is actually one of the most useful applications of a retry,
since a remote call is much more likely to fail and be retryable
than a database update. As long as the remote access (2.1)
eventually succeeds, the transaction TX(0) will commit. If the
remote access (2.1) eventually fails, then the transaction TX(0) is
guaranteed to roll back.

C.3 Typical Repeat-Retry Pattern

The most typical batch processing pattern is to add a retry to the
inner block of the chunk in the Simple Batching example.
Consider this:

The inner RETRY(4) block is marked as "stateful" - see the
typical use case for a description of a stateful
retry. This means that if the the retry PROCESS(5) block fails, the
behaviour of the RETRY(4) is as follows.

Throw an exception, rolling back the transaction TX(2) at the
chunk level, and allowing the item to be re-presented to the input
queue.

When the item re-appears, it might be retried depending on the
retry policy in place, executing PROCESS(5) again. The second and
subsequent attempts might fail again and rethrow the exception.

Eventually the item re-appears for the final time: the retry
policy disallows another attempt, so PROCESS(5) is never
executed. In this case we follow a RECOVER(6) path, effectively
"skipping" the item that was received and is being processed.

Notice that the notation used for the RETRY(4) in the plan above
shows explictly that the the input step (4.1) is part of the retry.
It also makes clear that there are two alternate paths for
processing: the normal case is denoted by PROCESS(5), and the
recovery path is a separate block, RECOVER(6). The two alternate
paths are completely distinct: only one is ever taken in normal
circumstances.

In special cases (e.g. a special TranscationValidException
type), the retry policy might be able to determine that the
RECOVER(6) path can be taken on the last attempt after PROCESS(5)
has just failed, instead of waiting for the item to be re-presented.
This is not the default behavior because it requires detailed
knowledge of what has happened inside the PROCESS(5) block, which is
not usually available - e.g. if the output included write
access before the failure, then the exception should be rethrown to
ensure transactional integrity.

The completion policy in the outer, REPEAT(1) is crucial to the
success of the above plan. If the output(5.1) fails it may throw an
exception (it usually does, as described), in which case the
transaction TX(2) fails and the exception could propagate up through
the outer batch REPEAT(1). We do not want the whole batch to stop
because the RETRY(4) might still be successful if we try again, so
we add the exception=not critical to the outer REPEAT(1).

Note, however, that if the TX(2) fails and we do try again, by
virtue of the outer completion policy, the item that is next
processed in the inner REPEAT(3) is not guaranteed to be the one
that just failed. It might well be, but it depends on the
implementation of the input(4.1). Thus the output(5.1) might fail
again, on a new item, or on the old one. The client of the batch
should not assume that each RETRY(4) attempt is going to process the
same items as the last one that failed. E.g. if the termination
policy for REPEAT(1) is to fail after 10 attempts, it will fail
after 10 consecutive attempts, but not necessarily at the same item.
This is consistent with the overall retry strategy: it is the inner
RETRY(4) that is aware of the history of each item, and can decide
whether or not to have another attempt at it.

C.4 Asynchronous Chunk Processing

The inner batches or chunks in the typical example
above can be executed concurrently by configuring the outer batch to
use an AsyncTaskExecutor. The outer batch waits for all the
chunks to complete before completing.

C.5 Asynchronous Item Processing

The individual items in chunks in the typical
can also in principle be processed concurrently. In this case the
transaction boundary has to move to the level of the individual
item, so that each transaction is on a single thread:

This plan sacrifices the optimisation benefit, that the simple plan
had, of having all the transactional resources chunked together. It
is only useful if the cost of the processing (5) is much higher than
the cost of transaction management (3).

C.6 Interactions Between Batching and Transaction Propagation

There is a tighter coupling between batch-retry and TX management
than we would ideally like. In particular a stateless retry cannot
be used to retry database operations with a transaction manager that
doesn't support NESTED propagation.

Now if TX(3) rolls back it can pollute the whole batch at TX(1) and
force it to roll back at the end.

What about non-default propagation?

In the last example PROPAGATION_REQUIRES_NEW at TX(3) will
prevent the outer TX(1) from being polluted if both transactions
are eventually successful. But if TX(3) commits and TX(1) rolls
back, then TX(3) stays committed, so we violate the transaction
contract for TX(1). If TX(3) rolls back, TX(1) does not necessarily (but it probably
will in practice because the retry will throw a roll back
exception).

PROPAGATION_NESTED at TX(3) works as we require in the retry
case (and for a batch with skips): TX(3) can commit, but
subsequently be rolled back by the outer transaction TX(1). If
TX(3) rolls back, again TX(1) will roll back in practice. This
option is only available on some platforms, e.g. not Hibernate or
JTA, but it is the only one that works consistently.

So NESTED is best if the retry block contains any database access.

C.7 Special Case: Transactions with Orthogonal Resources

Default propagation is always OK for simple cases where there are no
nested database transactions. Consider this (where the SESSION and
TX are not global XA resources, so their resources are orthogonal):

Here there is a transactional message SESSION(0), but it doesn't
participate in other transactions with
PlatformTransactionManager, so doesn't propagate when TX(3)
starts. There is no database access outside the RETRY(2) block. If
TX(3) fails and then eventually succeeds on a retry, SESSION(0) can
commit (it can do this independent of a TX block). This is similar
to the vanilla "best-efforts-one-phase-commit" scenario - the worst
that can happen is a duplicate message when the RETRY(2) succeeds
and the SESSION(0) cannot commit, e.g. because the message system is
unavailable.

C.8 Stateless Retry Cannot Recover

The distinction between a stateless and a stateful retry in the
typical example above is important. It is actually
ultimately a transactional constraint that forces the distinction,
and this constraint also makes it obvious why the distinction
exists.

We start with the observation that there is no way to skip an item
that failed and successfully commit the rest of the chunk unless we
wrap the item processing in a transaction. So we simplify the
typical batch execution plan to look like this:

Here we have a stateless RETRY(3) with a RECOVER(5) path that kicks
in after the final attempt fails. The "stateless" label just means
that the block will be repeated without rethrowing any exception up
to some limit. This will only work if the transaction TX(4) has
propagation NESTED.

If the TX(3) has default propagation properties and it rolls back,
it will pollute the outer TX(1). The inner transaction is assumed by
the transaction manager to have corrupted the transactional
resource, and so it cannot be used again.

Support for NESTED propagation is sufficiently rare that we choose
not to support recovery with stateless retries in current versions of
Spring Batch. The same effect can always be achieved (at the
expense of repeating more processing) using the
typical pattern above.