We need an RPC retry framework

Description

We need a mechanism for RPC calls that throw exceptions to automatically retry the call under certain circumstances. In particular, we often end up wrapping RPC calls in ad-hoc retry loops to handle timeouts. We should be able to build a retrying proxy that makes the RPC call and retries it under the appropriate circumstances.


Doug Cutting added a comment - 16/Oct/06 19:05
This looks good to me. +1
For comparison, the proxy approach might look something like:
XxxProtocol proxy = new RetryProxy(XxxProtocol.class, Try.FOREVER);
The weakness of this approach would be (a) the client would need to know how to build an appropriate proxy for the protocol, and (b) specifying different reliability contracts for different methods would be awkward. By contrast, your annotation-based approach keeps the reliability requirements with the protocol and per-method.
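For illustration, the proxy approach sketched above could be built on java.lang.reflect.Proxy. This is a minimal sketch, not the patch under discussion: EchoProtocol, RetryProxy.create, and the simple fixed try count are all hypothetical stand-ins, and it shows exactly the weakness described, since one policy applies to every method of the interface.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical protocol interface standing in for XxxProtocol.
interface EchoProtocol {
    String echo(String msg) throws IOException;
}

// Minimal retrying proxy: re-invokes any failed call up to maxTries times.
class RetryProxy implements InvocationHandler {
    private final Object target;
    private final int maxTries;

    private RetryProxy(Object target, int maxTries) {
        this.target = target;
        this.maxTries = maxTries;
    }

    // Wraps target so every interface method is retried on exception.
    @SuppressWarnings("unchecked")
    static <T> T create(Class<T> iface, T target, int maxTries) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface },
                new RetryProxy(target, maxTries));
    }

    @Override
    public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
        Throwable last = null;
        for (int i = 0; i < maxTries; i++) {
            try {
                return m.invoke(target, args);
            } catch (InvocationTargetException e) {
                last = e.getCause(); // unwrap the real exception and retry
            }
        }
        throw last; // retries exhausted; rethrow the last failure
    }
}
```

A client wrapping a flaky implementation would call RetryProxy.create(EchoProtocol.class, impl, 5), which is exactly the per-protocol construction burden point (a) complains about.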


Johan Oskarsson added a comment - 15/Feb/07 10:27
Any news on this one?
Whenever the jobtracker is busy, our JobClient.submitJob calls fail.
This in turn means the whole chain of jobs that depends on that submit fails; that's a big issue when it comes to using Hadoop in production.

Johan Oskarsson added a comment - 19/Feb/07 17:26
Created a draft patch to solve this issue more or less as suggested by Owen.
I left out the Try option (try once, no exception), since it's not possible to return null when the proxied method has a primitive return type.
This patch just contains the annotations and retry bits; I'm going to leave it up to someone with more intimate knowledge of Hadoop to decide which retry policy goes where.
Suggestions and corrections welcome.
Example usage:
@RpcReliability(RetryPolicy.FOREVER)
public void doSomething() throws IOException;
@RpcReliability(RetryPolicy.TRY_NOTIFY)
public void doSomething() throws IOException;
@RpcReliability(value=RetryPolicy.TIMED, timedDelaySecs=5, timedTrySecs=30)
public int doSomething() throws IOException;
@RpcReliability(value=RetryPolicy.TIMED, timedDelaySecs=5, timedTrySecs=30,
    handlers={
        @ErrorHandler(err=SocketTimeoutException.class, response=RetryPolicy.FOREVER),
        @ErrorHandler(err=NullPointerException.class, response=RetryPolicy.TRY_NOTIFY)})
public int doSomething() throws IOException;
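A minimal sketch of how such an annotation could be declared and read at runtime. The declaration below is an assumption reconstructed from the usage shown above, not the actual patch; the handlers/@ErrorHandler element is omitted for brevity. RUNTIME retention is the key requirement, so a retrying proxy can inspect the contract reflectively when a call fails.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical policy enum matching the names used in the examples above.
enum RetryPolicy { FOREVER, TRY_NOTIFY, TIMED }

// Assumed declaration of the annotation, kept visible at runtime so a
// proxy can read it reflectively.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface RpcReliability {
    RetryPolicy value();
    int timedDelaySecs() default 0;
    int timedTrySecs() default 0;
}

// A protocol method carrying a reliability contract, as in the examples.
interface SampleProtocol {
    @RpcReliability(value = RetryPolicy.TIMED, timedDelaySecs = 5, timedTrySecs = 30)
    int doSomething() throws java.io.IOException;
}

class ReliabilityReader {
    // Resolves the policy for a method; unannotated methods get no policy.
    static RetryPolicy policyFor(Method m) {
        RpcReliability r = m.getAnnotation(RpcReliability.class);
        return r == null ? null : r.value();
    }
}
```

This is what keeps the reliability requirements with the protocol and per-method: the invocation handler looks up the annotation on the failing Method instead of applying one interface-wide policy.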


Tom White added a comment - 19/Feb/07 20:04
Thanks Johan. It would be good to add the annotation handling from your patch to the mechanism in HADOOP-997. This is probably best done after HADOOP-997 is committed, as one of a series of follow-up Jiras to introduce retries in various parts of Hadoop. Note that if you use annotations you can't get parameters from the configuration file (unless anyone knows of a way to do this?)
Also, I've put the retry classes in a new package org.apache.hadoop.io.retry. Do people feel this is OK? It doesn't feel quite right as it's not exclusively for IO retries.
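To illustrate the configuration point: annotation element values must be compile-time constants, so any per-deployment tuning has to happen where the retrying proxy is constructed, by reading a configuration object at that point. A hypothetical sketch, using java.util.Properties in place of Hadoop's Configuration; the key names are invented purely for illustration:

```java
import java.util.Properties;

// Sketch: retry parameters resolved from configuration at proxy-construction
// time, since they cannot be supplied through annotations at runtime.
class RetrySettings {
    final int maxTries;
    final long delayMillis;

    RetrySettings(Properties conf) {
        // Hypothetical key names, with defaults applied when unset.
        this.maxTries = Integer.parseInt(
                conf.getProperty("rpc.retry.max.tries", "3"));
        this.delayMillis = Long.parseLong(
                conf.getProperty("rpc.retry.delay.millis", "1000"));
    }
}
```

Under this split, annotations would express the per-method contract (which policy applies) while configuration supplies the tunable numbers, which is roughly the division of labour the follow-up work would need to settle.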