Description

In HADOOP-13786 I'm adding a custom subclass for FileOutputFormat, one which can talk directly to the S3A Filesystem for more efficient operations, better failure modes, and, most critically, as part of HADOOP-13345, atomic commit of output. The normal committer relies on directory rename() being atomic for this; for S3 we don't have that luxury.

To support a custom committer, we need to be able to tell FileOutputFormat (and, implicitly, all subclasses which don't have their own custom committer) to use our new S3AOutputCommitter.

I propose:

FileOutputFormat takes a factory to create committers.

The factory takes a URI and a TaskAttemptContext and returns a committer.

The default implementation always returns a FileOutputCommitter.

A configuration option allows a new factory to be named.

An S3AOutputCommitterFactory returns a FileOutputCommitter or the new S3AOutputCommitter, depending upon the URI of the destination.
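
A rough sketch of how these pieces could fit together, for discussion only. It does not use the real Hadoop classes: `Committer`, `FileCommitter`, `S3ACommitter`, the `Map`-based configuration and task context, and the configuration key `example.outputcommitter.factory.class` are all hypothetical stand-ins so the sketch compiles on its own. The actual patch would use FileOutputCommitter, TaskAttemptContext and a real Hadoop configuration option.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an output committer; NOT the Hadoop OutputCommitter API.
interface Committer {
    String name();
}

// Stand-in for the classic rename-based FileOutputCommitter.
final class FileCommitter implements Committer {
    public String name() { return "FileOutputCommitter"; }
}

// Stand-in for the proposed S3AOutputCommitter.
final class S3ACommitter implements Committer {
    public String name() { return "S3AOutputCommitter"; }
}

// The factory takes the destination URI and a task context and returns a committer.
// A Map stands in for TaskAttemptContext here (assumption for the sketch).
interface CommitterFactory {
    Committer createCommitter(URI destination, Map<String, String> taskContext);
}

// Default implementation: always returns the file committer.
final class DefaultCommitterFactory implements CommitterFactory {
    public Committer createCommitter(URI destination, Map<String, String> taskContext) {
        return new FileCommitter();
    }
}

// S3A factory: picks the committer from the scheme of the destination URI.
final class S3ACommitterFactory implements CommitterFactory {
    public Committer createCommitter(URI destination, Map<String, String> taskContext) {
        return "s3a".equalsIgnoreCase(destination.getScheme())
            ? new S3ACommitter()
            : new FileCommitter();
    }
}

final class CommitterFactories {
    // Hypothetical configuration key naming the factory class.
    static final String FACTORY_KEY = "example.outputcommitter.factory.class";

    // Load the factory named in the configuration, or the default if none is set.
    static CommitterFactory load(Map<String, String> conf) {
        String className = conf.get(FACTORY_KEY);
        if (className == null) {
            return new DefaultCommitterFactory();
        }
        try {
            return Class.forName(className)
                .asSubclass(CommitterFactory.class)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create factory " + className, e);
        }
    }
}

final class CommitterFactoryDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CommitterFactories.FACTORY_KEY, S3ACommitterFactory.class.getName());
        CommitterFactory factory = CommitterFactories.load(conf);
        // Print which committer each destination would get.
        System.out.println(factory.createCommitter(URI.create("s3a://bucket/out"), conf).name());
        System.out.println(factory.createCommitter(URI.create("hdfs://nn/out"), conf).name());
    }
}
```

With no factory named in the configuration, every destination gets the file committer, so existing jobs should see no change; only jobs that opt in to the S3A factory and write to an s3a:// destination would get the new committer.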

Note that MRv1 already supports configurable committers; this change is only for the V2 API.

Activity

This is the initial HADOOP-13786 001 PoC patch, to give the MR bit of the code some testing too. It adds a new factory for FileOutputFormat to use when creating committers; the default one returns FileOutputCommitter instances as normal; a special S3A one in hadoop-aws handles S3A-specific operations.

Now, the other way to do this (given the need to keep the s3a code in the s3a module) would be to allow a notion of a new algorithm, one which delegates to an implementation of an interface. That would handle a problem not addressed here: how to support subclasses of FileOutputFormat with custom subclasses of FileOutputCommitter. It would also make it easier to add committers for other non-FS destinations, namely the other object stores.

However, it would be a more significant change to FileOutputCommitter; I could go that way, but it'd need support before I started.

Steve Loughran
added a comment - 16/Dec/16 19:41

Cancelling this PoC; redesigning. In order to support existing subclasses of FOF (e.g. the Parquet one), we'll have to come in lower.

I propose adding a new algorithm, "3", which really means "plug in a new committer of classname X", with another property to define that classname. We can then add an s3 committer which supports this new protocol.

This does mean that we will need to define a committer plugin, one that we can declare as unstable/limited private, and implement the S3A one.

Steve Loughran
added a comment - 09/Jan/17 13:52

*2017/06/23 update*: no, that's just messy. Best to find where those committers are used and allow them to be more generic. Example: all the Parquet one does is add an optional schema summary file. If you don't want that, any FOF committer can be used.

Steve Loughran
added a comment - 23/Jun/17 18:11
Resubmitting the original patch, as it stands, from HADOOP-13786