Details

Description

A goal of this code is "support O(1) commits to S3 repositories in the presence of failures". Implement it, including whatever is needed to demonstrate the correctness of the algorithm (that is, assuming that s3guard provides a consistent view of the presence/absence of blobs, show that we can commit directly).

I consider us free to expose the blobstore nature of the S3 output streams (i.e. data not visible until the close()), if we need to use that to allow us to abort commit operations.

Activity


Steve Loughran
added a comment - 07/Nov/16 13:33

One thing to consider: an atomic PUT-no-overwrite can be used for speculative commits of a single file; I'm not so confident that it can be used for any task writing more than one file: it's no longer a single atomic commit at the end of the task. There's also the little detail that the output committer code in mapred.Task assumes that work is not committed until the final communication with the AM; we'll need to make sure that cleanup always takes place. Maybe the


Steve Loughran
added a comment - 29/Nov/16 15:43

Looking at the atomic commit problem, I'm starting to worry that I may need some directory leasing, which would imply a lock field in the table. Just warning you there. The problem is that while a rename is in progress, you want to stop anything trying to PUT data anywhere under the two paths being worked on.


Steve Loughran
added a comment - 29/Nov/16 19:51

For the curious, my WiP, in the s3guard branch, is here: https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer

I've been tweaking the mapreduce code to add the ability to switch to a subclass of FileOutputCommitter in FileOutputFormat (and, transitively, all children), by way of a factory. The S3A factory dynamically chooses the committer based on the destination FS. There's currently no difference in the codebase apart from logging of operation durations. This means that we can switch the committer behind all output formats to the s3a one (and also allow anyone else to do the same with their own FOF committer subclass).

I can see that ParquetOutputFormat has its own committer, ParquetOutputCommitter, as used by ParquetOutputFormat and so, I believe, Spark's dataframe.parquet.write() pipeline; my IDE isn't highlighting much else.

Now, could I get away with just modifying the base FOF committer?

For exception reporting it could use FileSystem.rename(final Path src, final Path dst, final Rename... options); S3A would implement that with exceptions. That's the "transitory" method for use between FS and FileContext, which does raise exceptions; the default is non-atomic and eventually gets to rename(src, dest), except for HDFS, which does a full implementation. But: what if the semantics of that rename() (not yet in the FS spec, AFAIK) are different from what callers expect in some of the corner cases? And so commits everywhere break? That would not be good. And it would remove the ability to use any private s3a/s3guard calls if we wanted some (e.g. get a lock on a directory), unless they went into the standard FS APIs.

Making the output factory pluggable would avoid such problems, though logging duration might be nice all round. That said, given the time to rename, it's somewhat unimportant when using real filesystems.
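The pluggable-factory idea can be sketched in a few lines. This is a hypothetical illustration only (the class and function names here are invented, not the actual Hadoop FileOutputCommitter factory API): a registry keyed on the destination URI scheme, falling back to the rename-based committer for real filesystems.

```python
# Hypothetical sketch of a committer factory keyed on the destination
# filesystem scheme. Names are invented for illustration; this is not
# the Hadoop API, just the shape of the idea.
from urllib.parse import urlparse

class FileOutputCommitter:
    """Default committer: rename-based promotion of task output."""
    name = "file-output"

class S3ACommitter(FileOutputCommitter):
    """Committer specialised for S3A destinations."""
    name = "s3a"

_FACTORIES = {
    "s3a": S3ACommitter,
}

def committer_for(dest_uri: str) -> FileOutputCommitter:
    # Choose the committer class from the URI scheme, falling back to
    # the generic rename-based committer for real filesystems.
    scheme = urlparse(dest_uri).scheme or "file"
    cls = _FACTORIES.get(scheme, FileOutputCommitter)
    return cls()

print(committer_for("s3a://bucket/output").name)    # s3a
print(committer_for("hdfs://nn:8020/output").name)  # file-output
```

The point of the indirection is the one made above: output formats never name a committer class directly, so swapping the commit strategy per-filesystem needs no changes to the formats themselves.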


Steve Loughran
added a comment - 02/Dec/16 15:19

Update: the way s3a does commits via mergeOrUpdate() isn't really that atomic. It's atomic per destination rename, but as it treewalks doing a merge, there are real potentials for races/inconsistent outcomes. What s3guard does at least guarantee is that the view of the FS during that walk is consistent across clients.

I'm checking in/pushing up an update, where I've been slowly moving up to some s3a internals, more for performance than consistency.

Examples:

switch to innerDelete(FileStatus) for the delete, adding a parameter there to allow the check for maybe creating a parent dir to be skipped. As we know there's about to be a new child, skipping that avoids 1-3 HEAD calls and a PUT which will soon be deleted.

switch to an implementation of FileSystem.rename/3. This raises exceptions and lets us choose overwrite policy (good), but it still does two getFileStatus calls, one for the src and one for the dest. As we have the source and have just deleted the dest, we don't need them again. Having an @Private method which took the source and dest values would strip out another 2-6 HTTP requests. Admittedly, on a large commit the cost of this preamble is low; the COPY becomes the expense.


Steve Loughran
added a comment - 16/Dec/16 17:49

A few of us (Thomas, Pieter, Mingliang, Aaron and Sean) had a quick conf call earlier this week, where Thomas and Pieter outlined their proposed algorithm for implementing zero-rename commits to any consistent S3 endpoint.

I've written up my interpretation of the algorithm for review, looking through the Hadoop and Spark commit code to see what appears to be going on, though I do need to actually document the various algorithms better.

Comments welcome, especially those containing proofs of correctness.

Thomas Demoor
added a comment - 16/Dec/16 18:03

Had my write-up ready but you beat me to actually submitting, Steve Loughran.

Filed HADOOP-13912 so we can track this separately, but I think it should go on top of what you have already done in the current ticket.


Steve Loughran
added a comment - 16/Dec/16 18:28

Patch 001

modified FileOutputFormat to allow the committer to be chosen (as MRv1 did). There's actually a factory which can be defined, so that you can have a committer factory which chooses the committer based on the destination FS.

There's an S3 factory, and a committer.

The committer has some special methods in the FS API allowing it to bypass a lot of the checks before operations which the full Hadoop FS API requires. Here we can assume that the committer knows that the destination isn't a directory, doesn't want a mock parent directory created after deleting a path, etc.

ITestS3AOutputCommitter is a clone of TestFileOutputCommitter, reworked to run against S3. Note it could be used as a basis for testing commits to other filesystems; the basic one assumes the local FS.

Some of the new tests are failing; I haven't completely weaned the new tests off file:// and into being able to simulate different failures of (a subclass of) S3.


Steve Loughran
added a comment - 16/Dec/16 18:30

No worries; I had to write everything down to understand it more, look at the issues and estimate costs. I need to really define what the different committers do too, to make sure that I understand what they are doing and where.

Steve Loughran
added a comment - 09/Jan/17 17:35

Making this standalone, giving it a title but making clear it's for any consistent endpoint. And about to discard all work in patch 1 except the algorithm doc.


Steve Loughran
added a comment - 11/Jan/17 14:11

Multipart algorithm summarised by Thomas Demoor on HADOOP-13912:

Object stores do not have an efficient rename operation, which is used by the Hadoop FileOutputCommitter to atomically promote the "winning" attempt out of the multiple (speculative) attempts to the final path. These slow job commits are one of the main friction points when using object stores in Hadoop. There have been quite some attempts at resolving this: HADOOP-9565, Apache Spark DirectOutputCommitters, ... but they have proven not to be robust in the face of adversity (network partitions, ...).

The current ticket proposes to do the atomic commit by using the S3 Multipart API, which allows multiple concurrent uploads on the same object name, each in its own "temporary space", identified by the UploadId which is returned as a response to InitiateMultipartUpload. Every attempt writes directly to the final outputPath. Data is uploaded using Put Part, and as a response an ETag for the part is returned and stored. The CompleteMultipartUpload is postponed. Instead, we persist the UploadId (using a _temporary subdir or elsewhere) and the ETags. When a certain "job" wins, CompleteMultipartUpload is called for each of its files using the proper list of part ETags.

Completing a MultipartUpload is a metadata-only operation (internal to S3) and is thus orders of magnitude faster than the rename-based approach, which moves all the data.
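The protocol above can be exercised with a small runnable model. This is a sketch only: MockS3 is an in-memory stand-in for a consistent S3 endpoint (not the AWS SDK), but it mirrors the key property being relied on — initiated-but-uncompleted uploads are invisible until CompleteMultipartUpload is called with the saved UploadId and part ETags.

```python
# In-memory model of the delayed multipart-commit protocol: initiate,
# upload parts, persist UploadId + ETags, complete only for the winner.
# MockS3 is an invented stand-in, not the real AWS SDK API.
import hashlib

class MockS3:
    def __init__(self):
        self._pending = {}   # upload_id -> (key, {part_no: data})
        self._objects = {}   # key -> bytes; only completed objects are visible
        self._next_id = 0

    def initiate_multipart_upload(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self._pending[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_no, data: bytes):
        key, parts = self._pending[upload_id]
        parts[part_no] = data
        return hashlib.md5(data).hexdigest()   # the part's ETag

    def complete_multipart_upload(self, upload_id, etags):
        key, parts = self._pending.pop(upload_id)
        # the caller must present the ETag recorded for every part
        for part_no, data in sorted(parts.items()):
            assert etags[part_no] == hashlib.md5(data).hexdigest()
        self._objects[key] = b"".join(d for _, d in sorted(parts.items()))

    def abort_multipart_upload(self, upload_id):
        self._pending.pop(upload_id, None)

    def exists(self, key):
        return key in self._objects

# Two speculative attempts target the same final path; nothing is visible
# until one attempt's upload is completed by the job committer.
s3 = MockS3()
attempts = {}
for attempt in ("attempt_0", "attempt_1"):
    uid = s3.initiate_multipart_upload("output/part-0000")
    etag = s3.upload_part(uid, 1, f"data from {attempt}".encode())
    attempts[attempt] = (uid, {1: etag})    # persisted instead of completing

assert not s3.exists("output/part-0000")    # invisible until completion
winner_uid, winner_etags = attempts["attempt_1"]
s3.complete_multipart_upload(winner_uid, winner_etags)
s3.abort_multipart_upload(attempts["attempt_0"][0])
assert s3.exists("output/part-0000")
```

The "metadata-only" speed claim falls out of the model too: complete_multipart_upload moves no data the client can see; it just flips the pending upload into the visible namespace.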

Steve Loughran
added a comment - 11/Jan/17 14:43

Final committer will need to handle speculative commits to the same destination path:
individual executors will need to write to a job-specific subdir of a .temp_multipart_put dir
central job committer to only commit the work of tasks known to have completed
...and for those that did not complete: rm the pending multipart blocks by reading the ETags and deleting the pending blocks
ideally: track destination filenames and catch conflicts where >1 task generates the same dest file.
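The job-commit rule in the list above — complete the uploads of tasks the AM knows finished, abort everything else found pending — can be sketched as a small function. Names here are invented for illustration; this is not Hadoop code.

```python
# Illustrative sketch of the job-commit rule: commit pending uploads only
# for tasks known to have completed; abort the rest. Invented names.
def commit_job(pending, completed_tasks):
    """pending: {task_id: [upload_id, ...]}.
    Returns (uploads to complete, uploads to abort)."""
    committed, aborted = [], []
    for task_id, uploads in sorted(pending.items()):
        if task_id in completed_tasks:
            committed.extend(uploads)   # CompleteMultipartUpload each
        else:
            aborted.extend(uploads)     # AbortMultipartUpload each
    return committed, aborted

pending = {
    "task_0": ["upload-1"],
    "task_1": ["upload-2", "upload-3"],  # failed or speculative loser
}
committed, aborted = commit_job(pending, completed_tasks={"task_0"})
print(committed)  # ['upload-1']
print(aborted)    # ['upload-2', 'upload-3']
```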

Steve Loughran
added a comment - 27/Jan/17 18:46

Patch 002

This patch
defines the mechanism for using multipart uploads for commits
implements it
adds tests
has tests working, including a scale one.

It has not wired this up to the MRv1/v2 output committers & FileOutputFormat, though the changes have been made to the MRv2 code to make it possible to place the S3A committer behind FileOutputFormat.

What works, then?
create a file with the path /example/__pending/job1/task1/part-001.bin
this will initiate an MPU to /example/part-001.bin, to which all the output goes.
when the output stream is closed, the file /example/__pending/job1/task1/part-001.bin.pending is created, which saves everything needed to commit the job
later a FileCommitActions class can be created.
FileCommitActions.commitAllPendingFilesInPath() will scan a dir for .pending entries, and commit them one by one; here commitAllPendingFilesInPath("/example/__pending/job1/task1/")
...which causes the file /example/part-001.bin to come into existence.
or, if you call abortAllPendingFilesInPath(...), the MPUs are read and aborted.

Performance? < 1s to commit a single 128MB file over a long-haul link.

Duration of time to commit s3a://hwdev-steve-frankfurt-new/tests3ascale/scale/commit/__pending/job_001/commit.bin.pending: 688,701,514 nS

I think that's pretty good.
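The scan-and-commit step of that workflow can be sketched as follows. This is a hedged mock, not the real FileCommitActions logic: it uses a local directory in place of the S3 __pending tree, and the commit callback stands in for completing the MPU recorded in each .pending file.

```python
# Sketch of commitAllPendingFilesInPath: walk a directory for *.pending
# files (each persisting an UploadId + part ETags, written on close()),
# and "commit" each one. Local temp dir stands in for S3; invented names.
import json
import os
import tempfile

def commit_all_pending_files_in_path(pending_dir, commit_pending_file):
    outcomes = []
    for name in sorted(os.listdir(pending_dir)):
        if name.endswith(".pending"):
            with open(os.path.join(pending_dir, name)) as f:
                info = json.load(f)   # what close() persisted for this file
            outcomes.append(commit_pending_file(info))
    return outcomes

with tempfile.TemporaryDirectory() as d:
    # simulate what closing the output stream would have written
    with open(os.path.join(d, "part-001.bin.pending"), "w") as f:
        json.dump({"uploadId": "upload-1", "etags": {"1": "abc"}}, f)
    done = commit_all_pending_files_in_path(
        d, lambda info: f"committed {info['uploadId']}")
    print(done)  # ['committed upload-1']
```

An abortAllPendingFilesInPath variant would be the same scan with an abort callback instead of a commit one.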

Findbugs warnings:

Private method org.apache.hadoop.fs.s3a.commit.DelayedCompleteS3AIntegration.isDelayedCompletePath(Path) is never called. At DelayedCompleteS3AIntegration.java:[lines 114-115]

Using pointer equality to compare a java.util.ArrayList<org.apache.hadoop.fs.s3a.commit.FileCommitActions$CommitFileOutcome> with a FileCommitActions$CommitFileOutcome in org.apache.hadoop.fs.s3a.commit.FileCommitActions$CommitAllFilesOutcome.add(FileCommitActions$CommitFileOutcome). At FileCommitActions.java:[line 186]

Should org.apache.hadoop.fs.s3a.commit.S3AOutputCommitter$DurationInfo be a static inner class? At S3AOutputCommitter.java:[lines 126-140]


Steve Loughran
added a comment - 28/Jan/17 13:25

MR test failure assumed spurious:

java.util.ServiceConfigurationError: org.apache.hadoop.security.token.TokenIdentifier: Provider org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier not found
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.hadoop.security.token.Token.getClassForIdentifier(Token.java:149)
at org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:170)
at org.apache.hadoop.security.token.Token.identifierToString(Token.java:425)
at org.apache.hadoop.security.token.Token.toString(Token.java:445)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:143)
at org.apache.hadoop.mapreduce.security.TestTokenCache.testBinaryCredentials(TestTokenCache.java:116)
at org.apache.hadoop.mapreduce.security.TestTokenCache.testBinaryCredentialsWithoutScheme(TestTokenCache.java:67)

Otherwise, there isn't enough test coverage, or even the MR integration, for this to be ready to think about committing. I've stuck it up to give people progress.

Key point: we can use delayed MPU completion for 0-rename writes, caching the information needed for the operation in S3 itself. We have yet to show that this works for MR code.

Steve Loughran
added a comment - 30/Jan/17 21:18

Patch 003. This
implements the s3a committer
has both commitPath and abortPath succeed if the path doesn't exist. This avoids having to do an existence check before invocation.
adds tests for the abort
plays around with the policy as to "what to do on an abort if the file named isn't there". It now catches the exception and ignores it unless told otherwise.
has the pom setup for writing the MiniMRYarnCluster test.

The next task is that test setup; once that is done we can try to generate different situations: parallel runs, failures, etc.

Findbugs warnings:

Using pointer equality to compare a java.util.ArrayList<org.apache.hadoop.fs.s3a.commit.FileCommitActions$CommitFileOutcome> with a FileCommitActions$CommitFileOutcome in org.apache.hadoop.fs.s3a.commit.FileCommitActions$CommitAllFilesOutcome.add(FileCommitActions$CommitFileOutcome). At FileCommitActions.java:[line 215]

Should org.apache.hadoop.fs.s3a.commit.S3AOutputCommitter$DurationInfo be a static inner class? At S3AOutputCommitter.java:[lines 289-306]


Steve Loughran
added a comment - 02/Feb/17 19:56

Patch 004

This is passing the tests in a suite derived from org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter; still looking at ways to simulate the failure conditions and semantics of failure we want.

Essentially: once a pending commit has happened, there is no retry. Meaning: when a task has committed once, it should fail from then on, which it does with an FNFE on the task attempt dir.

Similarly, you can only commit a job once, even if all the job does is delete all child directories.

One change in this patch is the need to support pending subtrees, e.g. map output to part-0000/index and part-0000/data in the destination dir; this has been done by adding the notion of a _base path element in the pending tree. When a _base path is present, the destination path is the parent of the __pending dir, with all children under _base retained. With each task attempt dir being dest/__pending/$app/$app-attempt/$task_attempt/_base, this ensures that all data created in the task working dir ends up under the destination in the same directory tree.

issues:
what about cleaning up __pending? Job commit?
need to stop someone creating a path _base/_pending and so sneak in pending stuff/get very confused. Actually, stop __pending under __pending.
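The _base mapping described above amounts to a small path function: drop everything from __pending up to and including the _base marker, and graft the remainder onto the parent of __pending. A sketch, under the assumption that both markers appear exactly once in the path (invented helper, not the committer's actual code):

```python
# Sketch of the _base path mapping: the final destination of a file at
# dest/__pending/$app/$app-attempt/$task_attempt/_base/<subtree> is the
# parent of __pending plus <subtree>. Invented helper for illustration.
def final_destination(path: str) -> str:
    parts = path.split("/")
    pending = parts.index("__pending")
    base = parts.index("_base")
    dest_root = parts[:pending]      # the parent of the __pending dir
    relative = parts[base + 1:]      # children under _base are retained
    return "/".join(dest_root + relative)

src = "dest/__pending/app1/attempt1/task1/_base/part-0000/index"
print(final_destination(src))  # dest/part-0000/index
```

This also makes the final worry above concrete: a nested __pending or _base under _base would make parts.index() pick the wrong marker, which is why such paths need to be rejected.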

Findbugs warnings:

Using pointer equality to compare a java.util.ArrayList<org.apache.hadoop.fs.s3a.commit.FileCommitActions$CommitFileOutcome> with a FileCommitActions$CommitFileOutcome in org.apache.hadoop.fs.s3a.commit.FileCommitActions$CommitAllFilesOutcome.add(FileCommitActions$CommitFileOutcome). At FileCommitActions.java:[line 274]

Should org.apache.hadoop.fs.s3a.commit.S3AOutputCommitter$DurationInfo be a static inner class? At S3AOutputCommitter.java:[lines 349-366]

I don't know what's up here, but I do know that (a) we're doing too many deletes during the commit process, more specifically: creating too many mock empty dirs. That can be optimised with a "delete, don't care about a parent dir" method in WriteOperationHelper.

In the first s3guard local test, testMapFileOutputCommitter, all is well in the FS used by the job, but when a new FS instance is created in the same process, the second one isn't seeing the listing.


Steve Loughran
added a comment - 03/Feb/17 19:12

Patch 006
make sure that after a multipart put is completed, the metastore is told
a bit more diagnostics on why a test is only failing on s3guard local
added a write helper which allows the caller to delete a file without asking for any parent dir to be created. This will allow for a faster scan & commit of all files in a directory tree, after which the tree gets deleted anyway. This is not yet activated.
fixes findbugs reports (some were real!); javac, javadoc, many of the checkstyle warnings.


Steve Loughran
added a comment - 21/Feb/17 14:42

I've realised we can do delayed-commit object COPY too, because multipart PUT lets you declare each part to be a copy of a subset of another object. This would allow delayed-commit rename() operations; committers could:

Good:
can take any blob and rename it anywhere in the FS
with parallelised PUT calls, copy time is ~ filesize/(parts * 6e6) seconds, same as parallel renames today.
could be retrofitted to existing committers which rely on renames

Bad:
still need a story for propagating commit metadata from tasks to the job committer.
still need a story for cleaning up failed tasks and jobs
still not O(1) in the commit

This could be used in the bit of the Spark committer which can move files to absolute locations; tasks would do the copy and send the serialized commit data to the job committer. If the path _temporary/$attemptId/$taskId is used for the work output, then the cleanup mechanism of all work prior to the copy on task/job failure is the same as today's output committer. That doesn't cover the pending commit though, unless again something is saved to the FS somewhere. Again, something could be explicitly PUT into _pending/$jobId/$taskId for this; just save the pending data to a file with a UUID and both commit and abort can find and use it.

Proposed: add this once the core commit mechanism is done.
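The copy-time estimate quoted above is easy to sanity-check with quick arithmetic; the sketch below just evaluates the stated formula, filesize/(parts * 6e6) seconds, i.e. roughly 6 MB/s per concurrently copied part (the rate is taken from the comment, not measured here).

```python
# Quick arithmetic for the parallel-copy estimate above:
# copy time ~ filesize / (parts * 6e6) seconds, ~6 MB/s per parallel part.
def copy_seconds(filesize_bytes: float, parts: int,
                 per_part_bps: float = 6e6) -> float:
    return filesize_bytes / (parts * per_part_bps)

# A 1 GB object copied as 8 parallel parts:
print(round(copy_seconds(1e9, 8), 1))   # 20.8 seconds
# The same object as a single part takes 8x longer:
print(round(copy_seconds(1e9, 1), 1))   # 166.7 seconds
```

Which illustrates the "still not O(1)" point: the commit cost scales with data size, only the constant factor improves with parallelism.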


Steve Loughran
added a comment - 06/Mar/17 13:21

Patch 007; (limited) tests now working.

I'm changing direction slightly here and working on making the first committer a derivative of the Netflix committer. This stages to the local filesystem, then, in task commit, uploads the generated files as the multipart PUT; co-ordination information is persisted via HDFS. While this appears to add some complexity to the writing process, it avoids "magic" in the filesystem, and, by using HDFS, doesn't need DynamoDB.

What it also adds is: actual use in production, along with minicluster tests. Production use is going to mean that resilience to failures and odd execution orderings is more likely to have been addressed; with my own committer I'd be relearning how things fail.

Accordingly, I think it'd be more likely to be ready for use.

Patch 007 doesn't include any of that; it's the "before" patch.

I'm now merging in the Netflix code, using S3A and the WriteOperationHelper as the means of talking to S3. Their code is ASF licensed, but the copyright headers still say Netflix... we need it to be added to this JIRA as a patch before we could think about committing it to the ASF codebase. In the meantime, I'll work on it locally.

Thanks! I'll share what I've done with it soon, which currently comes down to: pulling into hadoop-aws module, moving under org.apache.hadoop.fs.s3a.commit, reparenting it to the new PathOutputCommitter (i.e: no longer a subclass of FileOutputCommitter, for better or worse), breaking all the tests. Related to that reparenting, see.

Steve Loughran
added a comment - 08/Mar/17 10:31

Thanks for doing so much to get this in! Let me know when it is a good time to have a look at it or if you need anything. There are a few extension points that we rely on that I'd like to make sure are kept, like getFinalOutputPath that I would like to keep if possible.

Ryan Blue
added a comment - 08/Mar/17 17:56

Ryan, I think you'll find this is inextricably linked with S3a now; I'll leave that hook in, but it'll ultimately be tied down. Why? I want to lock all access down to operations in S3AFS, rather than let you get at the Client. Do that and not only do the metrics & stuff get lost, but s3guard won't like it.

Steve Loughran
added a comment - 08/Mar/17 22:04

Patch 009; pulling in Ryan's Netflix committer. I've called it "the Staging committer", with the one playing games in S3a "the magic committer"; that one I've actually put to one side as an "eventually" feature; the goal being: common code & data underneath, but different strategies of getting the data up.

It uses the existing FileOutputCommitter for its work; some time was spent in a debugger until I worked out that the code relies on Algorithm 1, not the current default of "2". Ideally I'd rip out the FOC altogether, it's just a complication once each task only generates a single file whose name can be guaranteed to be unique. Task commits copy into the job dir; job commits rename to dest, files are scanned and used to generate the multipart commits for the real dest. Leaving alone for now.

All of Ryan's tests are failing as the FS mocking is being rejected ("not an s3a filesystem"); more migration work needed there.

I've been working on the protocol integration test copied out from the mapreduce module; this tests various sequences of the operations and asserts about final outcomes. Some of them are starting to work (core commit and abort), but not those with subdirectories, such as the intermediate Map output.
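The algorithm-1 shape described above (task commit renames the attempt directory into the job attempt directory; job commit merges that into the destination) can be sketched against a local filesystem. This is a toy model of the flow, not the actual FileOutputCommitter code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy sketch of the FileOutputCommitter "algorithm 1" shape: task
// commit renames the task attempt dir under the job attempt dir; job
// commit then moves the committed files to the final destination.
class Algo1Demo {

    static void taskCommit(Path taskAttemptDir, Path jobAttemptDir) throws IOException {
        Files.createDirectories(jobAttemptDir);
        // One rename per task: output is still not at the destination.
        Files.move(taskAttemptDir, jobAttemptDir.resolve(taskAttemptDir.getFileName()));
    }

    static void jobCommit(Path jobAttemptDir, Path dest) throws IOException {
        Files.createDirectories(dest);
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(jobAttemptDir)) {
            for (Path task : tasks) {
                try (DirectoryStream<Path> files = Files.newDirectoryStream(task)) {
                    for (Path f : files) {
                        Files.move(f, dest.resolve(f.getFileName()));
                    }
                }
            }
        }
    }

    // End-to-end run in a temp dir; true if the task's file reaches dest.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("algo1");
            Path attempt = Files.createDirectories(root.resolve("attempt_00"));
            Files.write(attempt.resolve("part-00000"), new byte[]{1, 2, 3});
            taskCommit(attempt, root.resolve("job_attempt"));
            jobCommit(root.resolve("job_attempt"), root.resolve("dest"));
            return Files.exists(root.resolve("dest").resolve("part-00000"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On a blobstore, each of those moves is a copy plus delete, which is why replacing the final rename with multipart-commit completion matters.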

Steve Loughran
added a comment - 08/Mar/17 22:16
Details at the end of: https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer.md

Dead store to outputUri in org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getOutputPath(JobContext) At StagingS3GuardCommitter.java:[line 873]

Dead store to fs in org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.setupTask(TaskAttemptContext) At StagingS3GuardCommitter.java:[line 644]

Private method org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getPendingJobAttemptsPath() is never called At StagingS3GuardCommitter.java:[line 293]

Private method org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getPendingTaskAttemptsPath(JobContext) is never called At StagingS3GuardCommitter.java:[line 247]

Test wise, I've been focusing on the integration tests org.apache.hadoop.fs.s3a.commit.staging.ITestStagingCommitProtocol, which is derived from org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter; because they're from there, they form part of the expectations of the protocol implementation for commitment
(more precisely, they're the closest we have to any definition). All these initial tests work, except for those generating files in a subdirectory, e.g. part-0000/subfile; something critical for the intermediate output of an MR job.

The scanner for files to upload is just doing a flat list and then getting into trouble when it gets handed a directory to upload instead of a simple file.

java.io.FileNotFoundException: Not a file /Users/stevel/Projects/hadoop-trunk/hadoop-tools/hadoop-aws/target/tmp/mapred/local/job_200707121733_0001/_temporary/0/_temporary/attempt_200707121733_0001_m_000000_0/part-m-00000
at org.apache.hadoop.fs.s3a.commit.staging.S3Util.multipartUpload(S3Util.java:104)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter$8.run(StagingS3GuardCommitter.java:784)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter$8.run(StagingS3GuardCommitter.java:771)
at org.apache.hadoop.fs.s3a.commit.staging.Tasks$Builder.runSingleThreaded(Tasks.java:122)
at org.apache.hadoop.fs.s3a.commit.staging.Tasks$Builder.run(Tasks.java:108)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.commitTaskInternal(StagingS3GuardCommitter.java:771)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.commitTask(StagingS3GuardCommitter.java:716)
at org.apache.hadoop.fs.s3a.commit.AbstractITCommitProtocol.commit(AbstractITCommitProtocol.java:684)
at org.apache.hadoop.fs.s3a.commit.AbstractITCommitProtocol.testMapFileOutputCommitter(AbstractITCommitProtocol.java:567)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)

What I need to do is go from the flat listFiles() to the recursive one, then use that to create the offset of the final destination.
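A sketch of that recursive scan, using Files.walk and relativize to derive each file's offset under the final destination. The method names are mine for illustration, not the committer's:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Recursive scan of a task attempt directory: every regular file,
// however deep, is mapped to a destination key by taking its path
// relative to the attempt dir. Illustrative names only.
class RecursiveScan {

    static List<String> destinationKeys(Path attemptDir, String destPrefix) {
        try (Stream<Path> walk = Files.walk(attemptDir)) {
            return walk.filter(Files::isRegularFile)
                .map(f -> destPrefix + "/"
                    + attemptDir.relativize(f).toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds part-0000/subfile plus a sibling file, then scans.
    static boolean demo() {
        try {
            Path attempt = Files.createTempDirectory("scan");
            Files.createDirectories(attempt.resolve("part-0000"));
            Files.write(attempt.resolve("part-0000").resolve("subfile"), new byte[]{1});
            Files.write(attempt.resolve("part-m-00001"), new byte[]{2});
            return destinationKeys(attempt, "dest").equals(
                Arrays.asList("dest/part-0000/subfile", "dest/part-m-00001"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The flat-listing version fails exactly where the stack trace above shows: it hands a directory such as part-m-00000 straight to the upload code.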

Regarding the mock tests, it's all happening because bits of the code now expect an S3A FS, and it's only a base FileSystem, with a limited set of operations, being mocked.

It's not going to be quite enough to mock S3AFS, unless those extra methods come in (which will surface as we try).

FWIW, I'd actually prefer that, wherever possible, real integration tests were used over mock ones. Yes, they are slower; no, yetus and jenkins don't run them; but they really do test the endpoint, and will catch regressions in the S3A client itself, quirks in different S3 implementations, etc. Given you've written them, it'll be good to get them working. And the fault generation is great for testing the resilience of the committer.

Steve Loughran
added a comment - 09/Mar/17 18:14 Right now I'm doing this out of an ASF branch, rebasing onto the HADOOP-13345 branch regularly. That way I can do things without adult supervision
https://github.com/steveloughran/hadoop/tree/s3guard/HADOOP-13786-committer

A fair few of the tests are failing in that the mock calls don't follow the path they've expected; that's the problem with mocks...you are asserting about the internal operations, rather than the final observed state of the SUT. Sometimes that's good for looking into the interior, but it's very, very brittle.

These tests are all doing something ugly to get a mock s3a FS set up for code to get when they ask for an FS. I plan to remove that wrapper mock and inject whatever mock FS is created straight into the FileSystem.get() cache. That's the proper way to do it.

After that, I'll look at why the tests are failing, focusing on ones where results are not what is expected, rather than just mock counter mismatch. I'll assume those are false alarms for now, and only worry about the details once the more functional tests have passed

Steve Loughran
added a comment - 09/Mar/17 19:41 Patch 010; work on the mock tests. Overall, ~50% success rate.
These tests are failing for two main reasons:
I've broken the code
In changing the code, I've broken the test

Dead store to outputUri in org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getOutputPath(JobContext) At StagingS3GuardCommitter.java:[line 940]

Private method org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getPendingJobAttemptsPath() is never called At StagingS3GuardCommitter.java:[line 314]

Private method org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getPendingTaskAttemptsPath(JobContext) is never called At StagingS3GuardCommitter.java:[line 268]


Thomas Demoor
added a comment - 10/Mar/17 16:16 Hi Ryan, thanks for contributing to this effort.
Both approaches have the same concept of having data persistence but postponing object visibility through holding back on the multipart commit. +1 on sharing common code and having 2 committers.
Thoughts after my first pass, on 009.patch:
StagingCommitter:
I can see how this is ideal in a pre s3ablockoutputstream world. Using the stage to disk allows code which is conceptually similar to the existing FileOutputCommitter.
+: well tested and lots of failure handling (crash-during-overwrite protection).
+: flexible, we could still use s3a as the wrappedCommitter for s3guard / consistent s3 implementations and thus not require an HDFS cluster to be around
-: s3client in committer itself (another threadpool; will they fight?)
-: staging to local disk not essential and can cause issues (disk out of space, disk throughput limitation = network is faster, large files ...) = one of the reasons we made the blockoutputstreams
MagicCommitter:
+: less moving parts (no local disk, no second committer for the metadata)
+: s3 complexity solved by reusing s3ablockoutputstream
-: less mileage / testing: if things break, where will there be garbage?
Saw quite some unused variables, methods, etc. but I considered this work in progress so /ignore
There are some extra HEAD requests to verify that things really happened which I think are superfluous. For instance, org/apache/hadoop/fs/s3a/commit/FileCommitActions.java:78 the HEAD after PUT. Is this to get around eventual consistency? This is a HEAD after PUT, so it should be consistent even on AWS. Maybe useful for PUT overwrite, but then you also need to check the ETag. I think this HEAD should be optional behind some config flag.
Varia:
Don't think the default (hdfs) workflow has crash-during-jobcommit protection (fileoutputcommitter algo v2 already makes files visible in taskcommit afaik), so not sure it is required for our committers.

For the staging committer drawbacks, I think there's a clear path to avoid them.

The committer is not intended to instantiate its own S3Client. It does for testing, but when it is integrated with S3A it should be passed a configured client when it is instantiated, or should use package-local access to get one from the S3A FS object. In other words, the default findClient method shouldn't be used; we don't use it other than for testing. My intent was for S3A to have a FileSystem#newOutputCommitter(Path, JobContext) factory method. That way, the FS can pass its internal S3 client instead of instantiating two.

The storage on local disk isn't a requirement. We can replace that with an output stream that buffers in memory and sends parts to S3 when they are ready (we're planning on doing this eventually). This is just waiting on a stable API to rely on that can close a stream, but not commit data. Since the committer API right now expects tasks to create files underneath the work path, we'll have to figure out how tasks can get a multi-part stream that is committed later without using a different method.
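A sketch of that deferred-commit stream idea in pure Java: blocks are buffered and handed off as "parts", close() stops writing without publishing anything, and a separate complete() call makes the data visible. Names are illustrative, not the S3A block output stream API:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Toy stream that buffers blocks in memory and hands them off as
// "parts" when full; close() stops writing but does NOT publish
// anything. A separate complete() call, at commit time, does that.
class DeferredCommitStream extends OutputStream {
    private final int partSize;
    private final List<byte[]> parts = new ArrayList<>();
    private ByteArrayOutputStream current = new ByteArrayOutputStream();
    private boolean closed;
    private boolean completed;

    DeferredCommitStream(int partSize) {
        this.partSize = partSize;
    }

    @Override
    public void write(int b) {
        if (closed) {
            throw new IllegalStateException("stream is closed");
        }
        current.write(b);
        if (current.size() >= partSize) {
            flushPart();
        }
    }

    private void flushPart() {
        parts.add(current.toByteArray()); // stand-in for a part upload
        current = new ByteArrayOutputStream();
    }

    @Override
    public void close() {
        if (!closed) {
            if (current.size() > 0) {
                flushPart();
            }
            closed = true;
        }
    }

    // Called at task/job commit time, not at close().
    void complete() {
        if (!closed) {
            throw new IllegalStateException("not closed");
        }
        completed = true;
    }

    boolean isVisible() {
        return completed;
    }

    int partCount() {
        return parts.size();
    }
}
```

The hard part in the real codebase is exactly what's described above: the committer API expects tasks to create ordinary files under the work path, so the stream needs to reach the committer without a new method on the task side.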

We can also pass in a thread-pool if there is a better one to use. I think this is separate enough that it should be easy.

Ryan Blue
added a comment - 10/Mar/17 18:16

Patch 011; lot more of the tests are working. More specifically, two causes of failure.

the tests which are using mockito to its fullest, TestStagingDirectoryOutputCommitter and TestStagingPartitionedJobCommit, are failing because the FS operations have changed; some differences in which methods get called and when. They're going to have to be synced up and then maintained. It's a pain, but it does let the tests run even without credentials.

The tests which expect the output names to be part-0000-$UUID: I turned that off yesterday while getting the "commit protocol" integration test to work; it doesn't do that, though I know Spark jobs sometimes do for that unique output.

Steve Loughran
added a comment - 10/Mar/17 21:30
I don't know what to do here. Hard code "no uuid suffix" & change the tests: probably breaks Ryan's code, doesn't suit other people. Hard code "uuid suffix": some things work, others get confused. Or go for the option, and double your test space.
This is somewhere where we'll need to call in the people who understand the internals of commitment and its integration

FileCommitActions is my code. Thomas: blame me there. Sanity check by the look of things. Ryan's committer has a near-identical persistent data structure of pending commit information, different commit codepath. It can move over to FileCommitActions once PersistentCommitData also supports ser/deser of a list of commits to a (JSON) file, rather than just one file/pending commit

We are going to have to move sight of the S3Client away from the committers entirely, so that s3guard can stay in sync with what's happening. Otherwise a caller can use the client to complete the PUT, but s3guard won't know to update its tables. S3AFileSystem.WriteOperationHelper has everything needed, or, if it doesn't, can add the rest. I've not gone near that yet, as getting tests working comes ahead of getting the integration complete.

Again, threadpools: no real opinion. The bulk uploads should be assigned to the "slow operations pool" on the FS and if we have a "faster ops pool", all the abort/commit calls in there. I do like that whole Threads design BTW, very nice, sliding in to a java 8+ world nicely.

Steve Loughran
added a comment - 10/Mar/17 21:40

Dead store to outputUri in org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getOutputPath(JobContext) At StagingS3GuardCommitter.java:[line 943]

Private method org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getPendingJobAttemptsPath() is never called At StagingS3GuardCommitter.java:[line 316]

Private method org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.getPendingTaskAttemptsPath(JobContext) is never called At StagingS3GuardCommitter.java:[line 270]

On UUID suffixes: the option is needed for the "append" conflict resolution. Without it, you can easily overwrite files from previous writes. We also use it to identify files across partitions added in the same batch. I think it is worth keeping if this committer is to be used for more than just creating or replacing a single directory of files.
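The unique-name scheme can be sketched as follows; the suffix format here is illustrative, not necessarily what the committer emits:

```java
import java.util.UUID;

// Appending a per-job UUID to each part file so "append" conflict
// resolution never overwrites output from an earlier job, and so all
// files of one batch can be identified across partitions.
// The format is illustrative only.
class UniqueNames {

    static String withJobSuffix(String base, UUID jobId) {
        int dot = base.lastIndexOf('.');
        String stem = dot < 0 ? base : base.substring(0, dot);
        String ext = dot < 0 ? "" : base.substring(dot);
        return stem + "-" + jobId + ext;
    }

    // True if the file was written by the job with this ID.
    static boolean writtenBy(String name, UUID jobId) {
        return name.contains("-" + jobId);
    }
}
```

With this in place, two jobs appending part-0000 to the same partition produce distinct keys instead of silently overwriting each other.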

Ryan Blue
added a comment - 10/Mar/17 23:33

OK, I'll keep as an on/off feature, on by default, off for some of the tests. The UUID is also needed for temp file placement.

On that topic, I've tweaked temp file setup for both testing (use java.io.tmpdir which will be under target) and on HDFS. Really, every FS should have a .getTempDir() method to return a temp dir guaranteed to have R/W access by a specific user.
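A sketch of that kind of helper, building a staging directory under java.io.tmpdir qualified by the current user. The names are mine for illustration, not an actual FileSystem API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Per-user staging directory under java.io.tmpdir, created on demand.
// Illustrative sketch of the ".getTempDir()" idea, not a real API.
class TempDirs {

    static Path userStagingDir(String subdir) {
        String user = System.getProperty("user.name", "unknown");
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), user, subdir);
        try {
            // createDirectories is a no-op if the path already exists.
            return Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Qualifying by user name is what gives the "guaranteed R/W access by a specific user" property on a shared machine.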

Steve Loughran
added a comment - 11/Mar/17 11:57


Steve Loughran
added a comment - 13/Mar/17 21:54 - edited patch 012:
UUID reinstated by default; you can turn this off.
mostly all the unit tests are passing,
the mocks in TestStagingPartitionedJobCommit and TestStagingDirectoryOutputCommitter are having an unexpected delete in mock invocations (good? bad?), and because the staging committer doesn't yet handle directories, some of the protocol tests are failing.
Added AbstractITCommitProtocol subclasses for the directory and partition committers
Those of the protocol IT tests which cover job/task commit failure are failing because the tests aren't looking for the right exception.
One thing to highlight here is that when running these tests from my desktop, the staging commits seem to be faster than the magic ones. Why? A lot less S3 communication over a long haul link during task setup/commit, and there's no real data to upload, so the cost delaying the upload until the task commit is negligible.

Dead store to attemptFS in org.apache.hadoop.fs.s3a.commit.staging.PartitionedStagingCommitter.commitTask(TaskAttemptContext) At PartitionedStagingCommitter.java:[line 94]

All mock tests are working.

More specifically, they were first got working, then the commit logic was tuned to reduce the number of S3 calls:

no exists(path) check before a delete() if the policy is replace

no checking for return code of delete(), as we know it never signals an error, merely that the destination path didn't exist at the time of call.
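Those two optimizations can be illustrated with a toy call-counting store: with delete() documented as idempotent, the exists() probe before it is a wasted HEAD request (illustrative code, not the committer's):

```java
import java.util.HashSet;
import java.util.Set;

// Toy store counting remote calls, to show the tuning above: with a
// delete-is-idempotent contract, the exists() probe before delete()
// is a wasted HEAD request. Illustrative only.
class CallCounter {
    final Set<String> keys = new HashSet<>();
    int calls;

    boolean exists(String key) {
        calls++; // models a HEAD request
        return keys.contains(key);
    }

    boolean delete(String key) {
        calls++; // models a DELETE request
        return keys.remove(key); // false only means "wasn't there"
    }

    // Before: probe, then delete. Two remote calls.
    void deleteChecked(String key) {
        if (exists(key)) {
            delete(key);
        }
    }

    // After: just delete, ignoring the return code. One remote call.
    void deleteUnchecked(String key) {
        delete(key);
    }
}
```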

The tests have also been tuned to be a bit more explicit about what they are declaring and asserting; less repetition in mock object setup.

Also: ability to turn up logging of mock operations, including stack trace level of invocation. Useful to work out things like why more delete() calls are made than expected.

Most of the committer IT tests are working

Everything is working except the IT protocol tests testMapFileOutputCommitter and testConcurrentCommitTaskWithSubDir, which expect directories to be handled. I will do that at least for the Directory Committer, with some mock tests as well as a fixed IT test, and skip them in the partition committer

LambdaTestUtils tuning

Ryan's patch had some assertThrown() assertions which I've been moving to the common test base.

While intercept() has some features assertThrown() lacked, it doesn't support handing down extra diagnostics messages. Fixed. We could go one step further and allow callers to provide a closure () -> String for diagnostics, perhaps, though maybe we can wait to see what JUnit 5 has first
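A sketch of what intercept() plus a lazily evaluated diagnostics closure could look like; this is the idea only, not the actual LambdaTestUtils signature:

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Sketch of an intercept() that accepts a () -> String diagnostics
// closure, evaluated only on failure. Illustrative, not the real
// LambdaTestUtils API.
class Intercepts {

    static <E extends Throwable> E intercept(Class<E> expected,
            Supplier<String> diagnostics, Callable<?> action) {
        Object result;
        try {
            result = action.call();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t); // the exception we wanted
            }
            throw new AssertionError("Expected " + expected.getSimpleName()
                + " but got " + t + "; " + diagnostics.get(), t);
        }
        throw new AssertionError("Expected " + expected.getSimpleName()
            + " but the call returned " + result + "; " + diagnostics.get());
    }
}
```

Because the closure is only invoked when the assertion fails, expensive diagnostics (listing a bucket, dumping state) cost nothing on the happy path.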

TODO

directory trees in the directory committer

move to direct API calls on S3A

when s3guard is enabled, make sure PUT commits are updating the entire metastore tree.

Steve Loughran
added a comment - 14/Mar/17 21:33 HADOOP-13786 patch 013.

Becoming ready for review, the merge is done as far as core functionality through a direct s3 connection is concerned. I've migrated Ryan's code and fixed his tests and mine.

Unified the commit logic between staging committers; there's an explicit preCommit(context, pending) in StagingS3GuardCommitter for subclasses to override (Base: no-op); if the precommit fails, pending ops are rolled back (existing code in PartitionStagingCommitter).
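The preCommit/rollback contract described above can be sketched as follows. Every name here (PendingUpload, commitJob, the string states) is an illustrative stand-in, not the real committer API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative pending-upload record: pending -> committed | aborted.
class PendingUpload {
    final String key;
    String state = "pending";
    PendingUpload(String key) { this.key = key; }
}

// Sketch of a base committer with an overridable preCommit hook; if the
// hook throws, pending uploads are rolled back instead of completed.
public class StagingCommitterSketch {
    final List<PendingUpload> pending = new ArrayList<>();

    /** Subclasses override for conflict checks; base implementation is a no-op. */
    protected void preCommit(List<PendingUpload> pending) throws Exception { }

    public void commitJob() {
        try {
            preCommit(pending);
        } catch (Exception e) {
            for (PendingUpload p : pending) p.state = "aborted";  // roll back
            throw new RuntimeException("pre-commit failed, rolled back", e);
        }
        for (PendingUpload p : pending) p.state = "committed";
    }

    public static void main(String[] args) {
        StagingCommitterSketch committer = new StagingCommitterSketch();
        committer.pending.add(new PendingUpload("part-0000"));
        committer.commitJob();
        System.out.println("state: " + committer.pending.get(0).state);
    }
}
```

A partition committer would override preCommit to fail on destination-partition conflicts, which is what triggers the rollback path.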

added subdirectory support to the DirectoryStagingCommitter

added option to create _SUCCESS markers in these committers, default is true, as per the FileOutputCommitter

some more tuning in the conflict resolution to trim the number of S3 calls. Every getFileStatus/exists call is sacred

all tests are passing

all integration tests are passing

TODO

More tests needed, obviously. I'm thinking of some scale ones with many files

some metrics; it'd be good to know the number of files and amount of bytes uploaded in commits. This is implicitly measured, but not called out. Knowing bytes uploaded in commit will show impact of commit. If compared with files_copied_bytes (done in S3) you can start to estimate impact of operations.

switch to FileCommitActions and the S3a methods for the s3 ops. This will require us to mock all that stuff too; still thinking of best way.

the MPU commits to update s3guard

As an aside, because MPUs have request IDs which can be cancelled explicitly, it could be possible for a future committer to actually write direct to S3, saving the pending data to the committer directory in HDFS. This would blur the magic committer with the staging one; just rely on HDFS to implement the consistent commit logic through renames, and so use s3 as a destination with or without consistency there. I think you'd still need the _magic directory though, so the working dir would trigger the special output operation.

Steve Loughran
added a comment - 15/Mar/17 21:51 - edited Patch 013.

On metrics: our layer on top of the staging committer tracks metrics and sends them to the job committer using the same PendingCommit that already gets serialized. That's an easy way to get more data back to the job committer, which then accumulates the number of files, sizes, etc. and stores it somewhere (or logs it?).
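Ryan's scheme can be sketched in miniature. PendingCommit here is a stand-in for the real serialized class, with a hypothetical metrics map bolted on; the job committer sums every task's map into one view:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: tasks piggyback their metrics on the pending-commit record they
// already serialize; the job committer aggregates them. Names illustrative.
public class MetricsAggregation {
    static class PendingCommit {
        final String file;
        final Map<String, Long> metrics = new HashMap<>();
        PendingCommit(String file, long bytes) {
            this.file = file;
            metrics.put("bytes_uploaded", bytes);
            metrics.put("files", 1L);
        }
    }

    /** Job committer side: sum every task's metric map into one total view. */
    static Map<String, Long> aggregate(Iterable<PendingCommit> commits) {
        Map<String, Long> totals = new HashMap<>();
        for (PendingCommit c : commits) {
            c.metrics.forEach((k, v) -> totals.merge(k, v, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = aggregate(java.util.Arrays.asList(
            new PendingCommit("part-0000", 1024),
            new PendingCommit("part-0001", 2048)));
        System.out.println(totals);  // aggregated map, e.g. {files=2, bytes_uploaded=3072}
    }
}
```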

Ryan Blue
added a comment - 15/Mar/17 22:16

S3a toString prints them; the input & output streams also print theirs. All 2.8+ filesystems support a new method, getStorageStatistics(), which returns the map of name -> counter value; HDFS & s3a implement this & share common keys where possible.

For a 2.8+ committer only, I plan to grab the FS details from every task and, like you say, copy them back to the job for printing. Spark itself could be extended there; while moving to 2.8+ isn't going to happen, if there was a way for every job to provide a key->long map, that 2.8+ committer could return the stats.

Side issue: filesystems are per user; multiple executors on the same FS will pick up shared values. We could have the committer independently track the bytes uploaded in the commit, that being the key delay. Actually, that's implicitly in there in the object length field.

I like the idea of serializing in the pending commit data, though that doesn't count the details of uncommitted IO, does it. We'd really want that too, somehow. It's still work, just wasted work.

Tez already grabs and logs the storage stats, uploads them to ATS, incidentally.

Steve Loughran
added a comment - 16/Mar/17 14:50 For metrics, S3a now collects a very detailed set covering low-level HTTP verbs, stream reads (including aborts and seek lengths), upload data, and errors seen: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Statistic.java

mostly trying to get checkstyle to STFU, except where I felt it was spurious.

The IT committer tests verify that task commit doesn't materialize any part- files in the dest dir. The magic committer does; I've got that bit of the protocol wrong. Which now leaves me with the problem of: how to get the information about committed tasks into the job dir without renames? Plan: read in the .pending files and save a single "all PUTs in this task" file, unified with the staging committers; delete the task area

Steve Loughran
added a comment - 16/Mar/17 19:26 patch 015 (skipping 14 for various reasons)

checkstyles, still leaving line length. I really can't be motivated to encapsulate all the fields in test base classes. This ain't production code.

the failing test pair looks to be due to ordering of list output; now sorting them both before comparing
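The fix in miniature, assuming plain string listings: listing output from the store comes back in no guaranteed order, so sort both sides before asserting equality.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Order-independent listing comparison: sort copies of both sides first.
public class SortedListingCompare {
    static List<String> sorted(List<String> paths) {
        List<String> copy = new ArrayList<>(paths);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("part-0000", "part-0001", "part-0002");
        List<String> actual = Arrays.asList("part-0002", "part-0000", "part-0001");
        // without sorting, this comparison is order-dependent and flaky
        if (!sorted(expected).equals(sorted(actual))) {
            throw new AssertionError("listings differ");
        }
        System.out.println("listings match");
    }
}
```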

No other changes

The test run here really played up, including unrelated stuff like ITestS3AContractDistCp, which runs standalone. I think there's multipart upload cleanup going on which is interfering with parallel tests.

Plan:

identify where that's happening.

have a sysprop indicating parallel/serial test running, and in serial mode: cleanup
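A sketch of the sysprop idea; the property name fs.s3a.test.parallel is made up for illustration. Only a serial run purges pending multipart uploads, so a parallel run doesn't destroy another test's in-flight uploads:

```java
// Sketch: a system property (name hypothetical) tells the test run whether
// it is parallel; cleanup of pending uploads only happens in serial mode.
public class TestCleanupPolicy {
    static final String PROP = "fs.s3a.test.parallel";   // hypothetical property name

    static boolean shouldCleanup() {
        // default to "parallel" (no cleanup) as the safe choice
        return !Boolean.parseBoolean(System.getProperty(PROP, "true"));
    }

    public static void main(String[] args) {
        System.setProperty(PROP, "false");   // serial run
        System.out.println("cleanup in serial mode: " + shouldCleanup());
    }
}
```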

Steve Loughran
added a comment - 17/Mar/17 12:59 Patch 016, jenkins issues

patch 017
More tests of what happens in various abort sequences; I'll next have the staging task committers provide a way to locate their local dirs, so I can verify they get deleted.

Minor doc tuning...turning off the TOC macro gets it to render.

new ITest case testAbortJobNotTask shows that Magic committer jobAbort() doesn't reliably abort pending requests from tasks. While I'm focusing on the staging, I do want this to at least be passing the basic tests.

Steve Loughran
added a comment - 21/Mar/17 18:51 patch 017

the staging committer throws an exception if the cluster default FS == s3a, at least for now it requires hdfs, file, etc. I don't want confusion in testing, maybe once we are happy that s3guard delivers the consistency we need then it can be supported.

The magic committer is failing tests, it's clear its still immature: I'm focusing on the staging ones. The only main changes there are in cleanup/abort logic.

Steve Loughran
added a comment - 22/Mar/17 18:49 Patch 018
Lot more testing on lifecycle corner cases, especially sequences like "commit job while a task was neither committed nor aborted"
tests moving towards supporting part-0000+UUID in path asserts, though not re-enabled the UUID config option yet.
lots more cleanup in committers of: pending commits, local staging dirs.
I've now got this working with Spark; people should look at https://github.com/steveloughran/spark-cloud-examples for some tests there.

It's not so much the code quantity as the fact you end up with having to know the commit protocol before you can judge the quality of it, and that's something else to learn. I learned it by single-stepping through the debugger.

I think I could/should split out the magic committer as it is the one doing more to the internals of s3aFS. I think it's a good evolutionary feature, but the staging one should ship first. In which case some of the S3aFS logic can be temporarily culled, though I'd like to retain the changes to the block output stream to allow it to write-complete vs write-pending. The enhancements to WriteOperationsHelper, json persistent format &c are actually things the staging committer(s) can share, and will be needed to keep s3guard's DDB tables up to speed.

Steve Loughran
added a comment - 22/Mar/17 20:38
I could do a tour early next week, say Tuesday morning PST?

Aaron Fabbri
added a comment - 22/Mar/17 20:50
you end up with having to know the commit protocol before you can judge the quality of it, and that's something else to learn.
Indeed. Also, I appreciated the documentation included in the patch, thanks.
I could do a tour early next week, say Tuesday morning PST?
Yes, that would be great.

On the staging wrappedCommitter exception on s3a. Could we have a config flag to override that? I think that would satisfy test sanity but still enable consistent s3 implementations to run without deploying an hdfs cluster just for this.
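A sketch of what such a flag could look like; the option name and the plain string-map config are illustrative, not the real s3a configuration. The guard stays on by default but a consistent S3 implementation can opt out:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "cluster FS must not be s3a" guard with a config override.
// The option key is hypothetical.
public class ClusterFsGuard {
    static final String ALLOW_S3A = "fs.s3a.committer.staging.allow-s3a-cluster-fs";

    static void checkClusterFs(String scheme, Map<String, String> conf) {
        boolean allow = Boolean.parseBoolean(conf.getOrDefault(ALLOW_S3A, "false"));
        if ("s3a".equals(scheme) && !allow) {
            throw new IllegalStateException(
                "staging committer requires a consistent cluster FS (hdfs, file, ...);"
                + " set " + ALLOW_S3A + "=true to override");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(ALLOW_S3A, "true");
        checkClusterFs("s3a", conf);   // passes with the override set
        System.out.println("override accepted");
    }
}
```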

I'd be interested in a tour, in my first review I quickly found out that the rabbithole is deep

Thomas Demoor
added a comment - 22/Mar/17 21:15

Steve Loughran
added a comment - 23/Mar/17 13:01 Thomas, I'm aware of your needs. I just put that lock in there for now as my integration tests did have s3a as the default FS, and the debug logs had me confused. Will make it an option.
How would a webex at 09:00 PST work next week? The EU will have switched TZ so it'll be 17:00 my time, 18:00 Thomas's. Or I can slip to 10:00 PST if that suits.

Lots of work on cleanup, including abortJob purging all pending uploads against a dest. Without that, there's actually a problem in staging aborts. But what? testRecoveryAndCleanup was finding the problem

this is starting to use commitActions in Staging tests, complicating mocking a bit (more methods need to be handled/stubbed).

a bit more docs on lifecycle, but not enough.

still confusion about when to use (deprecated) cleanup() calls vs abort?

track "role" of an instance, included in logs and exceptions. This makes a big difference in IT tests downstream where there is >1 role/instance per process and you want to identify which is doing work.

Short term todos:

IT test on commit contention, verify that the committed output completes

Steve Loughran
added a comment - 23/Mar/17 18:39 HADOOP-13786 patch 019: cleanup of jobs & tasks
an IT test of a real MR job
more testing in my downstream module
better specification of what's going on with the commit protocol

And for the very curious, here's a trace of a test run of the latest code doing a simple sparkContext.makeRDD(1 to count).saveAsTextFile("s3a:///something") with the directory committer. There's a job committer and task committer in the logs, hence the role log info to work out what's happening. fs.s3a mostly at debug, except for instrumentation. FS stats exclude all the info on the uploads because they (currently) go straight through the s3a client.

fault injection class created for ability to fail any of the IT test operations, see what happens in the protocols.

IT test of MR job, AbstractITCommitMRJob based on the mock one: sharing core mini cluster setup, test case modified to look for final data. Currently just staging, possible for the others too. They'll need separate assertions. Maybe this can just be parameterized.

StorageStatisticsTracker to snapshot/diff storage stats. I'd hoped to use this for tracking IO in MR job, but as MR doesn't collect storage stats, not yet used. It should be useful in other tests though.
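The snapshot/diff idea behind such a tracker, sketched over plain counter maps rather than the real StorageStatistics API: capture the counters before an operation, capture them again after, and report only the deltas.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of snapshot/diff over storage-statistics counters. The counter
// names are illustrative.
public class StatsDiff {
    static Map<String, Long> diff(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> delta = new HashMap<>();
        after.forEach((k, v) -> {
            long d = v - before.getOrDefault(k, 0L);
            if (d != 0) delta.put(k, d);   // report only counters that moved
        });
        return delta;
    }

    public static void main(String[] args) {
        Map<String, Long> before = new HashMap<>();
        before.put("object_put_requests", 3L);
        Map<String, Long> after = new HashMap<>();
        after.put("object_put_requests", 7L);
        after.put("object_delete_requests", 2L);
        System.out.println(diff(before, after));   // deltas only
    }
}
```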

The new IT MR test works, once I've enabled the -unique-filenames option in the test. I've also turned on the same switch for the protocol tests: That doesn't work for the mapper tests.

java.io.FileNotFoundException: index file in s3a://hwdev-steve-ireland-new/test/ITestDirectoryCommitProtocol-testMapFileOutputCommitter/part-m-00000: not found s3a://hwdev-steve-ireland-new/test/ITestDirectoryCommitProtocol-testMapFileOutputCommitter/part-m-00000/index in s3a://hwdev-steve-ireland-new/test/ITestDirectoryCommitProtocol-testMapFileOutputCommitter/part-m-00000
at org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:779)
at org.apache.hadoop.fs.contract.ContractTestUtils.assertPathExists(ContractTestUtils.java:757)
at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertPathExists(AbstractFSContractTestBase.java:294)
at org.apache.hadoop.fs.s3a.commit.AbstractITCommitProtocol.validateMapFileOutputContent(AbstractITCommitProtocol.java:607)
at org.apache.hadoop.fs.s3a.commit.AbstractITCommitProtocol.testMapFileOutputCommitter(AbstractITCommitProtocol.java:947)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://hwdev-steve-ireland-new/test/ITestDirectoryCommitProtocol-testMapFileOutputCommitter/part-m-00000/index
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1906)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1802)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1764)
at org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:773)

Steve Loughran
added a comment - 24/Mar/17 19:17 patch #20: test changes; nothing to the production code
lambdas -> anonymous classes so things would backport to branch-2

side issue: the code uses the traditional part-m-%04d naming pattern. Given that S3 uses the first few characters of a key for its hashing, would it be better to have a pattern like "p-%04d", or, to be really daring, put the UUID first? I don't think you'd see the benefits of this unless the output was almost at the root of the bucket, though: https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
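The naming variants under discussion, side by side (the formats are from the comment above; the UUID-first layout is illustrative):

```java
import java.util.UUID;

// The classic part-m-%04d pattern front-loads identical characters, so every
// key in a job shares the same prefix; the shorter "p-" or a UUID-first
// layout varies the leading characters S3 hashes on.
public class PartNames {
    static String classic(int part)   { return String.format("part-m-%04d", part); }
    static String shortForm(int part) { return String.format("p-%04d", part); }
    static String uuidFirst(String jobUuid, int part) {
        return String.format("%s-part-%04d", jobUuid, part);   // illustrative layout
    }

    public static void main(String[] args) {
        String uuid = UUID.randomUUID().toString();
        System.out.println(classic(7));          // part-m-0007
        System.out.println(shortForm(7));        // p-0007
        System.out.println(uuidFirst(uuid, 7));
    }
}
```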

Steve Loughran
added a comment - 27/Mar/17 10:53

in the move, switched the various lists that the thread-pooled staging code builds up to being synchronized lists. I think there may have been risk of race conditions there.
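The race being closed here, in miniature: multiple pool threads append their pending commits to one shared list, which therefore needs to be a synchronized (or otherwise thread-safe) collection rather than a plain ArrayList.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: concurrent appends from a thread pool into one shared list.
// With a bare ArrayList, adds can be lost or throw; the synchronized
// wrapper makes each add atomic.
public class SynchronizedCollect {
    static List<String> collect(int tasks) {
        List<String> pending = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < tasks; i++) {
            final int n = i;
            pool.execute(() -> pending.add("upload-" + n));  // concurrent appends
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return pending;
    }

    public static void main(String[] args) {
        System.out.println("collected " + collect(100).size() + " pending commits");
    }
}
```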

Other changes

default unique name option == false

tests can handle option of unique vs non-unique filenames

and the partition committer skips the Mapper test. Doesn't make sense.

Essentially: the unique name algorithm doesn't work with map tasks, as they expect a part-m-0000/ dir with children explicitly named "index" and "data". Adding unique names under index and data breaks this.

I'm still undecided about what the best default value is here, more insight and experimentation needed

Steve Loughran
added a comment - 27/Mar/17 19:21 HADOOP-13786 patch 021: unifying data structures and s3 client use
This iteration is starting to move towards using the S3AFS writeOperations, with the first step being unified JSON persistence across all committers.
New serializable type, MultiplePendingCommits, containing a list of SinglePendingCommit instances; the latter being the things the magic committer saves.
SinglePendingCommit adds fields/operations needed by the staging committer. It's a bit ungainly right now; view it as an interim step.
Staging committers and tests all moved over to this datatype instead of a serialized Java stream. Good: debugging, security, validation logic. Bad: JSON serialization overhead.

IT Test testMapFileOutputCommitter failures with unique IDs enabled

This is failing as the unique ID goes onto the path, such as dest/part-m-0000/index-job-attempt_01, part-m-000/data-job-attempt_01. As a result, the files index and data aren't found in the test or the reducer. The naming needs to go into part-m-0000-$UUID/index and part-m-0000-$UUID/data.

I've disabled unique filenames by default; adapted files to match

for the partitioned committer, it doesn't work with mappers anyway, because of how things are handled in the commit (as far as I can tell, though I could be confused here). Disabled the relevant IT test.

TestStatingMRJob Failures

Reasons for failure in Yetus unclear. Locally, appears to work, but in the AM log there's a stack warning that s3a isn't bonding

What's happening? Presumably, the mock S3 client isn't getting as far as the MockedStagingCommitter in the AM: a new process is starting and so the special committer pushed into the FileSystem.get() cache isn't being found.

Fix? Presumably to revert to the trick as the original mock tests did: patch in a new fs.s3a.impl class into the job submission and use it.

Steve Loughran
added a comment - 28/Mar/17 13:50
The stack warning from the AM log:
2017-03-24 16:56:37,468 WARN [CommitterEvent Processor #1] org.apache.hadoop.fs.s3a.S3AUtils: Failed to delete s3a://bucket-name/output/path/_temporary
java.nio.file.AccessDeniedException: s3a://bucket-name/output/path/_temporary: getFileStatus on s3a://bucket-name/output/path/_temporary: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 5F7696A22A66FAAF), S3 Extended Request ID: cVVOEgwybZL0YQbQ0fwFbPMv7GO/TOBAVJed//EQsQUX9dGNkgzCrLy1U6ZbMHmb9qQyKzZSOcw=
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:159)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:102)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1832)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1802)
at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:1416)
at org.apache.hadoop.fs.s3a.S3AUtils.deleteWithWarning(S3AUtils.java:756)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.deleteDestinationPaths(StagingS3GuardCommitter.java:817)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.cleanup(StagingS3GuardCommitter.java:780)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.commitJobInternal(StagingS3GuardCommitter.java:685)
at org.apache.hadoop.fs.s3a.commit.staging.StagingS3GuardCommitter.commitJob(StagingS3GuardCommitter.java:625)
at org.apache.hadoop.fs.s3a.commit.staging.MockedStagingCommitter.commitJob(MockedStagingCommitter.java:84)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:286)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:238)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 5F7696A22A66FAAF), S3 Extended Request ID: cVVOEgwybZL0YQbQ0fwFbPMv7GO/TOBAVJed//EQsQUX9dGNkgzCrLy1U6ZbMHmb9qQyKzZSOcw=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1254)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:747)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:721)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:672)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:654)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:518)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4185)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4132)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1245)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1081)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1817)
... 13 more
2017-03-24 16:56:37,470 INFO [CommitterEvent Processor #1] org.apache.hadoop.fs.s3a.commit.
What's happening? Presumably the mock S3 client isn't getting as far as the MockedStagingCommitter in the AM: a new process is starting, so the special committer pushed into the FileSystem.get() cache isn't being found.
Fix? Presumably to revert to the same trick the original mock tests used: patch a new fs.s3a.impl class into the job submission configuration and use it.
Note that we do not want to do that for the real IT tests.
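That "patch in a new fs.s3a.impl" trick could look like the configuration fragment below. This is a sketch only: `MockS3AFileSystem` is an assumed class name, not necessarily the one used in the tests; the second property uses Hadoop's standard `fs.<scheme>.impl.disable.cache` pattern so the freshly started AM process doesn't pick up a cached real filesystem instance.

```xml
<!-- Hypothetical job-submission override for the mock tests:
     route s3a:// URIs to a mock filesystem class, and bypass the
     FileSystem cache so a new process (the AM) also resolves it.
     MockS3AFileSystem is an illustrative name. -->
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.MockS3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl.disable.cache</name>
  <value>true</value>
</property>
```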

Patch 022
Made FileCommitActions the way the S3 operations are accessed; the staging committers no longer have their own S3Client. Instead they pass the FileCommitActions instance down to the S3Util (I've also renamed it StagingUtil to make clear it's about staging work, rather than generic S3 work). Fixing up the mock tests to handle this change was more work than the production-side changes.
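The delegation pattern being described, where committers route store operations through one injected actions object rather than each holding a client, might be sketched as below. All names here are illustrative stand-ins, not the actual Hadoop classes; the point is that tests can substitute a recording implementation instead of mocking an S3 client.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the "single actions object" pattern: the
// committer never touches a raw S3 client, only this interface.
interface CommitActions {
    void abortMultipartUploadsUnderPath(String prefix);
}

// Test double: records calls instead of talking to a store.
class RecordingCommitActions implements CommitActions {
    final List<String> calls = new ArrayList<>();
    public void abortMultipartUploadsUnderPath(String prefix) {
        calls.add("abort:" + prefix);
    }
}

class StagingCommitterSketch {
    private final CommitActions actions;   // injected, mockable in tests
    StagingCommitterSketch(CommitActions actions) { this.actions = actions; }

    // Cleanup delegates to the actions object rather than an owned client.
    void cleanup(String destPrefix) {
        actions.abortMultipartUploadsUnderPath(destPrefix);
    }
}

public class Demo {
    public static void main(String[] args) {
        RecordingCommitActions actions = new RecordingCommitActions();
        new StagingCommitterSketch(actions).cleanup("dest/job-0001/");
        System.out.println(actions.calls.get(0));
    }
}
```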

Testing:

serialized tests work.

parallel tests: found/fixed a bug where pending commits to other bits of the filesystem were being found and aborted (the prefix setup on listMultipartUploads was wrong).
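The prefix bug comes down to scoping: only multipart uploads whose keys sit under the job's destination prefix should be aborted, and everything else in the bucket must be left alone. A pure-Java sketch of that filter (not the actual S3A code) shows why normalizing the prefix with a trailing slash matters:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PrefixScope {
    // Normalize so "dest/out" cannot accidentally match "dest/output2".
    static String normalize(String prefix) {
        return prefix.endsWith("/") ? prefix : prefix + "/";
    }

    // Keep only pending upload keys under the destination prefix;
    // these are the ones safe to abort during cleanup.
    static List<String> uploadsToAbort(List<String> pendingKeys, String destPrefix) {
        String p = normalize(destPrefix);
        return pendingKeys.stream()
            .filter(k -> k.startsWith(p))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pending = Arrays.asList(
            "dest/out/part-0000",      // under this job's destination
            "dest/output2/part-0000",  // similarly-named sibling dir
            "other/job/part-0001");    // a different job entirely
        System.out.println(uploadsToAbort(pending, "dest/out"));
    }
}
```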

parallel committer tests are failing because they all use the same temp dir under target/ for their local staging dir; as job/task IDs are also shared, running >1 committer test in parallel will fail. I need to set up a per-fork temp dir.
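One way to get a per-fork temp dir is Surefire/Failsafe's `${surefire.forkNumber}` placeholder, which expands to a distinct number per forked JVM. A hedged sketch of the POM change (the property names passed to the tests are illustrative, not necessarily those in hadoop-aws):

```xml
<!-- Give each forked test JVM its own staging dir and a unique ID
     to fold into job/task IDs. forkNumber is 1..forkCount per JVM.
     The system property names here are assumed, not the real ones. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <configuration>
    <forkCount>4</forkCount>
    <systemPropertyVariables>
      <test.staging.tmp.dir>${project.build.directory}/staging-${surefire.forkNumber}</test.staging.tmp.dir>
      <test.unique.fork.id>fork-${surefire.forkNumber}</test.unique.fork.id>
    </systemPropertyVariables>
  </configuration>
</plugin>
```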

Steve Loughran
added a comment - 28/Mar/17 19:27 Patch 022

Work done while mostly offline/travelling, so more housekeeping than feature work. People on conference wifi get upset if you do S3 scale tests.

Unified constants across committers and with S3 core (e.g. the same partition size used for staging committer part uploads as for block output)

made all the import structures consistent

WiP on documenting config options.

Made the Magic committer's abort outcomes more explicit by moving to an enum of outcomes rather than a simple success/fail, adding ABORTED and ABORT_FAILED to achieve this.
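The "enum of outcomes" idea can be sketched as follows: abort reports which state it actually reached instead of a bare boolean, so callers can distinguish a failed abort from one that had nothing to do. ABORTED and ABORT_FAILED come from the comment above; the third value and the decision logic are assumptions for illustration.

```java
public class AbortDemo {
    // Outcome enum: richer than success/fail. ALREADY_COMPLETED is an
    // assumed extra state, not from the patch description.
    enum AbortOutcome { ABORTED, ABORT_FAILED, ALREADY_COMPLETED }

    // Hypothetical abort: report what happened rather than true/false.
    static AbortOutcome abortUpload(boolean uploadExists, boolean storeHealthy) {
        if (!uploadExists) {
            return AbortOutcome.ALREADY_COMPLETED;  // nothing pending to abort
        }
        return storeHealthy ? AbortOutcome.ABORTED : AbortOutcome.ABORT_FAILED;
    }

    public static void main(String[] args) {
        System.out.println(abortUpload(true, true));
        System.out.println(abortUpload(false, true));
    }
}
```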

Test changes primarily related to intermittent test failures; one problem was that if >1 committer test ran in parallel, they could interfere. The Fork ID is now used for the Job ID. That is trickier than you'd think, hence changes to the hadoop-aws POM.

Next todo items

Rebase onto latest HADOOP-13345 (after that catches up with trunk again)

Move staging onto s3a FS methods

instrument committer operations (maybe add counters for the committers, or just have them access some new FS stats: "pending commits", "completed", ...)

Steve Loughran
added a comment - 10/Apr/17 16:21 Patch 23

Looks like the latest code is failing in TestStagingMRJob when there's no network, as it fails to bind to the test object store. Even though it binds to the mock FS, it looks like something is also trying to talk to the real one.