[ https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277605#comment-13277605
]
Catalin Alexandru Zamfir commented on AVRO-1090:
------------------------------------------------
Doug, the two patches are identical; I guess the same file was uploaded twice. In the second
patch, the appendTo signature should be appendTo(SeekableInput, etc.).
It's OK to commit with the "SeekableInput" change. It passed our tests: each thread wrote 10M,
and we were able to read the data back.
> DataFileWriter should expose "sync marker" to allow concurrent writes to same .avro file
> ----------------------------------------------------------------------------------------
>
> Key: AVRO-1090
> URL: https://issues.apache.org/jira/browse/AVRO-1090
> Project: Avro
> Issue Type: Bug
> Affects Versions: 1.6.3
> Reporter: Catalin Alexandru Zamfir
> Assignee: Doug Cutting
> Fix For: 1.7.0
>
> Attachments: AVRO-1090.patch, AVRO-1090.patch
>
>
> We're writing to Hadoop via DataFileWriter (FSDataOutputStream), with
two threads per node on 8 nodes. Some of the nodes share the same path. For example, our
TimestampedWriter class takes a path argument and appends the timestamp to it (e.g. SomePath/2012/05/14).
Thus, two threads or two nodes can access the same path. The race when these streams
are written is resolved by checking whether the file already exists (has been created by a faster
thread). If so, the writer appends instead of creating the file on HDFS.
> The problem is that DataFileWriter generates a random 16-byte sync marker for each instance.
So two threads with two different writer instances have different sync markers. That means
that reading the data back fails with an IOException ("Invalid sync!").
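The failure described above can be reproduced with a toy model. The sketch below is not Avro's on-disk format and uses no Avro classes; it only assumes that a file starts with a 16-byte marker, that every block ends with the marker of the writer that produced it, and that the reader rejects a block whose marker differs from the header's. All class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Toy model only: this is NOT Avro's file format and calls no Avro API.
final class SyncMarkerSketch {
    static final int SYNC_SIZE = 16;

    /** Each "writer instance" draws its own random 16-byte marker. */
    static byte[] newSyncMarker() {
        byte[] sync = new byte[SYNC_SIZE];
        new SecureRandom().nextBytes(sync);
        return sync;
    }

    /** Appends one block: payload length, payload bytes, then the writer's marker. */
    static void writeBlock(DataOutputStream out, byte[] sync, String payload) throws IOException {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        out.writeInt(data.length);
        out.write(data);
        out.write(sync);
    }

    /** Builds a two-block file; the second writer reuses the marker or draws a new one. */
    static byte[] buildFile(boolean shareMarker) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            byte[] first = newSyncMarker();
            out.write(first); // toy header: just the marker
            writeBlock(out, first, "written by thread 1");
            byte[] second = shareMarker ? first : newSyncMarker();
            writeBlock(out, second, "appended by thread 2");
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts blocks, failing on any block whose marker differs from the header's. */
    static int countBlocks(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        byte[] headerSync = new byte[SYNC_SIZE];
        in.readFully(headerSync);
        byte[] blockSync = new byte[SYNC_SIZE];
        int blocks = 0;
        while (in.available() > 0) {
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            in.readFully(blockSync);
            if (!Arrays.equals(headerSync, blockSync)) {
                throw new IOException("Invalid sync!");
            }
            blocks++;
        }
        return blocks;
    }

    /** True if the whole file can be read back without a marker mismatch. */
    static boolean isReadable(byte[] file) {
        try {
            countBlocks(file);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("shared marker readable:    " + isReadable(buildFile(true)));
        System.out.println("different marker readable: " + isReadable(buildFile(false)));
    }
}
```

In this model the file stays readable only when the appending writer reuses the marker already stored in the header, which is the behaviour the issue asks DataFileWriter to make possible.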
> There's a big performance penalty here. Because only one writer at a time can write to
a given path, it becomes a bottleneck. For 1B (billion) rows, it took us 4 hours to generate
& load. With 20 concurrent threads, it took only 12.5 minutes.
> If DataFileWriter exposed the sync marker, a developer could read it and make
sure that the next thread that appends to the file uses the same sync marker. I don't know
whether it's even possible to expose the sync marker so that other DataFileWriter instances
can share the marker from the file. We have a workaround: we make sure each writer
is a unique instance and generate a path based on that uniqueness. But instead of
"SomePath/2012/05/14/Shard.avro" we now have "SomePath/2012/05/14/Shard-some-random-UUID.avro"
for each of the writers that write the data.
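A minimal sketch of that per-writer path workaround, using only the JDK. The class and method names are hypothetical, and the way the dated directory is passed in is an assumption; the issue only gives the resulting path shape.

```java
import java.util.UUID;

// Hypothetical helper, not part of Avro or of the reporter's TimestampedWriter.
final class ShardPaths {
    /** Builds "<datedDir>/Shard-<writerId>.avro" for one writer instance. */
    static String uniqueShardPath(String datedDir, UUID writerId) {
        // Tolerate a trailing slash so the result never contains "//".
        String dir = datedDir.endsWith("/")
                ? datedDir.substring(0, datedDir.length() - 1)
                : datedDir;
        return dir + "/Shard-" + writerId + ".avro";
    }

    /** Convenience overload: draws a fresh random UUID for the writer. */
    static String uniqueShardPath(String datedDir) {
        return uniqueShardPath(datedDir, UUID.randomUUID());
    }

    public static void main(String[] args) {
        // Two writers targeting the same dated directory get different files.
        System.out.println(uniqueShardPath("SomePath/2012/05/14"));
        System.out.println(uniqueShardPath("SomePath/2012/05/14"));
    }
}
```

Because every writer owns its file, no two instances ever mix sync markers, at the cost of many shard files per directory instead of one.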
> If it can be done, it would be a huge fix for a bottleneck problem, the bottleneck being
the single writer that can write to a single path.
> This has also been requested on the avro-user thread: http://grokbase.com/t/avro/user/122m4sjm1y/is-it-possible-to-append-to-an-already-existing-avro-file
> I just could not find the JIRA ticket for this request.