We're looking at creating a Cascading Scheme for Avro, and have got a few questions below. These are very general, as this is more of a scoping phase (as in, are we crazy to try this) so apologies in advance for lack of detail.

For context, Cascading is an open source project that provides a workflow API on top of Hadoop. The key unit of data is a tuple, which corresponds to a record - you have fields (names) and values. Cascading uses a generalized "tap" concept for reading & writing tuples, where a tap uses a scheme to handle the low-level mapping from Cascading-land to/from the storage format.

So the goal here is to define a Cascading Scheme that will run on 0.18.3 and later versions of Hadoop, and provide general support for reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.

We grabbed the recently committed AvroXXX code from org.apache.avro.mapred (thanks Doug & Scott), and began building the Cascading scheme to bridge between AvroWrapper<T> keys and Cascading tuples.

1. What's the best approach if we want to dynamically define the Avro schema, based on a list of field names and types (classes)?

This assumes it's possible to dynamically define & use a schema, of course.

2. How much has the new Hadoop map-reduce support code been tested?

3. Will there be issues with running in 0.18.3, 0.19.2, etc?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar, and that then creating problems. Anything else?

4. The key integration point, besides the fields+classes to schema issue above, is mapping between Cascading tuples and AvroWrapper<T>

If we're using (I assume) the generic format, any input on how we'd do this two-way conversion?

> Hi all,
>
> We're looking at creating a Cascading Scheme for Avro, and have got a
> few questions below. These are very general, as this is more of a
> scoping phase (as in, are we crazy to try this) so apologies in
> advance for lack of detail.
>
> For context, Cascading is an open source project that provides a
> workflow API on top of Hadoop. The key unit of data is a tuple, which
> corresponds to a record - you have fields (names) and values.
> Cascading uses a generalized "tap" concept for reading & writing
> tuples, where a tap uses a scheme to handle the low-level mapping from
> Cascading-land to/from the storage format.

I am somewhat familiar with Cascading as a user. I am not familiar with how it is implemented or how to customize things like a Tap or Sink.

Correct me if I'm wrong, but its notion of a record is very simple -- there are no arrays or maps -- just a list of fields. This maps to Avro easily.

> So the goal here is to define a Cascading Scheme that will run on
> 0.18.3 and later versions of Hadoop, and provide general support for
> reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
>
> We grabbed the recently committed AvroXXX code from
> org.apache.avro.mapred (thanks Doug & Scott), and began building the
> Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
> tuples.

You might be fine without the org.apache.avro.mapred stuff -- specifically if you only need the sinks and taps to use Avro and not the stuff in between a map and reduce. For example, I have a custom LoadFunc in Pig that can read/write Avro data files working off Avro 1.3.0 -- but it works for a static schema.

> 1. What's the best approach if we want to dynamically define the Avro
> schema, based on a list of field names and types (classes)?

Creating an Avro schema programmatically is fairly straightforward -- especially without arrays, maps, or unions. If the code has access to the Cascading record definition, transforming that into an Avro schema dynamically should be straightforward. Schema has various constructors and static methods from which you can get the JSON schema representation or just pass around Schema objects.

> This assumes it's possible to dynamically define & use a schema, of
> course.
>
> 2. How much has the new Hadoop map-reduce support code been tested?

I can't speak for all of what Doug has done here, but there are unit tests for basic stuff -- word count, etc.

> 3. Will there be issues with running in 0.18.3, 0.19.2, etc?
>
> I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
> and that then creating problems. Anything else?

I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and the only problem was the above library conflict. This is without the new o.a.avro.mapred stuff however.

> 4. The key integration point, besides the fields+classes to schema
> issue above, is mapping between Cascading tuples and AvroWrapper<T>
>
> If we're using (I assume) the generic format, any input on how we'd do
> this two-way conversion?

I'd suggest thinking about using Avro container files for input and output, which may not require the above depending on how Cascading is built internally. In Pig, for example, the LoadFunc defines a Pig schema on input for reading, and everything else from there requires no change. Although this means that it is using the default Pig types and serialization for all the intermediate work, reading and writing inputs and outputs can be done with Avro with minimal effort. Cascading is already defining the M/R jobs, the keys, values, etc., so you may only have to modify the Tap to translate from an Avro schema to the Cascading record to get it to read or write an Avro file.

One can go farther and use AvroWrapper and o.a.avro.mapred to define the M/R jobs, enabling a lot of other possibilities. I can't confidently state what all the requirements are here outside of doing the Cascading record <> Avro schema translation and changing all the touch points that Cascading has on the K/V types.

>> We're looking at creating a Cascading Scheme for Avro, and have got a
>> few questions below. These are very general, as this is more of a
>> scoping phase (as in, are we crazy to try this) so apologies in
>> advance for lack of detail.
>>
>> For context, Cascading is an open source project that provides a
>> workflow API on top of Hadoop. The key unit of data is a tuple, which
>> corresponds to a record - you have fields (names) and values.
>> Cascading uses a generalized "tap" concept for reading & writing
>> tuples, where a tap uses a scheme to handle the low-level mapping
>> from Cascading-land to/from the storage format.
>
> I am somewhat familiar with Cascading as a user. I am not familiar
> with how it is implemented or how to customize things like a Tap or
> Sink.
>
> Correct me if I'm wrong, but its notion of a record is very simple
> -- there are no arrays or maps -- just a list of fields.
> This maps to avro easily.

Correct - currently Cascading doesn't have built-in support for arrays, maps or unions - though I believe arrays & maps are on the list.

>> So the goal here is to define a Cascading Scheme that will run on
>> 0.18.3 and later versions of Hadoop, and provide general support for
>> reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
>>
>> We grabbed the recently committed AvroXXX code from
>> org.apache.avro.mapred (thanks Doug & Scott), and began building the
>> Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
>> tuples.
>
> You might be fine without the org.apache.avro.mapred stuff --
> specifically if you only need the sinks and taps to use Avro and not
> the stuff in between a map and reduce. For example, I have a custom
> LoadFunc in Pig that can read/write avro data files working off Avro
> 1.3.0 -- but it works for a static schema.
>
>> 1. What's the best approach if we want to dynamically define the Avro
>> schema, based on a list of field names and types (classes)?
>
> Creating an Avro schema programmatically is fairly straightforward
> -- especially without arrays, maps, or unions. If the code has
> access to the Cascading record definition, transforming that into an
> Avro schema dynamically should be straightforward. Schema has
> various constructors and static methods from which you can get the
> JSON schema representation or just pass around Schema objects.

We're currently using the string rep, since a Schema isn't serializable, and Cascading needs that to save the defined workflow in the job conf.

[snip]

>> 3. Will there be issues with running in 0.18.3, 0.19.2, etc?
>>
>> I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
>> and that then creating problems. Anything else?
>
> I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and
> the only problem was the above library conflict. This is without
> the new o.a.avro.mapred stuff however.

Great, good to know.

>> 4. The key integration point, besides the fields+classes to schema
>> issue above, is mapping between Cascading tuples and AvroWrapper<T>
>>
>> If we're using (I assume) the generic format, any input on how we'd
>> do this two-way conversion?
>
> I'd suggest thinking about using Avro container files for input and
> output, which may not require the above depending on how Cascading
> is built internally. In Pig for example, the LoadFunc defines a pig
> schema on input for reading, and everything else from there requires
> no change -- although this means that it is using the default pig
> types and serialization for all the intermediate work, reading and
> writing inputs and outputs can be done with Avro with minimal effort.
> Cascading is already defining the M/R jobs, the keys, values, etc...
> so you may only have to modify the Tap to translate from an Avro
> schema to the Cascading record to get it to read or write an Avro
> file.

So far one issue is that we need to translate between Cascading Strings and Avro Utf8 types, but most everything else works just fine.

It's pretty much four routines in the scheme:

- sinkInit (setting up the conf properly, for which we're using the AvroJob support)
- sourceInit (same thing)

- sink (mapping from Tuple to o.a.avro.generic.GenericData)
- source (mapping from o.a.avro.generic.GenericData to Tuple)

The above is all based on the Avro mapred support, so we just have to do the translation work for Fields <-> Schema and Tuple <-> GenericData.
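To make the Tuple <-> GenericData translation concrete, here is a minimal sketch in plain Java. It deliberately uses stand-ins rather than the real classes: a Cascading tuple is modeled as a List<Object>, a generic Avro record as a Map<String, Object>, and Avro's Utf8 as "any CharSequence that is not a String" (current Avro releases have Utf8 implement CharSequence; check this for the version you use). The class and method names are hypothetical and are not part of Cascading or Avro.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the value mapping done in a scheme's source/sink routines. */
class TupleRecordSketch {

    /** "source" direction: copy record values into tuple order, turning Utf8-like values into Strings. */
    static List<Object> toTuple(List<String> fieldNames, Map<String, Object> record) {
        List<Object> tuple = new ArrayList<>();
        for (String name : fieldNames) {
            Object value = record.get(name);
            // Assumption: string data arrives as a CharSequence (e.g. Avro's Utf8), not a String.
            tuple.add(value instanceof CharSequence ? value.toString() : value);
        }
        return tuple;
    }

    /** "sink" direction: pair each tuple value with its field name; values are passed through unchanged. */
    static Map<String, Object> toRecord(List<String> fieldNames, List<Object> tuple) {
        if (fieldNames.size() != tuple.size()) {
            throw new IllegalArgumentException("Expected " + fieldNames.size()
                    + " values but got " + tuple.size());
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            record.put(fieldNames.get(i), tuple.get(i));
        }
        return record;
    }
}
```

In a real scheme the sink direction might also need to wrap Strings in Utf8, depending on what the Avro version's generic writer accepts; the sketch leaves that step out.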

> Hi Scott,
>
> Thanks for the response. See below for my comments...
>
>> Correct me if I'm wrong, but its notion of a record is very simple
>> -- there are no arrays or maps -- just a list of fields.
>> This maps to avro easily.
>
> Correct - currently Cascading doesn't have built-in support for
> arrays, maps or unions - though I believe arrays & maps are on the list.

It would be great if Cascading, Pig, and Hive (along with Avro) could get to some good common ground on all of these data types.

>> Creating an Avro schema programmatically is fairly straightforward
>> -- especially without arrays, maps, or unions. If the code has
>> access to the Cascading record definition, transforming that into an
>> Avro schema dynamically should be straightforward. Schema has
>> various constructors and static methods from which you can get the
>> JSON schema representation or just pass around Schema objects.
>
> We're currently using the string rep, since a Schema isn't
> serializable, and Cascading needs that to save the defined workflow in
> the job conf.

That should work well. The JSON string representation is the canonical, cross-language, serialization of an Avro schema.
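As an illustration of the fields+classes to schema step, here is a minimal sketch that builds the JSON string form of a flat record schema from field names and Java classes, using only the JDK. The class name, the record name and the small class-to-type table are assumptions made for the example; only a handful of primitive types are covered and nullable fields are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: field names + Java classes -> Avro record schema as JSON text. */
class FieldsToAvroSchema {

    /** Maps a Java class to an Avro primitive type name (only a few types in this sketch). */
    static String avroType(Class<?> type) {
        if (type == String.class) return "string";
        if (type == Integer.class) return "int";
        if (type == Long.class) return "long";
        if (type == Float.class) return "float";
        if (type == Double.class) return "double";
        if (type == Boolean.class) return "boolean";
        throw new IllegalArgumentException("No Avro mapping for " + type.getName());
    }

    /** Avro names are restricted to [A-Za-z_][A-Za-z0-9_]*, so validated names need no JSON escaping. */
    static String checkName(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a valid Avro name: " + name);
        }
        return name;
    }

    /** Builds the schema JSON, keeping fields in the iteration order of the map. */
    static String toSchemaJson(String recordName, Map<String, Class<?>> fields) {
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"record\",\"name\":\"").append(checkName(recordName))
            .append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, Class<?>> field : fields.entrySet()) {
            if (!first) json.append(',');
            first = false;
            json.append("{\"name\":\"").append(checkName(field.getKey()))
                .append("\",\"type\":\"").append(avroType(field.getValue())).append("\"}");
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        Map<String, Class<?>> fields = new LinkedHashMap<>();
        fields.put("url", String.class);
        fields.put("fetchTime", Long.class);
        System.out.println(toSchemaJson("CascadingTuple", fields));
    }
}
```

The resulting string can be stored in the job conf as-is and handed to Avro's schema parser on the task side to get a Schema object back.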

> So far one issue is that we need to translate between Cascading
> Strings and Avro Utf8 types, but most everything else works just fine.

Let us know about the difficulties here and any suggestions or requests for enhancement. I am interested in making the String <> Utf8 situation more efficient and easier to use.

>> One can go farther and use AvroWrapper and o.a.avro.mapred define
>> the M/R jobs enabling a lot of other possibilities. I can't
>> confidently state what all the requirements are here outside of
>> doing the Cascading record <> Avro schema translation and changing
>> all the touch points that Cascading has on the K/V types.
>
> It's pretty much four routines in the scheme:
>
> - sinkInit (setting up the conf properly, for which we're using the
>   AvroJob support)
> - sourceInit (same thing)
>
> - sink (mapping from Tuple to o.a.avro.Generic.GenericData)
> - source (mapping from o.a.avro.Generic.GenericData to Tuple)
>
> The above is all based on the Avro mapred support, so we just have to
> do the translation work for Fields <-> Schema and Tuple <-> GenericData.
>
> It looks pretty doable, thanks for the help!
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g

> We're looking at creating a Cascading Scheme for Avro, and have got
> a few questions below. These are very general, as this is more of a
> scoping phase (as in, are we crazy to try this) so apologies in
> advance for lack of detail.
>
> For context, Cascading is an open source project that provides a
> workflow API on top of Hadoop. The key unit of data is a tuple,
> which corresponds to a record - you have fields (names) and values.
> Cascading uses a generalized "tap" concept for reading & writing
> tuples, where a tap uses a scheme to handle the low-level mapping
> from Cascading-land to/from the storage format.
>
> So the goal here is to define a Cascading Scheme that will run on
> 0.18.3 and later versions of Hadoop, and provide general support for
> reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
>
> We grabbed the recently committed AvroXXX code from
> org.apache.avro.mapred (thanks Doug & Scott), and began building the
> Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
> tuples.

One open issue - it would be great to be able to set metadata in the headers of the resulting Avro files. But it wasn't obvious how to do that, given our (intentionally) arm's-length approach via the use of the Avro mapred code.

One idea would be to have job conf values using keys prefixed with avro.metadata.xxx, and the Avro mapred support could automagically use that when creating the file. But this would break our goal of using unmodified Avro source, so I'm curious whether support for setting the file metadata would also be useful for the standard (Hadoop) use of Avro for an output format, and if so, whether there was a better approach.

Ken Krugler wrote:
> One open issue - it would be great to be able to set metadata in the
> headers of the resulting Avro files. But it wasn't obvious how to do
> that, given our (intentionally) arms-length approach via the use of the
> Avro mapred code.
>
> One idea would be to have job conf values using keys prefixed with
> avro.metadata.xxx, and the Avro mapred support could automagically use
> that when creating the file. But this would break our goal of using
> unmodified Avro source, so I'm curious whether support for setting the
> file metadata would also be useful for the standard (Hadoop) use of Avro
> for an output format, and if so, whether there was a better approach.

Embedding the metadata in the configuration seems like a good approach. Please file a Jira issue for this and attach a patch.

AvroOutputFormat can add properties named avro.mapred.output.metadata.*. We'll have to enumerate all properties in the job and test for this prefix, since Configuration is a HashMap, but the alternative of encoding the metadata map in a single configuration value seems no more attractive.
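The enumerate-and-test-the-prefix step described above could look roughly like the following sketch. It is not Avro or Hadoop code: the configuration is modeled as any Iterable of String key/value entries (Hadoop's Configuration can be iterated that way, and so can a plain Map's entrySet()), and the class and method names are hypothetical. Only the property prefix is taken from this message.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: collect file metadata from prefixed configuration properties. */
class MetadataFromConf {

    /** Prefix proposed in this thread for output metadata properties. */
    static final String PREFIX = "avro.mapred.output.metadata.";

    /** Returns metadata key -> value for every property whose name starts with PREFIX. */
    static Map<String, String> extractMetadata(Iterable<Map.Entry<String, String>> conf) {
        Map<String, String> metadata = new TreeMap<>();
        for (Map.Entry<String, String> property : conf) {
            String key = property.getKey();
            // Skip unrelated properties and a bare prefix with no metadata key after it.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                metadata.put(key.substring(PREFIX.length()), property.getValue());
            }
        }
        return metadata;
    }
}
```

An output format would then copy each extracted entry into the file header before writing records, for example through DataFileWriter's setMeta calls.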

> Ken Krugler wrote:
>> One open issue - it would be great to be able to set metadata in
>> the headers of the resulting Avro files. But it wasn't obvious how
>> to do that, given our (intentionally) arms-length approach via the
>> use of the Avro mapred code.
>>
>> One idea would be to have job conf values using keys prefixed with
>> avro.metadata.xxx, and the Avro mapred support could automagically
>> use that when creating the file. But this would break our goal of
>> using unmodified Avro source, so I'm curious whether support for
>> setting the file metadata would also be useful for the standard
>> (Hadoop) use of Avro for an output format, and if so, whether there
>> was a better approach.
>
> Embedding the metadata in the configuration seems like a good
> approach. Please file a Jira issue for this and attach a patch.
>
> AvroOutputFormat can add properties named
> avro.mapred.output.metadata.*. We'll have to enumerate all
> properties in the job and test for this prefix, since Configuration
> is a HashMap, but the alternative of encoding the metadata map in a
> single configuration value seems no more attractive.
>
> Note that https://issues.apache.org/jira/browse/HADOOP-6420 added
> support for adding maps to configuration, but the extracted map
> cannot be enumerated, so could not be added to the DataFileWriter's
> metadata. Also, this feature is perhaps slated for removal as a part
> of https://issues.apache.org/jira/browse/HADOOP-6698, but its code
> might prove useful as a starting point.

Thanks for the info, we'll work up a patch & file the issue when it's ready.

Two related questions:

1. I'm assuming there's no compelling reason to read the file headers - in fact, not sure how you'd even get at the data, much less how you'd deal with potentially partial/missing data from a set of Avro files being read as part files.

2. We'd like to not include Avro source in the Cascading scheme project, but rather just have a dependency on the Avro jar.

We have a similar relationship between Bixo and Tika, and what's worked well is for the Bixo master branch to have a dependency on the Tika snapshot builds, so we can quickly iterate on both projects.

So are there plans to start pushing Avro snapshot builds to the Apache snapshots repository? I see occasional Avro releases to the Maven central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.

Ken Krugler wrote:
> 1. I'm assuming there's no compelling reason to read the file headers -
> in fact, not sure how you'd even get at the data, much less how you'd
> deal with potentially partial/missing data from a set of Avro files
> being read as part files.

I'm not sure what you're asking here.

> 2. We'd like to not include Avro source in the Cascading scheme project,
> but rather just have a dependency on the Avro jar.
>
> We have a similar relationship between Bixo and Tika, and what's worked
> well is for the Bixo master branch to have a dependency on the Tika
> snapshot builds, so we can quickly iterate on both projects.
>
> So are there plans to start pushing Avro snapshot builds to the Apache
> snapshots repository? I see occasional Avro releases to the Maven
> central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.

I'm okay if someone wants to, e.g., configure a nightly Hudson build that pushes out an Avro snapshot jar. Apache releases should not depend on snapshots, but snapshots are useful for development.

Avro's build.xml already includes a task to post a snapshot jar. I tested it once, which accounts for the single Avro snapshot that exists. So it should be simple to configure Hudson to do this. Philip was going to set up Hudson builds for Avro. Philip?

> Ken Krugler wrote:
>> 1. I'm assuming there's no compelling reason to read the file
>> headers - in fact, not sure how you'd even get at the data, much
>> less how you'd deal with potentially partial/missing data from a
>> set of Avro files being read as part files.
>
> I'm not sure what you're asking here.

Sorry, I should have been clearer.

I was thinking about the read side of things, when using the Cascading Scheme to pull data from Avro files. If these files have metadata, there's no good way to get at it via the Cascading interface, and given that a directory will typically contain a set of part-xxxxx files, it didn't seem like you could do much with the results in any case. So just checking to make sure I wasn't overlooking something.

>> 2. We'd like to not include Avro source in the Cascading scheme
>> project, but rather just have a dependency on the Avro jar.
>>
>> We have a similar relationship between Bixo and Tika, and what's
>> worked well is for the Bixo master branch to have a dependency on
>> the Tika snapshot builds, so we can quickly iterate on both projects.
>>
>> So are there plans to start pushing Avro snapshot builds to the
>> Apache snapshots repository? I see occasional Avro releases to the
>> Maven central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.
>
> I'm okay if someone wants to, e.g., configure a nightly Hudson build
> that pushes out an Avro snapshot jar. Apache releases should not
> depend on snapshots, but snapshots are useful for development.
>
> Avro's build.xml already includes a task to post a snapshot jar. I
> tested it once, which accounts for the single Avro snapshot that
> exists. So it should be simple to configure Hudson to do this.
> Philip was going to setup Hudson builds for Avro. Philip?