define and implement mapreduce connector protocol

Details

Description

Avro should provide Hadoop Mapper and Reducer implementations that connect to a subprocess in another programming language, transmitting raw binary values to and from that process. This should be modeled after Hadoop Pipes. It would let one easily write efficient MapReduce programs in non-Java languages that process Avro-format data.
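For illustration only, the raw binary hand-off described above can be sketched in plain Java. This is a hypothetical sketch, not the actual tether protocol: it uses Unix `cat` as a stand-in for a non-Java child process and simple length-prefix framing, with none of the real configure/RPC machinery.

```java
import java.io.*;

// Hypothetical sketch: the parent writes a length-prefixed binary record to a
// child process's stdin and reads a record back from its stdout. Unix `cat`
// stands in for a non-Java mapper/reducer, so the record comes back unchanged.
public class BinaryPipeSketch {
    // Send one length-prefixed record to the child and read one back.
    static byte[] roundTrip(byte[] record) throws IOException, InterruptedException {
        Process child = new ProcessBuilder("cat").start();
        DataOutputStream toChild = new DataOutputStream(child.getOutputStream());
        DataInputStream fromChild = new DataInputStream(child.getInputStream());

        toChild.writeInt(record.length);   // simple length-prefix framing
        toChild.write(record);
        toChild.close();                   // send EOF so `cat` can finish

        int len = fromChild.readInt();     // `cat` echoes the prefix back too
        byte[] echoed = new byte[len];
        fromChild.readFully(echoed);
        child.waitFor();
        return echoed;
    }

    public static void main(String[] args) throws Exception {
        byte[] out = roundTrip("avro-binary-record".getBytes("UTF-8"));
        System.out.println(new String(out, "UTF-8"));
    }
}
```

This assumes a Unix-like system where `cat` is on the PATH.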


Owen O'Malley added a comment - 22/Apr/10 04:23

I think it is a very bad idea to make Avro depend on MapReduce. At the very least, please put this code into a separate jar rather than the main Avro jar, so that MapReduce doesn't depend on a jar that depends on it.

Why not build this as a library on top of MapReduce? It shouldn't be bundled with Avro...


Scott Carey added a comment - 22/Apr/10 05:12 - edited

I agree that Avro should not require MapReduce – specifically, the Maven POM should not cause consumers to pull MapReduce by default.

But I think we already prevent that. The POM generated by the build marks hadoop-core as "optional", meaning downstream projects that consume Avro won't automatically pull the Hadoop jar. Another option with a similar effect is to give the dependency "provided" scope instead of "compile", which makes the jar available for build and test but does not bundle it. This is probably preferable for MapReduce. If users want to use those APIs, they have to obtain their own copy of the hadoop-core jar or declare the dependency themselves.

Putting the code in Hadoop is probably a problem, unless we want to release new versions of 0.18, 0.19, 0.20, etc. Placing it in Hadoop means that changes to Avro's lower-level APIs will break compatibility with the version in Hadoop. Honestly, some of those APIs are going to keep evolving, and dot releases of Avro can break these APIs (but not encoded formats). Until these APIs are more locked down, it is better to keep packages like this in the Avro project.
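For reference, the two scoping options above might look like this in the POM. This is a sketch: `${hadoop.version}` is a placeholder, and only one of the two declarations would actually be used.

```xml
<!-- Option 1: mark hadoop-core optional - consumers must declare it
     themselves to use the MapReduce-facing APIs -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>${hadoop.version}</version>
  <optional>true</optional>
</dependency>

<!-- Option 2: "provided" scope - on the classpath for build and test,
     but never bundled or propagated transitively -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>
```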
-----------
Going slightly off topic now:
A few other libraries Avro bundles have similar issues – optional side features should be marked "provided" or "optional" in the Maven POM. Or the project needs to be split into several jars:
avro-core
-> avro-genavro
-> avro-protocol
-> avro-mapred
-> avro-reflect
probably covers the main dependency chunks. avro-core can get away with only Jackson, SLF4J, and commons-lang, I think – meaning the generic and specific APIs, file formats, etc. would still work.

Doug Cutting added a comment - 14/Jun/10 18:44

I'm going to commit this later today, so that we can start trying to implement it in other languages. I've updated the documentation to note that this is an experimental feature, subject to change.

Jeremy Lewi added a comment - 28/Jun/11 15:13

I think a deadlock can occur if the subprocess fails to start (e.g. if the executable is specified incorrectly). This happens because the constructor for TetheredProcess starts the subprocess and then calls outputService.inputPort(). But inputPort() blocks until the child process sends a configure message to the parent; if the child process was never started, the parent deadlocks.

At a minimum we could check that the subprocess hasn't exited yet. This probably won't prevent all possible deadlocks, but it might help.

Below is some code for checking whether the process has exited.

// Is there a better way to check if the process has exited than the roundabout way below?
boolean hasExited = false;
try {
    // exitValue() throws an exception if the process hasn't exited
    this.subprocess.exitValue();
    hasExited = true;
} catch (IllegalThreadStateException e) {
    // it hasn't exited yet
    hasExited = false;
}
if (hasExited) {
    // What's the best way to log this?
    System.out.println("Error: Could not start subprocess");
    throw new RuntimeException("Error: Could not start subprocess");
}