hadoop-general mailing list archives

>
>
> For example, in your last MapReduce (MAPREDUCE-980) patch you
>> added avro and paranamer as dependencies.
>>
>
> If I'm not mistaken, that only adds a dependency to the JobTracker. We
> don't create separate classpaths for daemons and user code, but we
> probably should, so that things that only the daemon uses are not also
> placed on the user's classpath.
>
>
+1 to separate classpaths for daemons as an eventual goal. As a user, I've
definitely lost an afternoon to commons-lang version mismatches. If we can
add fewer things to the Task classpath, that's fewer potential future lost
afternoons.
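To make the idea concrete, here is a minimal sketch (not Hadoop's actual launcher code; jar names are illustrative) of building the daemon and task classpaths from a shared core, so daemon-only dependencies like avro and paranamer never reach user code:

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathSketch {
    static String join(List<String> jars) {
        return String.join(":", jars);
    }

    public static void main(String[] args) {
        // Jars everything needs vs. jars only the JobTracker needs.
        List<String> core = List.of("hadoop-core.jar", "commons-logging.jar");
        List<String> daemonOnly = List.of("avro.jar", "paranamer.jar");

        List<String> daemonCp = new ArrayList<>(core);
        daemonCp.addAll(daemonOnly);

        // Tasks see only the core jars; a version conflict with a user jar
        // (e.g. a different commons-lang) can only come from this list.
        System.out.println("daemon classpath: " + join(daemonCp));
        System.out.println("task classpath:   " + join(core));
    }
}
```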
I'm not a PMC member, but I suspect I spend more time doing user-level grunt
work than many PMC members, so from that perspective:
On internal-to-hadoop serialization:
I'm going to spend 99% of my time not caring about these formats and the
other 1% of the time needing to know what's going on with them
*immediately*. Right now I know nothing about protobuf. Learning new
things is always great, but "while my production job is broken and I'm trying
to debug it" isn't really the best time and place for it. JSON, on the
other hand, is human readable and never going to change. I feel a lot
safer with JSON than with any binary format, especially considering that we
could all be using NewHawtUnforeseenLibrary or
IncompatibleWithPreviousReleaseLibrary for our binary serialization in a
couple of years.
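A tiny illustration of the debuggability point, with made-up field names and values: the JSON record below can be read with cat and grep mid-incident, while the same numbers through a binary encoding (plain Java serialization here, standing in for any binary format) are opaque without the matching library and version in hand.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class ReadabilityDemo {
    public static void main(String[] args) throws IOException {
        // JSON form of a hypothetical split descriptor: self-describing text.
        String json = "{\"file\":\"/logs/part-0001\",\"start\":0,\"length\":67108864}";
        System.out.println(json);

        // The same numbers in a binary encoding: unreadable without tooling.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(new long[] {0L, 67108864L});
        }
        System.out.println("binary form: " + buf.size() + " opaque bytes");
    }
}
```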
On packaging serialization lib dependencies:
Again, additional versioned dependencies on the Task classpath scare me, and
that goes double for serialization. I could see a couple ways around it
that fall prey to the inner-framework antipattern, and for what it's worth,
I'd be willing to accept that additional kludginess if it meant that I
wasn't strictly dependent on avro x.x or thrift y.y. What if I'm reading a
file that was encoded with an incompatible version? This gets well beyond the
scope of the immediate issue, but if I could ship my own serialization
library in an assembly jar, and maybe override an additional method or
supply a MapOutputEncoder or something, I'd take that tradeoff over being
bound to a particular version until the next version of Hadoop comes out.
If there were sensible defaults in place, it might not even mean more
complexity for the average job.
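A hypothetical sketch of that pluggable approach. "MapOutputEncoder" is the made-up name from above, not a real Hadoop interface; loading the implementation by class name is meant to mimic how Hadoop resolves user-supplied classes from job configuration, and the identity default shows how sensible defaults keep the average job unchanged.

```java
import java.nio.charset.StandardCharsets;

public class PluggableEncoderSketch {

    // Users could ship any implementation of this in their assembly jar.
    public interface MapOutputEncoder {
        byte[] encode(String record);
        String decode(byte[] bytes);
    }

    // A sensible default, so jobs that don't care need no configuration.
    public static class IdentityEncoder implements MapOutputEncoder {
        public byte[] encode(String record) {
            return record.getBytes(StandardCharsets.UTF_8);
        }
        public String decode(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Reflection stands in for Hadoop's conf.getClass(...) pattern.
    static MapOutputEncoder load(String className) throws Exception {
        return (MapOutputEncoder) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        MapOutputEncoder enc =
                load(PluggableEncoderSketch.class.getName() + "$IdentityEncoder");
        String roundTrip = enc.decode(enc.encode("key\tvalue"));
        System.out.println(roundTrip);
    }
}
```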