This is an issue I hit while using hadoop-0.20.2, and it took me a week to
track down. I'm writing it up so that other people might benefit, or perhaps
so that someone can fix the bug.
Note that there are actually two issues involved here, only one of which is
related to Linux pipes.
I had some 20,000 bzip2-compressed text files (about 2 MB each on average),
and my jobs kept dying at the reduce stage with
"java.lang.OutOfMemoryError: GC overhead limit exceeded" near the end. This
is the first issue. It turned out I could work around it by combining the
small files into fewer, larger files with the same overall content.
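Combining the files attacks the per-file overhead directly. For completeness: in hadoop-0.20 the heap available to map and reduce tasks is controlled by mapred.child.java.opts, so a larger heap is another common mitigation for this error. A sketch of the mapred-site.xml entry (the 2048m value is an assumption; tune it to your cluster):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```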
The second issue is what took a week to track down. To combine the smaller
files, I used bzcat, bzip2, and append operations, all through Linux pipes
and redirection. The full script is:
#!/bin/bash
# Combine groups of small bzip2 files into larger ones.
f=$1
root=MYROOT
icounter=0
fcounter=0
tfile=$(mktemp)
for subf in "${root}/${f}"*
do
    # Once a group of 3 files has accumulated, compress and flush it.
    if [ "${icounter}" -ge 3 ]; then
        fcounter=$((fcounter+1))
        sleep 2
        cat "$tfile" | bzip2 > "input/${f}_${fcounter}.bz2"
        rm "$tfile"
        tfile=$(mktemp)
        icounter=0
    fi
    bzcat "${subf}" >> "$tfile"
    echo >> "$tfile"
    icounter=$((icounter+1))
done
# Flush the final, possibly partial, group.
if [ -s "$tfile" ]; then
    fcounter=$((fcounter+1))
    cat "$tfile" | bzip2 > "input/${f}_${fcounter}.bz2"
fi
rm -f "$tfile"
I discovered I had a problem because the token counts were far smaller than
I expected. Running the same token count on a smaller subset of the input
without Hadoop also produced much larger counts than the Hadoop job did.
After various dead ends (e.g. removing non-ASCII characters, removing
backslashes and potential escape characters), I discovered it was the append
operation (i.e. >>) that was causing problems; once I rewrote the
file-combination task in Python, the issue disappeared.
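The post doesn't include the Python version of the combination step. A minimal sketch of the same logic (the group size of 3, the output naming scheme, and the blank line between source files all mirror the shell script above; the function and parameter names are my own):

```python
import bz2
import glob
import os

def combine(root, prefix, out_dir="input", group_size=3):
    """Decompress groups of small .bz2 files and recompress each group
    as one larger .bz2 file. Each output is written as a single bzip2
    stream, with a newline after each source file (mirroring the
    shell script's `echo`)."""
    files = sorted(glob.glob(os.path.join(root, prefix + "*")))
    fcounter = 0
    for i in range(0, len(files), group_size):
        fcounter += 1
        out_path = os.path.join(out_dir, "%s_%d.bz2" % (prefix, fcounter))
        with bz2.BZ2File(out_path, "w") as out:
            for path in files[i:i + group_size]:
                with bz2.BZ2File(path, "r") as src:
                    out.write(src.read())
                out.write(b"\n")
    return fcounter
```

Because each output file is written through a single BZ2File handle, the result is one bzip2 stream per file, with no shell-level appends involved.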
Note that Hadoop did not emit any error messages and finished normally on
the pipe-appended input; it simply ignored a significant portion of the
input without telling me about it. Also, the pipe-appended input did not
cause this kind of problem when I ran word counts on it without Hadoop.
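The only way I caught this was by comparing counts by hand. One way to make that check routine is to compute an independent token count over the exact input files and compare it against the job's own counters. A small sketch (the function name and whitespace tokenization are assumptions; use whatever tokenization your job uses):

```python
import bz2
import glob
import os

def local_token_count(input_dir, pattern="*.bz2"):
    """Decompress every .bz2 input file and count whitespace-separated
    tokens, for comparison against the Hadoop job's reported counters.
    A large mismatch indicates input is being silently dropped."""
    total = 0
    for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
        with bz2.BZ2File(path, "r") as f:
            total += len(f.read().split())
    return total
```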