I've also noticed something really strange in the job's output. It looks like it's starting over or redoing things.
This was run using all six nodes and no limitations on map or reduce tasks. I haven't seen this behavior in any other case.

Activity

I ran into a similar problem while trying to debug/implement support for compressed input files. I believe that the problem is in StreamXmlRecordReader.java in the fast-match code. When the bytestream doesn't completely match the begin/end pattern, the code does "pos_ += m" to increment the position in the stream by the number of characters matching the pattern... but it forgets the "c" character which didn't match. Ultimately, the code ends up reading significantly past the end of the split, resulting in the same record (at least the first one right after the split) being detected by multiple StreamXmlRecordReader instances.

The fix is simple. The line mentioned above should be: pos_ += m + 1;

I'm planning to submit this and my gzip-support as soon as I can figure out how to submit a patch.

Bo Adler
added a comment - 14/Jun/08 09:48 I ran into a similar problem while trying to debug/implement support for compressed input files. I believe that the problem is in StreamXmlRecordReader.java in the fast-match code. When the bytestream doesn't completely match the begin/end pattern, the code does "pos_ += m" to increment the position in the stream by the number of characters matching the pattern... but it forgets the "c" character which didn't match. Ultimately, the code ends up reading significantly past the end of the split, resulting in the same record (at least the first one right after the split) being detected by multiple StreamXmlRecordReader instances.
The fix is simple. The line mentioned above should be: pos_ += m + 1;
I'm planning to submit this and my gzip-support as soon as I can figure out how to submit a patch.

[ Here is a patch to create a test case that demonstrates the problem. In my personal testing, it was very easy to run into the problem, but I had to tweak the block size to trigger the error in the junit tests.
]

Here is a patch to create a test case that demonstrates the problem. In my personal testing, it was very easy to run into the problem, but I had to tweak the block size to trigger the error in the junit tests.

Bo Adler
added a comment - 01/Jul/08 22:57 Here is a patch to create a test case that demonstrates the problem. In my personal testing, it was very easy to run into the problem, but I had to tweak the block size to trigger the error in the junit tests.
(sorry about the duplicate comment)

Hadoop QA
added a comment - 01/Jul/08 23:06 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385056/0002-patch-for-HADOOP-3484.patch
against trunk revision 673215.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2777/console
This message is automatically generated.

Lohit Vijayarenu
added a comment - 07/Jul/08 22:20 Bo, This looks good.
Few minor changes.
Test depends on perl. In testcase, would it be possible to check perl and terminate if there isnt on system running this, else it might be difficult to find the test failure.
Could we move the testcase to already existing test file TestStreamXmlRecordReader.java instead of creating a new one?

I also found the missing "+1" bug a few days ago. However, there are still duplicates even if the proposed patch (pos_ += m + 1) is applied.

The duplicates happen when the assigned split boundary (end_) is between an end tag and another beginning tag (for example, there are several newlines between two records). In this case, after reading the last record in this split, the pos_ is still smaller than end_. Therefore, the reader will continue trying to find another beginning tag which may actually resides in another split. In this case, both splits will return the record, causing duplicates.

In addition, the pos_ variable does not present the real position in the situation that there are additional bytes between records, even if the proposed patch is applied. This is because pos_ is only updated when trying to find the end tag. If there are several bytes skipped during the process of finding the beginning tag and the pos_ is not updated, the position will be wrong.

My solution should be able to solve the two problems. If you agree with me and think my solution works, feel free to use my code. I willl be glad if the problem gets fixed in the next release.

Wei-Ming Chen
added a comment - 12/Jul/08 05:32 Hi,
I also found the missing "+1" bug a few days ago. However, there are still duplicates even if the proposed patch (pos_ += m + 1) is applied.
The duplicates happen when the assigned split boundary (end_) is between an end tag and another beginning tag (for example, there are several newlines between two records). In this case, after reading the last record in this split, the pos_ is still smaller than end_. Therefore, the reader will continue trying to find another beginning tag which may actually resides in another split. In this case, both splits will return the record, causing duplicates.
In addition, the pos_ variable does not present the real position in the situation that there are additional bytes between records, even if the proposed patch is applied. This is because pos_ is only updated when trying to find the end tag. If there are several bytes skipped during the process of finding the beginning tag and the pos_ is not updated, the position will be wrong.
My solution should be able to solve the two problems. If you agree with me and think my solution works, feel free to use my code. I willl be glad if the problem gets fixed in the next release.
StreamXmlRecordReader.java
boolean fastReadUntilMatch( String textPat, boolean includePat, DataOutputBuffer outBufOrNull) throws IOException {
byte [] cpat = textPat.getBytes( "UTF-8" );
int m = 0;
boolean match = false ;
int msup = cpat.length;
int LL = 120000 * 10;
bin_.mark(LL); // large number to invalidate mark
while ( true ) {
int b = bin_.read();
if (b == -1) break ;
byte c = ( byte ) b; // this assumes eight-bit matching. OK with UTF-8
if (c == cpat[m]) {
m++;
if (m == msup) {
match = true ;
break ;
}
} else {
bin_.mark(LL); // rest mark so we could jump back if we found a match
if (outBufOrNull != null ) {
outBufOrNull.write(cpat, 0, m);
outBufOrNull.write(c);
- pos_ += m;
+ pos_ += m + 1;
+ } else {
+ pos_ += m + 1;
+ if (pos_ >= end_)
+ break ;
}
m = 0;
}
}
if (!includePat && match) {
bin_.reset();
} else if (outBufOrNull != null ) {
outBufOrNull.write(cpat);
pos_ += msup;
}
return match;
}

Unfortunately, I'm working on a paper deadline, so I won't get to this for another week. But I think there's something still wrong here. This new change assumes that we want to stop reading at the split boundary, but that's not true. If the split boundary falls in the middle of a record, then the preceding split needs to go past the boundary to "see" the whole record. Additionally, this only fixes the fast match, but I'm thinking (haven't checked) that a similar issue exists with the slow match... so this check needs to happen in next(), not inside the match routines.

I think that means we should have a few more test cases, to check all these variations. Lohit: I was hoping someone else would take up the charge of doing "the right thing" with my patches. I can certainly take a stab at creating new tests for Wei-Ming's case and merging all these new tests into TestStreamXmlReacord.java, if no one beats me to it.

Bo Adler
added a comment - 14/Jul/08 05:49 Hmm, yeah, that seems plausible.
Unfortunately, I'm working on a paper deadline, so I won't get to this for another week. But I think there's something still wrong here. This new change assumes that we want to stop reading at the split boundary, but that's not true. If the split boundary falls in the middle of a record, then the preceding split needs to go past the boundary to "see" the whole record. Additionally, this only fixes the fast match, but I'm thinking (haven't checked) that a similar issue exists with the slow match... so this check needs to happen in next(), not inside the match routines.
I think that means we should have a few more test cases, to check all these variations. Lohit: I was hoping someone else would take up the charge of doing "the right thing" with my patches. I can certainly take a stab at creating new tests for Wei-Ming's case and merging all these new tests into TestStreamXmlReacord.java, if no one beats me to it.

My solution should stop reading only when trying to find a beginning tag. When it tries to find an end tag, the outBufOrNull will not be null and the boundary-checking code will not be reached. Therefore, the whole record can be returned. However, all these assume outBufOrNull can be used to discern whether it's trying to find a beginning tag or an end tag.

I have not checked the slow match either. However, if the boundary-checking can be done inside the match function, it should help some performance improvement.

Wei-Ming Chen
added a comment - 14/Jul/08 17:51 My solution should stop reading only when trying to find a beginning tag. When it tries to find an end tag, the outBufOrNull will not be null and the boundary-checking code will not be reached. Therefore, the whole record can be returned. However, all these assume outBufOrNull can be used to discern whether it's trying to find a beginning tag or an end tag.
I have not checked the slow match either. However, if the boundary-checking can be done inside the match function, it should help some performance improvement.

Ah, okay, I see what you're doing. I agree that the performance would be slightly faster inside the match function. But I also feel like the code will be harder to maintain, because this "check for split boundary" has to happen in both slow and fast match cases.

On the issue of test cases: I managed to create a test case that demonstrates the bug that Wei-Ming addresses, but I'm having a problem combining the test cases into a single file. When I do that, the tests pass – as if a single jobconf was being used for all the tests. Any ideas on what I might be doing wrong?

Bo Adler
added a comment - 21/Jul/08 03:51 Ah, okay, I see what you're doing. I agree that the performance would be slightly faster inside the match function. But I also feel like the code will be harder to maintain, because this "check for split boundary" has to happen in both slow and fast match cases.
On the issue of test cases: I managed to create a test case that demonstrates the bug that Wei-Ming addresses, but I'm having a problem combining the test cases into a single file. When I do that, the tests pass – as if a single jobconf was being used for all the tests. Any ideas on what I might be doing wrong?

Turns out that the problem is with org.apache.hadoop.fs.FileSystem. There is a cache object inside of it, which remembers the LocalFileSystem that is created; even though I set "-jobconf fs.local.block.size=xx" in subsequent jobs, the original object is kept so the override doesn't happen.

There is a FileSystem.closeAll() method, which seems to erase the cache. I tried calling this at the end of each job, so that the LocalFileSystem is recreated with the new blocksize (I have three blocksizes that I test). This works the first time, but not the second, so that the third blocksize is the same as the second. At this point, I'd say that the easiest way forward is to have a separate file for each test, since I need to change the blocksize to elicit the two errors.

Bo Adler
added a comment - 21/Jul/08 08:04 Turns out that the problem is with org.apache.hadoop.fs.FileSystem. There is a cache object inside of it, which remembers the LocalFileSystem that is created; even though I set "-jobconf fs.local.block.size=xx" in subsequent jobs, the original object is kept so the override doesn't happen.
There is a FileSystem.closeAll() method, which seems to erase the cache. I tried calling this at the end of each job, so that the LocalFileSystem is recreated with the new blocksize (I have three blocksizes that I test). This works the first time, but not the second, so that the third blocksize is the same as the second. At this point, I'd say that the easiest way forward is to have a separate file for each test, since I need to change the blocksize to elicit the two errors.

Here is my latest patch that implements tests demonstrating the problem, and how I think it should be fixed (including the info from Wei-Ming). I did not implement the early-quit discussed above, but it's easy to do for the person doing the actual commit.

The tests are done as separate files, to avoid the FileSystem caching that I mentioned earlier.

Bo Adler
added a comment - 21/Jul/08 09:31 Here is my latest patch that implements tests demonstrating the problem, and how I think it should be fixed (including the info from Wei-Ming). I did not implement the early-quit discussed above, but it's easy to do for the person doing the actual commit.
The tests are done as separate files, to avoid the FileSystem caching that I mentioned earlier.

I've just tried the fix described by Wei-Ching in it's July 11 post to process the wikipedia articles dump and it works perfectly (instead of having maps not ending even tho they are at 1600%). This really should be committed.

Jean-Daniel Cryans
added a comment - 01/Apr/09 22:08 I've just tried the fix described by Wei-Ching in it's July 11 post to process the wikipedia articles dump and it works perfectly (instead of having maps not ending even tho they are at 1600%). This really should be committed.

Tom White
added a comment - 09/Sep/09 12:06 Is the latest patch the one with the desired fix? If so then please mark "Patch Available" to run it through Hudson, otherwise please post up a new patch first. Thanks.

Attaching patch for trunk. This patch is same as the earlier patch HADOOP-3484.try3.patch except fixing TestStreamXmlMultiInner.java to have fs.local.block.size=59 instead of 80 sothat the testcase fails without the fix of this patch.

Ravi Gummadi
added a comment - 18/Jun/10 11:26 Attaching patch for trunk. This patch is same as the earlier patch HADOOP-3484 .try3.patch except fixing TestStreamXmlMultiInner.java to have fs.local.block.size=59 instead of 80 sothat the testcase fails without the fix of this patch.

It was an issue in input to testcases. The perl scripts seem to be depending on the number of occurences of the word "is". My earlier patch changed the input — causing the testcase failure.

Here is the patch for Y! 20S distribution with the correct input to testcases. Both testcases passed on my local machine in Y! 20S with patch and fail without the fix of this patch.
----------------------

Same patch has an issue in trunk. Number of splits seem to be 1 in TestStreamXmlMultiInner.java even though fs.local.block.size=59. Will investigate.

Ravi Gummadi
added a comment - 21/Jun/10 12:41 It was an issue in input to testcases. The perl scripts seem to be depending on the number of occurences of the word "is". My earlier patch changed the input — causing the testcase failure.
Here is the patch for Y! 20S distribution with the correct input to testcases. Both testcases passed on my local machine in Y! 20S with patch and fail without the fix of this patch.
----------------------
Same patch has an issue in trunk. Number of splits seem to be 1 in TestStreamXmlMultiInner.java even though fs.local.block.size=59. Will investigate.

In trunk, testcases were not picking the block size because in TestStreaming(the base class of the 2 tests of this patch) is creating input file by creating FileSystem object. As we were setting the config fs.local.block.size later, it is not effective for the FileSystem — causing single split in both tests.

Ravi Gummadi
added a comment - 22/Jun/10 08:57 In trunk, testcases were not picking the block size because in TestStreaming(the base class of the 2 tests of this patch) is creating input file by creating FileSystem object. As we were setting the config fs.local.block.size later, it is not effective for the FileSystem — causing single split in both tests.

Ravi Gummadi
added a comment - 22/Jun/10 09:09 Attaching patch for trunk by fixing the testcases so that the configuration used when FileSystem object is created will have fs.local.block.size set to the proper value needed.
Both testcases fail without the fix of the patch and pass with the fix.

Some comments on testcases:
1. Can you put both the tests into single file?
2. Both the tests look similar except changes in input and some configuration parameters. Can you re-use the code in tests by passing them as parameters?
3. Minor: Can you put perl script for mapper and reducer in String and use it?
4. Minor: In UtilsForTest, change LOG(..., StringUtils.StringifyException(e)) to LOG(..., e).

Amareshwari Sriramadasu
added a comment - 24/Jun/10 07:44 Some comments on testcases:
1. Can you put both the tests into single file?
2. Both the tests look similar except changes in input and some configuration parameters. Can you re-use the code in tests by passing them as parameters?
3. Minor: Can you put perl script for mapper and reducer in String and use it?
4. Minor: In UtilsForTest, change LOG(..., StringUtils.StringifyException(e)) to LOG(..., e).

Ravi Gummadi
added a comment - 29/Jun/10 11:53 After merging, the file system block size is not updated properly. So adding FileSystem.closeAll() call at the begimning of each test case. Will upload a patch soon.

Attaching new patch fixing the issue with FileSystem block size setting using FileSystem.closeAll() in the test cases. With this, the block size is properly set to 60, 80, 60 and 80 in the 4 test cases respectively.

Ravi Gummadi
added a comment - 30/Jun/10 07:18 Attaching new patch fixing the issue with FileSystem block size setting using FileSystem.closeAll() in the test cases. With this, the block size is properly set to 60, 80, 60 and 80 in the 4 test cases respectively.

Ming Jin
added a comment - 04/Sep/12 09:58 Hi everyone,
I found the exact same issue in Hadoop v1.0.3( http://fossies.org/dox/hadoop-1.0.3/StreamXmlRecordReader_8java_source.html ).
Is there any plan to fix it in v1.0.3?