That sounds about right, Josh. Peter, in our internal testing we have seen this test fail,
and increasing the timeouts (look at the timeout-related options in the test code) helped
considerably.
________________________________________
From: Josh Elser <josh.elser@gmail.com>
Sent: Wednesday, June 14, 2017 3:17 PM
To: dev@hbase.apache.org
Subject: Re: Problem with IntegrationTestRegionReplicaReplication
On 6/14/17 3:53 AM, Peter Somogyi wrote:
> Hi,
>
> As one of my first tasks with HBase I started to look into
> why IntegrationTestRegionReplicaReplication fails. I would like to get some
> suggestions from you.
>
> I noticed that when I run the test using a normal cluster or a minicluster I get the
> same error messages: "Error checking data for key [null], no data
> returned". I looked into the code and here are my conclusions.
>
> There are multiple threads writing data in parallel, which is read by multiple
> reader threads simultaneously. Each writer gets a portion of the keys to
> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> The reader threads get the elements (e.g. key=1000) from the queue and
> these reader threads assume that all the keys up to this are already in the
> database. Since we're using multiple writers it can happen that another
> thread has not yet written key=500 and verifying these keys will cause the
> test failure.
>
> Do you think my assumption is correct?
Hi Peter,
No, if memory serves, this is not correct. Readers are not made aware
of keys to verify until the write has occurred and some delay has passed.
The delay is used to provide enough time for the internal region
replication to take effect.
So: primary-write, pause, [region replication happens in background],
add updated key to read queue, reader gets key from queue and verifies
the value on a replica.
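
To make that sequence concrete, here is a minimal, self-contained sketch of
the idea. It is not the test's actual code: it uses the JDK's
java.util.concurrent.DelayQueue as a stand-in for the ConstantDelayQueue
mentioned above, two in-memory maps as stand-ins for the primary and the
replica, and a scheduled task as a stand-in for region replication. The
names DelayedVerifySketch, DelayedKey, queueDelayMillis and
replicationLagMillis are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DelayedVerifySketch {

    /** A written key that readers can only dequeue after a fixed delay. */
    static final class DelayedKey implements Delayed {
        final long key;
        final long readyAtNanos;

        DelayedKey(long key, long delayMillis) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    /**
     * Writes numKeys keys to the "primary", replicates each one to the
     * "replica" after replicationLagMillis, and lets the reader see each key
     * only after queueDelayMillis. Returns how many keys the reader found on
     * the replica with the expected value.
     */
    static int run(int numKeys, long queueDelayMillis, long replicationLagMillis) {
        Map<Long, String> primary = new ConcurrentHashMap<>();
        Map<Long, String> replica = new ConcurrentHashMap<>();
        DelayQueue<DelayedKey> readQueue = new DelayQueue<>();
        ScheduledExecutorService replicator = Executors.newSingleThreadScheduledExecutor();
        try {
            for (long k = 0; k < numKeys; k++) {
                final long key = k;
                final String value = "value-" + key;
                primary.put(key, value);                         // primary-write
                replicator.schedule(() -> { replica.put(key, value); },
                        replicationLagMillis, TimeUnit.MILLISECONDS); // background replication
                readQueue.put(new DelayedKey(key, queueDelayMillis)); // hidden until the delay expires
            }
            int verified = 0;
            for (int i = 0; i < numKeys; i++) {
                DelayedKey next = readQueue.take();              // blocks until a key's delay has passed
                if (("value-" + next.key).equals(replica.get(next.key))) {
                    verified++;                                  // replica had the expected value
                }
            }
            return verified;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a key", e);
        } finally {
            replicator.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("verified " + run(5, 300, 10) + " of 5 keys");
    }
}
```

In this sketch the reader never learns about a key before the writer has
enqueued it, and it cannot dequeue the key before the delay has elapsed. If
queueDelayMillis is smaller than replicationLagMillis, the reader can reach
the replica before the value does, which corresponds to the timing case
described below.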
The primary should always have seen the new value for a key. If the test
is showing that a replica does not see the result, it's either a timing
issue (you need to give a larger delay for HBase to perform the region
replication) or a bug in the region replication framework itself. That
said, if you can show that you are seeing what you describe, that sounds
like the test framework itself is broken :)