I think the cluster is ok without running hbck, as restarting theregionserver process stops the warnings and everything looks ok otherwise.

here's the regionserver right after the split happens:------------------------2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:regionserver:60020-0x13ab46479832953 Attempting to transition nodebc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R**EGION_SPLITTING to RS_ZK_REGION_SPLIT*2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:regionserver:60020-0x13ab46479832953 Successfully transitioned nodebc62a8a72124a4ba3f6b9f302587903c from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT2012-11-01 22:45:28,730 DEBUGorg.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on themaster to process the split for bc62a8a72124a4ba3f6b9f302587903c2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:regionserver:60020-0x13ab46479832953 Attempting to transition nodebc62a8a72124a4ba3f6b9f302587903c from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:regionserver:60020-0x13ab46479832953 Successfully transitioned nodebc62a8a72124a4ba3f6b9f302587903c from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT-----------------------------

The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"continues for 15 or so hours and finally settles without manualintervention with these regionserver log messages:-----------------------2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:regionserver:60020-0x13ab46479832953 Attempting to transition node *bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT toRS_ZK_REGION_SPLIT

And the master finally logs this:-----------------------2012-11-02 13:55:00,783 WARNorg.apache.hadoop.hbase.master.AssignmentManager: Regionbc62a8a72124a4ba3f6b9f302587903c not found on serverHadoopNode162.hotpads.srv,60020,1351788248279; failed processing

2012-11-02 13:55:00,783 WARNorg.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for regionbc62a8a72124a4ba3f6b9f302587903c from serverHadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore,probably already processed its split------------------------

I can't find any evidence that the region left the node during that time.I'll have to catch it in action next time and see what the region is up toduring the problem period.

What does it mean that it successfully transitioned from SPLIT to SPLIT?Is that a valid transition?

Strangely, I don't see any record of that region in the master before whatI already pasted even though I have logs back to 10/30. Next time ithappens I'll gather a full log record and try to debug while it's occurring.On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <[EMAIL PROTECTED]> wrote:

Forgot to answer Ted's question - I don't remember exactly what we did tothe cluster. I don't think there were any regionserver crashes, but therewere probably controlled restarts, new servers added to the runningcluster, and some regions moved by our manual balancer job.On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:

What I'd like to see is the log from the first time the masterreceives the split message. I guess it says the region doesn't existanymore because the split was processed already in the master butthere's a failure mode similar to 4792.

I saw this on another cluster last week but the logs were rollingbased on size so the original split message was lost.

I don't see that region in the master logs for 4 days back. I'm going toimport some new data from scratch soon and will be sure to keep all themaster logs.On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote:

I'm importing data into another table now. I disabled/enabled the tablefirst to make sure we have the original 4 region locations loggedeverywhere. Will report back...On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote:

It happened in this new table that I have all the logs for. The region inquestion this time is 6839663e4f8f79be3d7469784c21cbc2, and the first traceof this region is on the regionserver with the "Intantiated tableName..."message

I also know this region came from a recent split, but I can't find any logmessages show the parent finishing the split that created this daughterregion. So my guess now is that the split is actually finishing andletting clients continue to write data, but something is failing to printthe log line and correctly tell the master about the new region.

I've noticed that these regions are showing up on the clients in calls toHTable.getRegionLocations(), so the clients continue to function, but if Icall HBaseAdmin.move() i get an UnknownRegionException.On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

It happened in this new table that I have all the logs for. The region inquestion this time is 6839663e4f8f79be3d7469784c21cbc2, and the first traceof this region is on the regionserver with the "Intantiated tableName..."message

I also know this region came from a recent split, but I can't find any logmessages show the parent finishing the split that created this daughterregion. So my guess now is that the split is actually finishing andletting clients continue to write data, but something is failing to printthe log line and correctly tell the master about the new region.

I've noticed that these regions are showing up on the clients in calls toHTable.getRegionLocations(), so the clients continue to function, but if Icall HBaseAdmin.move() i get an UnknownRegionException.On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

Yeah - we were running .94.0 since it came out but never saw it there. I'll keep trying to narrow it down. The only harm it's causing is logspam and failing to move daughters to a new regionserver, which aredefinitely problems, but it's not bringing down the cluster.On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> So it seems you can repeat this to some extend in 0.94.2, but you have> never seen this before?>>> -- Lars>>>> ________________________________> From: Matt Corgan <[EMAIL PROTECTED]>> To: dev <[EMAIL PROTECTED]>> Sent: Monday, November 5, 2012 9:10 PM> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2>> It happened in this new table that I have all the logs for. The region in> question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace> of this region is on the regionserver with the "Intantiated tableName..."> message>> 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:> Instantiated>> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74> 9784c21cbc2.>> I also know this region came from a recent split, but I can't find any log> messages show the parent finishing the split that created this daughter> region. So my guess now is that the split is actually finishing and> letting clients continue to write data, but something is failing to print> the log line and correctly tell the master about the new region.>> I've noticed that these regions are showing up on the clients in calls to> HTable.getRegionLocations(), so the clients continue to function, but if I> call HBaseAdmin.move() i get an UnknownRegionException.>>> On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:>> > I think what Matt encountered was this:> > https://issues.apache.org/jira/browse/HBASE-7101> >> > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:> >> > > Beats me - i can't find any record of it before that.> > >> > > I'm importing data into another table now. I disabled/enabled the> table> > > first to make sure we have the original 4 region locations logged> > > everywhere. Will report back...> > >> > >> > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <> [EMAIL PROTECTED]> > > >wrote:> > >> > > > Wow, and can you figure how this happened?> > > >> > > > 2012-11-05 00:24:54,538 DEBUG> > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated> > > >> > > >> > >> >> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.> > > >> > > > Split? Else how did the master told that RS to open a region that> > doesn't> > > > exist?> > > >> > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]>> > wrote:> > > > > Here's one from last night with the master logs at the bottom.> > > > > http://pastebin.com/cSdMbA2a> > > > >> > > > > I don't see that region in the master logs for 4 days back. I'm> > going> > > to> > > > > import some new data from scratch soon and will be sure to keep all> > the> > > > > master logs.> > > > >> > > > >> > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <> > > [EMAIL PROTECTED]> > > > >wrote:> > > > >> > > > >> This reminds me a lot of> > > > https://issues.apache.org/jira/browse/HBASE-4792> > > > >>> > > > >> What I'd like to see is the log from the first time the master> > > > >> receives the split message. I guess it says the region doesn't> exist> > > > >> anymore because the split was processed already in the master but> > > > >> there's a failure mode similar to 4792.> > > > >>> > > > >> I saw this on another cluster last week but the logs were rolling> > > > >> based on size so the original split message was lost.> > > > >>> > > > >> J-D> > > > >>> > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]>

Yeah - we were running .94.0 since it came out but never saw it there.I'll keep trying to narrow it down. The only harm it's causing is logspam and failing to move daughters to a new regionserver, which aredefinitely problems, but it's not bringing down the cluster.On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> So it seems you can repeat this to some extend in 0.94.2, but you have> never seen this before?>>> -- Lars>>>> ________________________________> From: Matt Corgan <[EMAIL PROTECTED]>> To: dev <[EMAIL PROTECTED]>> Sent: Monday, November 5, 2012 9:10 PM> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2>> It happened in this new table that I have all the logs for. The region in> question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace> of this region is on the regionserver with the "Intantiated tableName..."> message>> 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:> Instantiated>> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74> 9784c21cbc2.>> I also know this region came from a recent split, but I can't find any log> messages show the parent finishing the split that created this daughter> region. So my guess now is that the split is actually finishing and> letting clients continue to write data, but something is failing to print> the log line and correctly tell the master about the new region.>> I've noticed that these regions are showing up on the clients in calls to> HTable.getRegionLocations(), so the clients continue to function, but if I> call HBaseAdmin.move() i get an UnknownRegionException.>>> On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:>> > I think what Matt encountered was this:> > https://issues.apache.org/jira/browse/HBASE-7101> >> > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:> >> > > Beats me - i can't find any record of it before that.> > >> > > I'm importing data into another table now. I disabled/enabled the> table> > > first to make sure we have the original 4 region locations logged> > > everywhere. Will report back...> > >> > >> > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <> [EMAIL PROTECTED]> > > >wrote:> > >> > > > Wow, and can you figure how this happened?> > > >> > > > 2012-11-05 00:24:54,538 DEBUG> > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated> > > >> > > >> > >> >> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.> > > >> > > > Split? Else how did the master told that RS to open a region that> > doesn't> > > > exist?> > > >> > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]>> > wrote:> > > > > Here's one from last night with the master logs at the bottom.> > > > > http://pastebin.com/cSdMbA2a> > > > >> > > > > I don't see that region in the master logs for 4 days back. I'm> > going> > > to> > > > > import some new data from scratch soon and will be sure to keep all> > the> > > > > master logs.> > > > >> > > > >> > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <> > > [EMAIL PROTECTED]> > > > >wrote:> > > > >> > > > >> This reminds me a lot of> > > > https://issues.apache.org/jira/browse/HBASE-4792> > > > >>> > > > >> What I'd like to see is the log from the first time the master

Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on theregionserver in question (node 169), so i'm guessing that's22f8fa73d8af837410ca270f344f6bb8's mommy.

btw - you can see my balancer kick in 45 seconds later (runs every 10minutes) here, but so far i think that's coincidence:2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:BalanceSwitch=false

I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169and found all this stuff about a failed split: http://pastebin.com/xtXMZ388 andan attempted rollback. Looks like it errors out when it goes to put a nodein ZK but it's already there. I'm not familiar with what a good split loglooks like, so i'll stop commenting for now...On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <[EMAIL PROTECTED]> wrote:

The following steps happen-> Initially the parent region P1 starts splitting.-> The split is going on normally.-> Another split starts at the same time for the same region P1. (Not surewhy this started).-> Rollback happens seeing an already existing node.-> This node gets deleted in rollback and nodeDeleted Event starts.-> In nodeDeleted event the RIT for the region P1 gets deleted.-> Because of this there is no region in RIT.-> Now the first split gets over. Here the problem is we try to transitthe node to SPLITTING to SPLIT. But the node even does not exist.But we don take any action on this. We think it is successful.-> Even before HBASE-6854 this could have happened. Will file a JIRA forthe same.

I dug back further to find the origin of e11e8b030897d6e5b973f8fe892e0eb2to see if it had previous problems that left ZK in a bad state. Here's theregionserver and master logs: http://pastebin.com/qcvHjNCg from about 2hours earlier.

* Nov 4, 22:34: region is created as daugher b of a split on node 159* Nov 4, 22:35: moved to from node 159 to 169 by HBaseAdmin.move()* Nov 5, 00:24: node 169 tries to split the region but gets Failed createof ephemeral /hbase/unassigned/e11e8b030897d6e5b973f8fe892e0eb2

Is it possible that if something calls HBaseAdmin.move() on a daugherregion that is 30 seconds old, it could move the region but leave that ZKnode in a bad state?