We have a production cluster that runs on HBase 0.94.1. The issue we are facing is that whenever one regionserver goes down, the cluster becomes unresponsive until all the regions are allocated to another regionserver(s). The transition takes about 3-5 minutes, and during this time we are unable to do any client operation on the cluster.

Is there any way we can make the transition run in the background?

Also, it is acceptable for us if client operations such as scan or get do not work on the rowkeys of regions in transition. But they are not working on the entire cluster until all the regions are moved out of transition. We can't afford 3-5 minutes of downtime.

How many total RS are in the cluster? You mean you cannot do any operation on other regions in the live cluster? That should not happen. Could it be that the client ops are targeted at the regions which were on the dead RS (and are in transition now)? Can you have a closer look and see? If not, please check the RS threads to see where they are getting blocked.

-Anoop-

On Wed, Jun 5, 2013 at 10:50 PM, kiran <[EMAIL PROTECTED]> wrote:

> Dear All,
>
> We have production cluster that runs on hbase 0.94.1. The issue we are
> facing is whenever one regionserver goes down, the cluster becomes
> unresponsive until all the regions are allocated to another
> regionserver(s). The transition is taking about 3-5 mins and during this
> time we are unable to any do client operation on the cluster.
>
> Is there any way we can make the transition to run in background ?
>
> Also, it is acceptable for us if the client operations such as scan or get
> does not work on the rowkeys of regions in transition. But, they are not
> working on the entire cluster until all the regions are moved out of
> transition. We can't afford 3-5 minutes of downtime.
>
> --
> Thank you
> Kiran Sarvabhotla
>
> -----Even a correct decision is wrong when it is taken late

Also, any chance for you to migrate to 0.94.8? There have been hundreds of fixes since 0.94.1...

JM

2013/6/6 Anoop John <[EMAIL PROTECTED]>:
> How many total RS in the cluster? You mean u can not do any operation on
> other regions in the live clusters? It should not happen.. Is it so
> happening that the client ops are targetted at the regions which were in
> the dead RS( and in transition now)? Can u have a closer look and see?
> If not pls check the RS threads were they are getting blocked.
>
> -Anoop-

It is not possible for us to migrate to a new version immediately.

@Anoop we purposefully brought down one regionserver, and then observed that the website was taking too much time to respond. We observed this pattern for about 5 minutes, until the regions were relocated. We also issued the queries from our website taking care that they didn't fall under the regions on the regionserver we brought down.

Is there any configuration workaround to mitigate it?

Thanks
Kiran

On Thu, Jun 6, 2013 at 8:27 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> Hi Kiran,
>
> Also, any chance for you to migrate to 0.94.8? There have been
> hundreds of fixes since 0.94.1...
>
> JM

What was your test exactly? You killed -9 a region server but kept the datanode alive?
Could you detail the queries you were doing?

On Wed, Jun 12, 2013 at 2:10 PM, kiran <[EMAIL PROTECTED]> wrote:

> It is not possible for us to migrate to new version immediately.
>
> @Anoop we purposefully brought down one regionserver, then we observed the
> website is taking too much time to respond. We observed the pattern for
> about 5 min till the regions are relocated.
> Also we issued queries in our website taking care that the queries did n't
> come under the regions in the regionserver we brought down.
>
> Is there any configuration workaround to mitigate it??
>
> Thanks
> Kiran


Yes, we killed the region server, but the datanode is still running on the node...

Sample test scenario: assume I have a table with pre-splits a up to z (about 26 regions). I purposefully brought down the region server holding the regions with prefixes c and d. Then I used the client API to scan data from regions with prefixes other than c and d. The response was very slow, and sometimes did not come at all.

My doubt is: if only the regions with prefixes c and d are being relocated or are in transition, why is it affecting the regions with other prefixes? Once the region transition is over, the response is very fast, as expected.

It's a simple kill...

The scan uses a start row and a stop row:

    Scan scan = new Scan(Bytes.toBytes("adidas"), Bytes.toBytes("adidas1"));

Our cluster size is 15. The load average I see in the master is 78%... It is not that overloaded, but writes are happening in the cluster...
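
To make the test scenario above easier to check, here is a minimal sketch in plain Java (no HBase dependency) of which regions a start/stop row scan should touch. It is only a toy model of region lookup, not HBase's implementation: the class name RegionRangeCheck, the helper methods, and the region names such as "region-a" are all hypothetical, and it assumes one region per start key a..z and ASCII row keys, so that Java string order matches byte order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Toy model only: regions are keyed by their start key, as in a pre-split table.
class RegionRangeCheck {

    // Builds one hypothetical region per start key "a".."z".
    static TreeMap<String, String> buildRegions() {
        TreeMap<String, String> regions = new TreeMap<>();
        for (char c = 'a'; c <= 'z'; c++) {
            String startKey = String.valueOf(c);
            regions.put(startKey, "region-" + startKey);
        }
        return regions;
    }

    // Returns the regions a scan over [startRow, stopRow) would touch.
    static List<String> regionsForScan(TreeMap<String, String> regions,
                                       String startRow, String stopRow) {
        if (startRow.compareTo(stopRow) >= 0) {
            throw new IllegalArgumentException("startRow must sort before stopRow");
        }
        // The region holding startRow has the greatest start key <= startRow.
        String first = regions.floorKey(startRow);
        if (first == null) {
            throw new IllegalArgumentException("startRow sorts before the first region");
        }
        // Every later region whose start key is below stopRow is touched as well.
        return new ArrayList<>(regions.subMap(first, true, stopRow, false).values());
    }

    // True if the scan touches at least one of the regions that are down.
    static boolean touchesDownRegion(List<String> scanRegions, List<String> downRegions) {
        return !Collections.disjoint(scanRegions, downRegions);
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = buildRegions();
        List<String> down = Arrays.asList("region-c", "region-d");

        List<String> hit = regionsForScan(regions, "adidas", "adidas1");
        System.out.println(hit);                          // prints [region-a]
        System.out.println(touchesDownRegion(hit, down)); // prints false
    }
}
```

Under these assumptions, the scan from "adidas" to "adidas1" stays entirely inside the region starting at "a", so it should not need any region from the killed server.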