I've dealt with dozens of spontaneous shutdowns in recent weeks. (We call them Region Server
Suicides.)
The files problem is where the OS (i.e. Linux) limits the number of files a user can open
at one time. A common default of 1024 isn't enough for HBase. Based purely on empirical
evidence, you would have failed earlier than 100 million rows if your problem were the number
of files for the hbase user. You can also run into the same problem for the hadoop user,
but the number-of-files issue shows up earlier for the hbase user. It is probably best to raise
both limits at the same time.
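
A quick way to see the limit a user actually gets is to ask the OS from a process running as that
user. The sketch below is only illustrative: the REQUIRED_NOFILE value is my own assumption, not
a number from this thread, so substitute whatever your cluster needs. Run it once as the hbase
user and once as the hadoop user.

```python
import resource

# Assumed target, not taken from this thread: choose a value that suits your cluster.
REQUIRED_NOFILE = 32768


def nofile_is_sufficient(soft_limit, required=REQUIRED_NOFILE):
    """Return True if the soft open-file limit meets the required value."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


if __name__ == "__main__":
    # getrlimit returns the (soft, hard) limits for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    if not nofile_is_sufficient(soft):
        print("open-file limit looks too low for this user")
```

With the common default of 1024 the check fails, which is the situation described above.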
While not having enough files is definitely a gotcha, there are a few other things to look
out for as well.
Debugging: One misleading aspect of tracking down this kind of problem is that most of the
messages that show up when you experience it are actually a side effect of something that
happened earlier. You've probably realized this, since you've searched over a long period
of time in your logs.
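
Since the interesting message is usually the earliest one, it can help to scan a region server
log for the first WARN/ERROR/FATAL line at or before the time the server died. This is a rough
sketch, assuming the log lines start with a timestamp and level in the same layout as the lines
quoted below and are in chronological order. The function name is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2010-09-12 13:57:46,170 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: ...
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (\w+) ")


def first_problem_before(lines, died_at, levels=("WARN", "ERROR", "FATAL")):
    """Return the earliest line at one of `levels` logged at or before `died_at`.

    Lines are assumed to be in chronological order. Lines that do not start
    with a timestamp (stack trace continuations, for example) are skipped.
    Returns None if no matching line is found.
    """
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        logged_at = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if logged_at > died_at:
            break
        if match.group(2) in levels:
            return line.rstrip("\n")
    return None
```

Pass it an open log file and the approximate time of death, then read the surrounding context
of whatever line it returns.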
Other things to consider:
* The most common reason I've had for Region Server Suicide is ZooKeeper. The region server
thinks ZooKeeper is down. I thought this had to do with heavy load, but this also happens
for me even when there is nothing running. I haven't been able to find a quantifiable cause.
This is just a weakness that exists in the HBase-ZooKeeper dependency. Higher loads exacerbate
the problem, but are not required for a Region Server Suicide event to occur.
* Another reason is the HDFS dependency... if a file is perhaps temporarily unavailable for
any reason, HBase handles this situation with Region Server Suicide.
HBase is a powerful tool that allows us to do more with less, but it is currently somewhat
brittle with respect to its dependencies. Suicide is the standard response to any hiccup
with them. Hopefully the response will become less "final" as HBase becomes more robust.
Perhaps if there were a setting, whether or not a region server is allowed to commit suicide,
some of us would feel more comfortable with the idea.
In the meantime, you can try to work around any of these issues by using bigger hardware
than you would otherwise think is needed and not letting the load get very high. For example,
I tend to have these kinds of problems much less often when the load on any individual machine
never goes above the number of cores.
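
The load-versus-cores rule of thumb is easy to check on each machine. This is a minimal sketch
that compares the 1-minute load average with the core count. Using the 1-minute figure is my own
choice here, and os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_within_cores(load_1min, cores):
    """Return True if the load average does not exceed the number of cores."""
    return load_1min <= cores


if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()  # (1 min, 5 min, 15 min) averages
    cores = os.cpu_count() or 1        # cpu_count() can return None
    print("load:", load_1min, "cores:", cores)
    if not load_within_cores(load_1min, cores):
        print("load is above the number of cores on this machine")
```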
I also recommend sticking to the latest version available.
FYI,
Matthew
On Sep 13, 2010, at 7:20 PM, Jean-Daniel Cryans wrote:
> Can we see the actual line of when it died, with a lot of context and
> please in a pastebin.com
>
> Also, most of the time users get this kind of error because they
> didn't configure HBase and Hadoop properly, mostly the last
> requirement: http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#requirements
>
> J-D
>
> On Mon, Sep 13, 2010 at 7:08 PM, ZhouShuaifeng 00100568
> <zhoushuaifeng@huawei.com> wrote:
>> Hi All,
>>
>> I encounted some problem when doing putting data test on hbase. Please help. Thanks a lot.
>>
>> After putting about millions of rows, the 2 region servers of 3 were stopped.
>> Server 1 stopped when putting about 50 million rows.
>> Server 2 stopped when putting about 100 million rows.
>>
>> Some exceptions are throwed.
>> The client exception info is below:
>> org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in .META. for region xxx.
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:833)
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:677)
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfPuts(HConnectionManager.java:1419)
>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:664)
>> at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:549)
>> at org.apache.hadoop.hbase.client.HTable.put(HTable.java:535)
>>
>> The server exception is below:
>> org.apache.hadoop.hbase.NotServingRegionException: xxx. is closed
>> at org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:2122)
>> at org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:2211)
>> at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1493)
>> at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>
>> region server 1 stoped at about 13:59
>> 2010-09-12 13:59:01,017 INFO org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 1 dead, average load 99.5[md-prod04,60020,1284169880500]
>>
>> the last 2 logs of this regionserver before it stoped is:
>> 2010-09-12 13:57:46,170 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Caches flushed, doing commit now (which includes update scanners)
>> 2010-09-12 13:57:46,172 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~21.0m for region percontent_hr,2000-01-03#http#001#url18#s#states44#0,1284322397591.8a557b61c9eb4b117368051b98e8d1d1. in 290ms, sequence id=155719842, compaction requested=false
>>
>> region server 2 stoped at about 19:36:
>> 2010-09-12 19:37:01,104 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 1 dead, average load 356.0[md-prod01,60020,1284169861364]
>>
>> the last logs of this regionserver before it stoped is:
>> 2010-09-12 19:36:02,398 INFO org.apache.hadoop.hbase.regionserver.Store: Completed compaction of 3 file(s) in visitors of xxx.; new storefile is hdfs://xxx; store size is 201.3m
>> 2010-09-12 19:36:02,398 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region xxx. in 9sec
>> 2010-09-12 19:36:04,604 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started. Attempting to free 20845272 bytes
>> 2010-09-12 19:36:04,609 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed. Freed 19920968 bytes. Priority Sizes: Single=43.940674MB (46075136), Multi=74.515015MB (78134656),Memory=49.535378MB (51941608)
>>