Details

Description

This would add a new boolean config option: hfile.block.cache.datablocks
Default would be true.

Setting this to false allows HBase in a mode where only index blocks are cached, which is useful for analytical scenarios where a useful working set of the data cannot be expected to fit into the (aggregate) cache.
This is the equivalent of setting cacheBlocks to false on all scans (including scans on behalf of gets).

I would like to get a general feeling about what folks think about this.
The change itself would be simple.

Update (Mikhail): we probably don't need a new conf option. Instead, we will make index blocks cached by default.

Good question, I can't tell really without a use case although. IIUC, currently with HFile v2 setting block cache to false doesn't even cache index blocks? If so, my opinion would be that we should always cache index blocks and don't even bother having a config for that. I'm curious as if there's ever a situation where you wouldn't want to cache index blocks since they are so small.

Jean-Daniel Cryans
added a comment - 17/Nov/11 22:24 Is useful for anybody?
I'm sure it would be.
Should this maybe be set per CF?
Good question, I can't tell really without a use case although. IIUC, currently with HFile v2 setting block cache to false doesn't even cache index blocks? If so, my opinion would be that we should always cache index blocks and don't even bother having a config for that. I'm curious as if there's ever a situation where you wouldn't want to cache index blocks since they are so small.

Lars Hofhansl
added a comment - 17/Nov/11 22:32 That might be a better option (always cache index blocks), looks like that'd be some amount of refactoring (as CacheConfig.blockCache is currently null if cache size is set to 0).
The 2nd part of this is classifying index blocks (optionally) as in memory, so they are less likely to be evicted due to churn of data blocks.

Jean-Daniel Cryans
added a comment - 17/Nov/11 22:40 I'm fine with not caching index blocks if there's no cache, but maybe we should educate our users instead to disable block caching on their tables instead in order to have index caching.
The 2nd part of this is classifying index blocks (optionally) as in memory, so they are less likely to be evicted due to churn of data blocks.
Yeah... I wonder if it's really going to make a difference...

Looks like it in the code. blockCache is only null if hfile.block.cache.size is set to 0. If it is not null, the cache setting is already overridden for index blocks in HFileBlockIndex.seekToDataBlock.

Lars Hofhansl
added a comment - 17/Nov/11 23:05 Looks like it in the code. blockCache is only null if hfile.block.cache.size is set to 0. If it is not null, the cache setting is already overridden for index blocks in HFileBlockIndex.seekToDataBlock.
I'll verify then close this if confirmed.

Mikhail Bautin
added a comment - 09/Dec/11 21:54 This is from HFileBlockIndex.java, line 202 (trunk): boolean shouldCache = cacheBlocks || (lookupLevel < searchTreeLevel)
On the other hand, in HFileReaderV2 we have: cacheBlock &= cacheConf.shouldCacheDataOnRead().
So currently, when we disable block caching, we don't even cache data blocks, and there is a valid issue to be fixed in this JIRA.

Lars Hofhansl
added a comment - 09/Dec/11 22:02 Thanks Mikkhail.
So it's indeed not possible to achieve this currently (by setting disabling BLOCKCACHE on the CF).
I'll post a patch when I get to this (unless you want to )

Lars: we won't be able to cache index blocks if the block cache is disabled completely. However, if it is enabled, we should always cache index blocks. Please let me know if your understanding of this issue is different.

Mikhail Bautin
added a comment - 12/Dec/11 18:39 Lars: we won't be able to cache index blocks if the block cache is disabled completely. However, if it is enabled, we should always cache index blocks. Please let me know if your understanding of this issue is different.

Actually shouldCacheDataOnRead is weird. It is true by default as long as block cache is enabled, unless it is explicitly disabled in cacheConf. So I don't really know if that particular line ever becomes a problem when block cache is enabled.

Mikhail Bautin
added a comment - 12/Dec/11 21:34 Actually shouldCacheDataOnRead is weird. It is true by default as long as block cache is enabled, unless it is explicitly disabled in cacheConf. So I don't really know if that particular line ever becomes a problem when block cache is enabled.

This patch for 0.92 makes it work for me. I'm backporting a few things from trunk, adding the method that Lars had in his patch and doing an additional check in readBlock to make sure we always cache indexes if there's an LRU.

Jean-Daniel Cryans
added a comment - 12/Dec/11 22:14 This patch for 0.92 makes it work for me. I'm backporting a few things from trunk, adding the method that Lars had in his patch and doing an additional check in readBlock to make sure we always cache indexes if there's an LRU.

Making sure that we always cache index and bloom blocks, as long as we have block cache, even if the cache is disabled for the CF. Adding a unit test to verify this using metrics. In process of doing this, I've factored out a region creation method in HBaseTestingUtility, and cleaned up a few things in HFile readers/writers. I have also categorized HFile v1 bloom blocks (that are technically meta blocks) as bloom blocks in metrics.

lhofhansl has commented on the revision "[jira]HBASE-4683 Always cache index and bloom blocks".

Looks good to me generally. The metric changes are unrelated, though. Correct?

INLINE COMMENTS
src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java:192 This is an unrelated change it seems
src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java:248 another unrelated change

Actually, the metrics changes are important. The unit test uses per-CF metrics to determine whether the blocks have been cached (or not cached, for data blocks), and when running the unit test I discovered some bugs in how table and CF name are passed down to HFile readers/writers, so I had to clean that up a bit.

Phabricator
added a comment - 13/Dec/11 21:21 jdcryans has commented on the revision " [jira] HBASE-4683 Always cache index and bloom blocks".
I only have one question with compactions, after that I'm +1
INLINE COMMENTS
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:293 Are we caching even in the case of isCompaction?
REVISION DETAIL
https://reviews.facebook.net/D807

mbautin has commented on the revision "[jira]HBASE-4683 Always cache index and bloom blocks".

INLINE COMMENTS
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:293 Compactions mostly consist of continuous reads and should not do a lot of seeks. But yes, this will cache index/bloom blocks even in case of compactions. I guess I could add a unit test to make sure we really don't utilize the block index during compactions, and we can explicitly disable them here. Or we could leave it as it is in case compactions do use the block index in a limited way (need to find out if that is the case).
@jdcryans: any thoughts/suggestions on this?

Phabricator
added a comment - 14/Dec/11 01:58 mbautin has commented on the revision " [jira] HBASE-4683 Always cache index and bloom blocks".
INLINE COMMENTS
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:293 Compactions mostly consist of continuous reads and should not do a lot of seeks. But yes, this will cache index/bloom blocks even in case of compactions. I guess I could add a unit test to make sure we really don't utilize the block index during compactions, and we can explicitly disable them here. Or we could leave it as it is in case compactions do use the block index in a limited way (need to find out if that is the case).
@jdcryans: any thoughts/suggestions on this?
REVISION DETAIL
https://reviews.facebook.net/D807

Phabricator
added a comment - 14/Dec/11 02:30 lhofhansl has commented on the revision " [jira] HBASE-4683 Always cache index and bloom blocks".
I would guess that even compactions do reseeks to the next column.
Those reseeks would be slow without the index blocks in cache.
REVISION DETAIL
https://reviews.facebook.net/D807

Phabricator
added a comment - 14/Dec/11 02:56 jdcryans has accepted the revision " [jira] HBASE-4683 Always cache index and bloom blocks".
My testing of the 0.92 patch I put in the jira shows that it works as advertised, so +1. @mbautin would you mind giving a quick at that patch to make sure I didn't do anything weird?
REVISION DETAIL
https://reviews.facebook.net/D807

@jdcryans: you committed this one, right? Closing the Phabricator revision.

A reminder to committers (quoting @nspiegelberg) so that we can properly close Phabricator revisions when committed (cc: @tedyu, @stack).

> since we've decided to allow more Metadata in our commit logs, it would be nice to use the patch comments when committing this diff. Namely, including 'Differential Revision: #' will allow Phabricator to automatically mark the review as closed.

Here are the steps to set up Arcanist and Phabricator in your working directory so you can auto-generate the commit message:

Phabricator
added a comment - 19/Dec/11 22:09 mbautin has abandoned the revision " [jira] HBASE-4683 Always cache index and bloom blocks".
Added CCs: nspiegelberg, stack, tedyu
@jdcryans: you committed this one, right? Closing the Phabricator revision.
A reminder to committers (quoting @nspiegelberg) so that we can properly close Phabricator revisions when committed (cc: @tedyu, @stack).
> since we've decided to allow more Metadata in our commit logs, it would be nice to use the patch comments when committing this diff. Namely, including 'Differential Revision: #' will allow Phabricator to automatically mark the review as closed.
Here are the steps to set up Arcanist and Phabricator in your working directory so you can auto-generate the commit message:
mvn -Darc initialize
arc amend --revision <revision_number>
You can install Arcanist (arc) by following the instructions at http://www.phabricator.com/docs/phabricator/article/Arcanist_User_Guide.html . More instructions about Phabricator/JIRA integration are available at https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview (that is a howto for Hive but it should be very similar for HBase).
REVISION DETAIL
https://reviews.facebook.net/D807

Mikhail Bautin
added a comment - 10/Feb/12 03:45 @jdcryans: it looks like TestForceCacheImportantBlocks.java was not included into your trunk commit of this JIRA. I will verify that the test still works and post an addendum patch.

Summary: This is a unit test that should have been part of HBASE-4683 but was
not committed. The original test was reviewed as part ofhttps://reviews.facebook.net/D807. Submitting unit test as a separate JIRA and
patch, and extending the scope of the test to also handle the case when block
cache is enabled for the column family.

Hudson
added a comment - 10/Feb/12 21:20 Integrated in HBase-TRUNK #2658 (See https://builds.apache.org/job/HBase-TRUNK/2658/ )
[jira] HBASE-5382 Test that we always cache index and bloom blocks
Summary: This is a unit test that should have been part of HBASE-4683 but was
not committed. The original test was reviewed as part of
https://reviews.facebook.net/D807 . Submitting unit test as a separate JIRA and
patch, and extending the scope of the test to also handle the case when block
cache is enabled for the column family.
Test Plan: Run unit tests
Reviewers: JIRA, jdcryans, lhofhansl, Liyin
Reviewed By: jdcryans
CC: jdcryans
Differential Revision: https://reviews.facebook.net/D1695
mbautin :
Files :
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestForceCacheImportantBlocks.java

Summary: This is a unit test that should have been part of HBASE-4683 but was
not committed. The original test was reviewed as part ofhttps://reviews.facebook.net/D807. Submitting unit test as a separate JIRA and
patch, and extending the scope of the test to also handle the case when block
cache is enabled for the column family.

Hudson
added a comment - 11/Feb/12 05:53 Integrated in HBase-TRUNK-security #108 (See https://builds.apache.org/job/HBase-TRUNK-security/108/ )
[jira] HBASE-5382 Test that we always cache index and bloom blocks
Summary: This is a unit test that should have been part of HBASE-4683 but was
not committed. The original test was reviewed as part of
https://reviews.facebook.net/D807 . Submitting unit test as a separate JIRA and
patch, and extending the scope of the test to also handle the case when block
cache is enabled for the column family.
Test Plan: Run unit tests
Reviewers: JIRA, jdcryans, lhofhansl, Liyin
Reviewed By: jdcryans
CC: jdcryans
Differential Revision: https://reviews.facebook.net/D1695
mbautin :
Files :
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestForceCacheImportantBlocks.java