Details

Description

NameNode uses a java.util.HashMap to store BlockInfo objects. When there are many blocks in HDFS, this map uses a lot of memory in the NameNode. We may reduce the memory usage with a lightweight hash table implementation.

Tsz Wo Nicholas Sze
added a comment - 28/Apr/10 08:31 In Java 1.6, java.util.HashMap has an array of HashMap.Entry
//HashMap
transient Entry[] table;
where HashMap.Entry has the following members
//HashMap.Entry
final K key;
V value;
Entry<K,V> next;
final int hash;
In the BlocksMap, we have the following invariants:
- K == V == BlockInfo
- key == value, i.e. they point to the same BlockInfo object.
Therefore, we may reduce the memory footprint by
- eliminating the HashMap.Entry class,
- keeping only one object reference, and
- possibly eliminating next and hash.
Will provide more design details.
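To make the idea concrete, here is a minimal editorial sketch of such an intrusive hash set (hypothetical names; the patch later in this thread introduces the actual GSet/LightWeightGSet classes and a LinkedElement interface). The element itself carries the hash-chain pointer, so the table is a bare array of elements and no per-entry wrapper object is allocated:

//Editorial sketch; not the actual patch code.
interface LinkedElement {
  void setNext(LinkedElement next);
  LinkedElement getNext();
}

public class SketchGSet<E extends LinkedElement> {
  private final LinkedElement[] entries; // length must be a power of two
  private final int mask;

  public SketchGSet(int lengthPowerOfTwo) {
    entries = new LinkedElement[lengthPowerOfTwo];
    mask = lengthPowerOfTwo - 1;
  }

  private int index(Object key) {
    return key.hashCode() & mask; // cheap modulo for power-of-two lengths
  }

  /** Add an element, chaining through the element itself. */
  public void put(E element) {
    final int i = index(element);
    element.setNext(entries[i]); // the element is its own entry
    entries[i] = element;        // one reference instead of key + value + next + hash
  }

  /** Find the stored element that equals the given key. */
  @SuppressWarnings("unchecked")
  public E get(Object key) {
    for (LinkedElement e = entries[index(key)]; e != null; e = e.getNext()) {
      if (e.equals(key)) {
        return (E) e;
      }
    }
    return null;
  }
}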

Tsz Wo Nicholas Sze
added a comment - 30/Apr/10 08:33 I believe we can save from 24 to 40 bytes per entry. It depends on the chosen implementation (will give more details later).
In a large cluster, there are ~60m blocks, so we may save from 1.5GB to 2.5GB of NN memory (roughly 60×10^6 × 24-40 bytes per entry).

Suresh Srinivas
added a comment - 03/Jun/10 17:38
1. Not using a supplemental hash function will result in severe clustering when we move to sequential block IDs (as only higher bits are used for hash).
2. Why do we need configurability of either using the Java HashMap or this new implementation? With the new impl, BlockInfo implements the LinkedElement interface. On switching to the Java HashMap, would it continue to implement this interface and incur the cost of the next member in BlockInfo?
3. In "Arrays" section the GC behavior description was not clear. Not sure how the GC behavior is better with arrays?
4. A static array size for the map simplifies the code, but pushes complexity to the cluster admin by adding one more configuration. This configuration is an internal implementation detail which a cluster admin may not understand and get right. If it is configured wrong and the cluster continues to work, the cluster admin may not be aware of the performance degradation.
5. I feel we should implement resizing to avoid introducing a config param. It is a rare event on a stable cluster. The NN has enough heap headroom to account for floating garbage and the YG guarantee; hence availability of memory should not be an issue. In the worst-case scenario, a resize may trigger a full GC.
6. If we implement resizing, we should also think about the 2^N table size, as it has the potential to waste a lot of memory during doubling, especially considering the millions of entries in the table.
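For scale (editorial arithmetic, assuming 8-byte references): ~60m blocks need a table of 2^26 ≈ 67m slots, about 0.5GB; doubling it allocates a 2^27-slot array of about 1GB while the old array is still live, so roughly 1.5GB of heap must be free for the duration of the rehash.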

Tsz Wo Nicholas Sze
added a comment - 08/Jun/10 17:34 > 1. Not using a supplemental hash function will result in severe clustering when we move to sequential block IDs (as only higher bits are used for hash).
It is not a problem in our case because the hashCode() implementation in Block uses both the higher and lower bits of the block ID.
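For illustration, an editorial sketch of a hashCode() that mixes both halves of a 64-bit ID (see Block.java for the actual implementation; the XOR-fold is the same pattern java.lang.Long.hashCode() uses):

//Editorial sketch; see Block.java for the real code.
public class SketchBlock {
  private long blockId;

  @Override
  public int hashCode() {
    // XOR-fold the higher 32 bits onto the lower 32 bits, so sequential
    // IDs still change the low bits of the hash.
    return (int) (blockId ^ (blockId >>> 32));
  }
}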
> 3. In "Arrays" section the GC behavior description was not clear. Not sure how the GC behavior is better with arrays?
The GC algorithm traverses the objects to determine which objects can be garbage collected. The GC behavior is better with arrays in the sense that there are fewer object references for the GC to traverse.
> 2, 4, 5 & 6
All these items are related to configuration and the hash table length. See the new design doc.
gset20100608.pdf: rewrote Section 7

Scott Carey
added a comment - 14/Jun/10 03:20 If you are using a power-of-two hash table, you can avoid problems caused by hash value clustering by using a Fibonacci Hash. Essentially, use the multiplicative hash with a special value g:
(h * g) & mask
where h is the hash value and g is the 'golden ratio' number for the size of the table used. Since multiplication on today's processors is far faster than division or remainders, this can be used to 'uncluster' hash values. A single consecutive run of values gets maximally distributed into the space, and high and low bits are redistributed evenly so that the mask does not increase collisions. Whether this is a desired property or not will depend on the properties of the hash values and whether or not an open addressing solution is used.
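A minimal editorial sketch of this multiplicative hash, assuming a 32-bit hash value and a table of length 2^k; the constant 0x9E3779B9 is floor(2^32 / phi). A common formulation keeps the high bits of the product, where the mixing is strongest, rather than masking the low bits:

//Editorial sketch of Fibonacci (multiplicative) hashing.
public final class FibonacciHash {
  private static final int PHI = 0x9E3779B9; // floor(2^32 / golden ratio), odd

  /** Map a 32-bit hash value to a slot in a table of length 2^k. */
  public static int indexFor(int h, int k) {
    return (h * PHI) >>> (32 - k); // keep the top k bits of the product
  }

  public static void main(String[] args) {
    // Sequential inputs spread across a 16-slot table: prints 0, 9, 3, 13.
    for (int h = 0; h < 4; h++) {
      System.out.println(h + " -> " + indexFor(h, 4));
    }
  }
}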
Open addressing can further reduce the memory footprint by allowing the raw object to be placed in the map instead of a container object or list.
some links found from a few searches:
http://www.brpreiss.com/books/opus4/html/page214.html
http://staff.newport.ac.uk/ctubb01/ct/advp/hashtables.pdf
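And a compact editorial sketch of the open-addressing idea above, assuming linear probing: the elements sit directly in the slot array, so there is no container object and no per-element next pointer (deletion and resizing are omitted for brevity):

//Editorial sketch of open addressing with linear probing.
public class OpenAddressedSet<E> {
  private final Object[] slots; // length must be a power of two
  private final int mask;

  public OpenAddressedSet(int lengthPowerOfTwo) {
    slots = new Object[lengthPowerOfTwo];
    mask = lengthPowerOfTwo - 1;
  }

  /** Insert, probing the next slot on collision; the table must not fill up. */
  public void put(E element) {
    int i = element.hashCode() & mask;
    while (slots[i] != null && !slots[i].equals(element)) {
      i = (i + 1) & mask;
    }
    slots[i] = element;
  }

  /** Probe from the home slot until the key or an empty slot is found. */
  @SuppressWarnings("unchecked")
  public E get(Object key) {
    int i = key.hashCode() & mask;
    while (slots[i] != null) {
      if (slots[i].equals(key)) {
        return (E) slots[i];
      }
      i = (i + 1) & mask;
    }
    return null;
  }
}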

Tsz Wo Nicholas Sze
added a comment - 14/Jun/10 16:02 Scott, thanks for your comments and the useful links.
Since Block IDs are random, we don't need the extra computation in our case. For a general hash table implementation, it is definitely a good idea.

Tsz Wo Nicholas Sze
added a comment - 14/Jun/10 22:33 h1114_20100614b.patch:
changed BlocksMap and BlockInfo in order to use LightWeightGSet.
changed TestGSet for faster execution time. It took 3.141 seconds on my machine.
added new tests for the exception cases.
Have run a large set of tests (see TestGSet.runMultipleTestGSet()) with various parameters. It took 5.5 hours and passed all tests.

Suresh Srinivas
added a comment - 16/Jun/10 23:37
BlocksMap.java
Typo: exponient. Should be exponent?
Capacity should be divided by a reference size of 8 or 4, depending on whether the Java version is 64-bit or 32-bit (a sketch of one possible calculation follows this list).
The current capacity calculation seems quite complex. Add more explanation on why it is implemented that way.
LightWeightGSet.java
"which uses a hash table for storing the elements": should this say "uses an array"?
Add a comment that the size of entries is a power of two.
Throw HadoopIllegalArgumentException instead of IllegalArgumentException (for the 20 version of the patch it could remain IllegalArgumentException).
remove() - for better readability, there is no need for else if and else, since the previous block returns.
toString() - prints all the entries. This is a bad idea if someone passes this object to a Log unknowingly. If all the details of the map are needed, we should have some other method, such as dump() or printDetails(), to do the same.
TestGSet.java
In the exception tests, instead of printing a log when the expected exception happens, print a log in Assert.fail(), like Assert.fail("Expected exception was not thrown"). Checks for exceptions should be more specific, instead of catching Exception. It is also a good idea to document these exceptions in the javadoc for the methods in GSet.
println should use Log.info instead of System.out.println?
Add some comments to classes on what they do and how they are used.
Add some comments to the GSetTestCase members (denominator etc.) and the constructor.
Add comments to testGSet() on what each of the cases is accomplishing.
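For context, an editorial sketch of the kind of capacity calculation being discussed (the patch's actual code may differ in detail): take a fraction of the maximum heap, divide by the reference size, and round down to a power of two.

//Editorial sketch of the capacity calculation discussed above.
public final class CapacitySketch {
  /** A power-of-two table length using roughly the given percentage of the heap. */
  public static int computeCapacity(double percentage) {
    // Reference size: 8 bytes on a 64-bit JVM, 4 bytes on 32-bit.
    final boolean is64Bit = "64".equals(System.getProperty("sun.arch.data.model"));
    final int referenceSize = is64Bit ? 8 : 4;

    final long maxMemory = Runtime.getRuntime().maxMemory();
    final long slots = (long) (maxMemory * percentage / 100.0) / referenceSize;

    // Round down to a power of two and cap the array length.
    final int exponent = 63 - Long.numberOfLeadingZeros(Math.max(slots, 1));
    return 1 << Math.min(exponent, 30);
  }
}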

Scott Carey
added a comment - 17/Jun/10 02:29 > Capacity should be divided by a reference size 8 or 4 depending on the 64bit or 32bit java version
What about -XX:+UseCompressedOops?
All users should be using this flag on a 64-bit JVM to save a lot of space. It only works up to -Xmx32G though; beyond that it's large pointers again.

Suresh Srinivas
added a comment - 17/Jun/10 17:44
For figuring out 64-bit, should we consider the max heap size? If the max heap size is > 2G, consider it a 64-bit machine. Since the max heap size on 32-bit machines varies from 1.4G to 2G, machines in that range could be wrongly classified as 32-bit. Is this an alternative worth considering?
Minor: "print detail" to "print detailed".
Minor: For end-of-line comments, should there be a space after //? The Java coding conventions do not explicitly talk about this. Currently there are 3043 comments with a space after // and 384 without.
Minor: In the exception tests, in my previous comment, what I meant was that you are better off printing to the log in Assert.fail(). Printing a log when the expected thing happens is not that useful. That said, this is minor; you can leave it as it is.
I am not sure what the point of commenting out the 5-hour test is. When do we expect it to be uncommented and run? Should it be moved to some other test that is run as a smoke test for release qualification?

Tsz Wo Nicholas Sze
added a comment - 17/Jun/10 18:58 h1114_20100617b.patch: slightly changed the comments and removed unnecessary spaces.
I did not change the capacity calculation because the current computation is conservative in the special cases.

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 17 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

Scott Carey
added a comment - 19/Jun/10 04:38 This is a good point. Is there a way to determine if UseCompressedOops is set in runtime?
Well, there is ManagementFactory.getRuntimeMXBean().getInputArguments(), but later versions of Java are going to be making +UseCompressedOops the default.
There is also a way to check if the VM is 64-bit or 32-bit, either via ManagementFactory or one of the system properties. Digging around, I don't see it, but I have used it before. I think it is vendor-specific though.

Tsz Wo Nicholas Sze
added a comment - 23/Jun/10 17:17 > There is also a way to check if the VM is 64-bit or 32-bit, either via ManagementFactory or one of the system properties. Digging around, I don't see it, but I have used it before. I think it is vendor-specific though.
I used System.getProperty("sun.arch.data.model") in the patch. See http://java.sun.com/docs/hotspot/HotSpotFAQ.html#64bit_detection
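An editorial sketch of this detection approach ("sun.arch.data.model" is HotSpot-specific, per the FAQ above, and scanning the JVM input arguments only sees an explicitly passed -XX:+UseCompressedOops flag, not a default):

//Editorial sketch of 32/64-bit and UseCompressedOops detection.
import java.lang.management.ManagementFactory;

public final class JvmDataModel {
  public static void main(String[] args) {
    final String dataModel = System.getProperty("sun.arch.data.model"); // "32" or "64"
    System.out.println("data model: " + dataModel + "-bit");

    final boolean compressedOops = ManagementFactory.getRuntimeMXBean()
        .getInputArguments().contains("-XX:+UseCompressedOops");
    System.out.println("-XX:+UseCompressedOops passed explicitly: " + compressedOops);
  }
}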