enable HDFS local reads via mmap

Description

Currently, the short-circuit local read pathway allows HDFS clients to access files directly without going through the DataNode. However, all of these reads involve a copy at the operating-system level, since they rely on the read()/pread() family of kernel interfaces.

We would like to enable HDFS to read local files via mmap. This would enable truly zero-copy reads.

In the initial implementation, zero-copy reads will only be performed when checksums are disabled. Later, we can use the DataNode's cache awareness to perform zero-copy reads only when we know the checksum has already been verified.
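The client-side idea can be sketched in plain Java with FileChannel.map, which wraps mmap(2): the returned MappedByteBuffer reads bytes straight out of the page cache with no intermediate read() copy. This is an illustrative standalone sketch, not code from the patch:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch: map a local file and read it directly out of the
// page cache. get() below touches the mapped pages; there is no copy into
// a user-space array as there would be with read()/pread().
public class MmapReadSketch {
    static long sumBytes(Path p) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xff;  // reads straight from the mapping
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("mmap-demo", ".bin");
        Files.write(p, new byte[]{1, 2, 3, 4});
        System.out.println(sumBytes(p));  // 1+2+3+4 = 10
        Files.delete(p);
    }
}
```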

Colin Patrick McCabe
added a comment - 13/Jul/13 01:15

well... zero copy for remote readers isn't possible, by definition. I cannae change the laws o' physics, cap'n.

We did some microbenchmarks where the existing short-circuit implementation (SCR) got around 2500 MB/s, and zero-copy reads (ZCR) got around 7500 MB/s. We found that the advantage disappeared if we had to re-create the mmap on each read through the file; that's why this change includes a long-lived cache of mmaps in ClientMmapManager.

Todd Lipcon
added a comment - 13/Jul/13 01:41

Attached a benchmark plot I did in support of this work. The benchmark times processing a 1GB file which fits in buffer cache. The "processing" here is summing up the entire file as if it's a bunch of integers end-to-end (written using SSE, etc. to be as efficient as possible).

The various items plotted here are:

h: the current libhdfs code, with SCR but without ZCR. Averages around 3GB/sec or so.

m: a C program which mallocs 1GB, reads the data into that buffer, and then runs the analysis on the malloced buffer. This is the upper-bound performance. Gets about 8GB/sec.

mmap-each: a C program which opens a local file and, on each iteration of processing, calls mmap and then processes the mapping. Gets about 3GB/sec. "perf top" indicates that this is slow because of page-table-entry population overhead (minor page faults).

mmap-populate-each: the same, but with the MAP_POPULATE flag. Gets around 4500MB/sec. This is faster because it pre-populates the page table entries.

mmap-once: the same, but only mmaps once, and doesn't count the mmap time. Gets around the same speed as the "malloc" path.

z: the ZCR implementation without the mmap caching. Gets the same as mmap-each, more or less, because of the same PTE-faulting overhead.

These graphs show why we have to have the mmap cache, and indicate that, with that cache, we should be in the same ballpark as the optimal (~9GB/sec/core).
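The "mmap-each" vs "mmap-once" distinction can be sketched in Java as well (the actual benchmarks were C programs; this is an illustrative analogue with invented names and a toy file size): re-creating the mapping on every pass repays the page-table setup cost each time, while a cached mapping pays it once, which is exactly what ClientMmapManager amortizes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public class MmapCacheSketch {
    // Sum the file's contents as big-endian ints, reading via the mapping.
    static long sumInts(ByteBuffer buf) {
        IntBuffer ints = buf.asIntBuffer();  // view; does not move buf's position
        long sum = 0;
        while (ints.hasRemaining()) sum += ints.get();
        return sum;
    }

    // "mmap-each": create a fresh mapping on every pass, repaying the
    // page-table population cost (minor faults) each time.
    static long mapEach(FileChannel ch, int passes) throws IOException {
        long sum = 0;
        for (int i = 0; i < passes; i++) {
            sum = sumInts(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
        return sum;
    }

    // "mmap-once": map a single time and reuse the mapping across passes,
    // as a long-lived mmap cache would.
    static long mapOnce(FileChannel ch, int passes) throws IOException {
        ByteBuffer cached = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        long sum = 0;
        for (int i = 0; i < passes; i++) {
            sum = sumInts(cached);
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        java.nio.file.Path p = java.nio.file.Files.createTempFile("zcr-demo", ".bin");
        ByteBuffer bb = ByteBuffer.allocate(16);
        for (int i = 1; i <= 4; i++) bb.putInt(i);
        java.nio.file.Files.write(p, bb.array());
        try (FileChannel ch = FileChannel.open(p, java.nio.file.StandardOpenOption.READ)) {
            System.out.println(mapEach(ch, 3) + " " + mapOnce(ch, 3));  // 10 10
        }
        java.nio.file.Files.delete(p);
    }
}
```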

Todd Lipcon
added a comment - 13/Jul/13 01:43

Oh, one other note about the benchmark graphs: you see some bimodal behavior in some of the graphs, where a portion of the reads run ~30% slower than the rest. Our best assumption is that this is NUMA behavior (the tests were run on a dual-socket system). So, I think we should probably call mbind with MPOL_INTERLEAVE on our mmaped segments so that we have more consistent performance.

Another option would be to mbind the mapped segments to static nodes and then expose the NUMA zone info to clients, so we can get socket-local scheduling. But let's leave that for later.

Uma Maheswara Rao G
added a comment - 15/Jul/13 21:10

    if (close) {
    -  out.close();
       out = null;
    +  out.close();
       in.close();

I think you have changed this by mistake?

I think, if we read different blocks' data here, we need to map each block separately, right? So the cache gets filled quickly; what is the result in that kind of scenario?

Also, for smaller files this may be overhead, since the mapping overhead may exceed the reads. (Can we add a minimum block size for mmap?)

Did you try implementing a native call for unmap instead of depending on the Sun package? In fact, I too introduced a dependency on this Cleaner code when I tried introducing this on the write path at DNs in my internal branch.

Colin Patrick McCabe
added a comment - 15/Jul/13 21:28

> I think you have changed this by mistake?

Yeah, I'm not sure how that got in there.

> I think, if we read different blocks' data here, we need to map each block separately, right? So the cache gets filled quickly; what is the result in that kind of scenario?

When the cache is full and someone wants to mmap an additional block, we have to evict something. If nothing can be evicted, we simply do not mmap.

In general, mmap will obviously be a much bigger win for programs which access the same files over and over than for "scan"-type workloads. One way to think about it is that mmap reduces the CPU bottleneck of read operations. But if your reads are hitting disk, that bottleneck is irrelevant, since you are waiting for I/O. This is inherent in the way mmap works.

> Also, for smaller files this may be overhead, since the mapping overhead may exceed the reads. (Can we add a minimum block size for mmap?)

It seems reasonable to me for people to have small files in the mmap cache if they are read frequently. Whether or not the file is accessed frequently is a better predictor of how much performance benefit we'll get than block size, I believe.

Another thing to note is that this work will eventually be integrated with the ongoing caching work in HDFS-4949. Our first pass will probably only allow mmap on blocks which are already in the HDFS-4949 cache.

> Did you try implementing a native call for unmap instead of depending on the Sun package?

Yeah, it's a shame we have to add that. However, not everyone runs with the native code enabled, and I would hesitate to add a dependency that is not really needed. What benefit do you think there would be to using JNI for this?

Colin Patrick McCabe
added a comment - 15/Jul/13 22:55

fix IOUtils goof
BlockReaderLocal#close should unref the mmap.
mmaps must be removed from ClientMmapManager#evictable using their lastEvictableTime, not ClientMmapManager#Key.
Refactor some things in ClientMmapManager out into functions. Use java.util.concurrent.locks.Condition, because Object#wait and friends cannot wait on a lock other than the Object's own lock.
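The point about Condition is that, unlike Object#wait, a Condition is bound to an explicit Lock, so wait/notify semantics can be had on a lock object that isn't the waiting object itself. A minimal standalone sketch (illustrative, not code from the patch):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Condition vs Object#wait: the condition below is tied to an explicit
// ReentrantLock, so the class can guard its state with one lock and still
// have a dedicated wait-set, which Object#wait cannot express.
public class ConditionSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition nonEmpty = lock.newCondition();
    private int items = 0;

    void put() {
        lock.lock();
        try {
            items++;
            nonEmpty.signal();        // wake one waiter on this condition
        } finally {
            lock.unlock();
        }
    }

    int take() throws InterruptedException {
        lock.lock();
        try {
            while (items == 0) {
                nonEmpty.await();     // atomically releases `lock` while waiting
            }
            return --items;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConditionSketch s = new ConditionSketch();
        Thread producer = new Thread(s::put);
        producer.start();
        int left = s.take();          // blocks until the producer signals
        producer.join();
        System.out.println("items=" + left);  // items=0
    }
}
```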

Colin Patrick McCabe
added a comment - 16/Jul/13 20:19

fix JavaDoc warning
implement Key#equals to keep FindBugs happy (even though it's never used)
add a suppression for the other 2 warnings, which result from FindBugs not understanding that we call a function with the lock held.
The javac warnings come from using the Sun unmap functions; nothing we can do about those warnings.

Tsuyoshi Ozawa
added a comment - 26/Jul/13 05:10

This work is very interesting for increasing the throughput of HDFS. IMHO, more benchmarks are needed to clarify when to use this feature. For instance:

1. Reading large objects which don't fit in memory.
2. Basic benchmarks on CentOS 5. Old Linux kernels take page faults under the giant lock when a user program touches virtual memory which is mapped to a file on disk by mmap(2). This can be a problem for CDH users, because CDH supports CentOS 5 [1].

[1] http://lwn.net/Articles/200215/

Colin Patrick McCabe
added a comment - 26/Jul/13 20:08

The code includes a fallback path which will be used if zero-copy reads are disabled. Setting dfs.client.mmap.cache.size to 0 will prevent mmap from happening.

RHEL 5 does have some performance problems, but those are outside the scope of this JIRA, I think. If mmap does not provide a performance improvement on that platform, users of that platform can disable it.
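For reference, disabling the mmap cache would look something like this in hdfs-site.xml. The property name is taken from the comment above; treat the exact semantics as version-dependent:

```xml
<!-- Disable client-side mmap entirely by giving the mmap cache zero slots. -->
<property>
  <name>dfs.client.mmap.cache.size</name>
  <value>0</value>
</property>
```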

Colin Patrick McCabe
added a comment - 26/Jul/13 20:09

I'd also like to add that RHEL 5 is six years old, and most Hadoop users have migrated away from it already. There are a few new features coming down the pipe, like resource management via cgroups, that are not going to be supported on RHEL 5.

Tsuyoshi Ozawa
added a comment - 29/Jul/13 15:42

> Setting dfs.client.mmap.cache.size to 0 will prevent mmap from happening.
> If mmap does not provide a performance improvement on that platform, users of that platform can disable it.

OK. Then we should document these points explicitly. Should we create a new JIRA to do that?

Tsuyoshi Ozawa
added a comment - 31/Jul/13 09:27

Thanks for your work, Colin.

I looked at your patch. The patch now uses a reference-count-based memory-management strategy. Users therefore need to manage cursors themselves and run the risk of memory leaks. A PhantomReference-based memory-management strategy can protect users from leaks, because we can hook cursors which are unreferenced but still open, based on the GC's reachability information. We could do this in another JIRA.

http://docs.oracle.com/javase/7/docs/api/java/lang/ref/PhantomReference.html

Colin Patrick McCabe
added a comment - 31/Jul/13 18:40

The way cursors are managed is consistent with the way we manage open files in HDFS: the user opens them and must remember to close them later. Garbage collection is not a good solution for memory-mapped areas, for the same reason it is not a good solution for open files. The JVM does not realize that the object has additional, non-memory resources associated with it. We might run out of file descriptors long before a GC is triggered.
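The explicit open/close discipline described here can be sketched as a hypothetical reference-counted handle (names invented for illustration, not from the patch): the resource is released deterministically when the last holder calls unref, rather than whenever a GC happens to run.

```java
// Illustrative sketch of refcounted lifetime management: the cache may only
// unmap once the count drops to zero, and that happens eagerly, at the
// moment of the last unref, independent of garbage collection.
public class RefCountedMmap {
    private int refCount = 1;      // the opener holds the first reference
    private boolean unmapped = false;

    synchronized void ref() {
        if (unmapped) throw new IllegalStateException("already unmapped");
        refCount++;
    }

    synchronized void unref() {
        if (--refCount == 0) {
            unmapped = true;       // real code would munmap here
        }
    }

    synchronized boolean isUnmapped() { return unmapped; }

    public static void main(String[] args) {
        RefCountedMmap m = new RefCountedMmap();
        m.ref();                   // a second reader takes a reference
        m.unref();                 // first reader done
        System.out.println(m.isUnmapped());  // false: one reference left
        m.unref();                 // last reader done
        System.out.println(m.isUnmapped());  // true: released eagerly
    }
}
```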

Colin Patrick McCabe
added a comment - 31/Jul/13 18:56

Just to comment on phantom references specifically: I do believe that PRs are better than finalizers. But they still suffer from the same problem of keeping the object around longer than it needs to be. In this case, that translates into keeping the mmap around longer than it needs to be. That consumes page table entries, and may prevent us from unmapping a memory map which really has not been used for a long time. This in turn may lead to us not creating another mmap that we should have created, because the cache is full, etc.

My understanding of PRs (correct me if I'm wrong) is that you generally have to have a thread that keeps polling the PR to see if it's ready to be disposed of. This is extra overhead for the users who do remember to correctly call close().

Tsuyoshi Ozawa
added a comment - 02/Aug/13 04:28

> We might run out of file descriptors long before a GC is triggered.

I see; that way of thinking seems natural to me.

> That consumes page table entries, and may prevent us from unmapping a memory map which really has not been used for a long time.

I agree with your opinion. However, we could let reference-count-based memory management (for advanced users) coexist with PhantomReference-based management (for casual users) as just an option to prevent memory leaks. What do you think?

> My understanding of PRs (correct me if I'm wrong) is that you generally have to have a thread that keeps polling the PR to see if it's ready to be disposed of. This is extra overhead for the users who do remember to correctly call close().

This is correct. Basically, we need a new thread to poll a ReferenceQueue, the event queue for PhantomReferences. If this overhead is a problem, we can add a new configuration option to switch it on and off.
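The ReferenceQueue polling described here might look roughly like the following (an illustrative sketch, not a proposal for the patch). Note that cleanup is only observed after a GC cycle enqueues the reference, which is exactly the non-determinism Colin objects to:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

// Sketch of a PhantomReference safety net: a polling loop drains a
// ReferenceQueue and would release the resource the user forgot to close.
// Timing depends entirely on when the GC runs.
public class PhantomSketch {
    static boolean awaitCleanup() throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object cursor = new Object();                  // stands in for a leaked cursor
        PhantomReference<Object> ref = new PhantomReference<>(cursor, queue);
        cursor = null;                                 // drop the last strong reference

        // The loop a cleaner thread would run: once the reference is
        // enqueued, real code would unmap the leaked mmap here.
        for (int i = 0; i < 50; i++) {
            System.gc();
            if (queue.remove(100) == ref) return true;
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cleaned=" + awaitCleanup());
    }
}
```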

Brandon Li
added a comment - 07/Aug/13 08:14

Colin Patrick McCabe: To remove the Sun package dependency, how about reusing the native buffer for a different file when its ref count is zero, rather than allocating and deallocating each time?

I didn't find the definition of struct hadoopZeroCopyCursor. Is a header file missing from the patch?

> Another thing to note is that this work will eventually be integrated with the ongoing caching work in HDFS-4949

Do you plan to convert it to a sub-task of HDFS-4949 and commit to that branch first?

Colin Patrick McCabe
added a comment - 07/Aug/13 20:26

> To remove the Sun package dependency, how about reusing the native buffer for a different file when its ref count is zero, rather than allocating and deallocating each time?

You cannot reuse an mmap for one file on a different file.

> I didn't find the definition of struct hadoopZeroCopyCursor. Is a header file missing from the patch?

We typecast struct hadoopZeroCopyCursor to a jobject whenever it is used. Since we don't dereference the type, there is no need for a type definition. Having the type is useful because it prevents callers from passing arbitrary pointers to hadoopZeroCopyRead and other functions.

I don't have a strong feeling about it either way, but I would prefer to get this into trunk, since I don't think there are any dependencies on HDFS-4949. Since there are other outstanding JIRAs on BlockReaderLocal, it would be nice to avoid conflicts.

Brandon Li
added a comment - 12/Aug/13 22:46

Colin Patrick McCabe: In the patch, DFSClient has the ClientMmapManager which manages the mmap cache. My impression of HDFS-4949 is that the mmap pool is managed by the DataNode. Is this patch aligned with the centralized cache management design?

Colin Patrick McCabe
added a comment - 12/Aug/13 23:11

HDFS-4949 is a related but independent change. In HDFS-4949, the DataNode manages regions of memory using mmap and mlock. The mmap on the DataNode side is done only to make mlock possible. The mmap on the client side is done to allow the client to read from the mmap rather than using the read syscall. mmap regions are specific to a particular process; an mmap in the DataNode is of no use to the client process. ClientMmapManager is named the way it is to make it obvious that it's a client thing, not a DataNode thing.

Brandon Li
added a comment - 14/Aug/13 02:28

Some comments and questions:

1. DFSClient: it looks like all the DFSClient instances share the same ClientMmapManager instance. If this is the case, why not have one static ClientMmapManager with a refcount on it, and remove the ClientMmapManagerFactory class and the mmapManager variable?
2. HdfsZeroCopyCursor: might want to also initialize allowShortReads in the constructor.
3. HdfsZeroCopyCursor: readViaSlowPath() throws when shortReads is disallowed but no fallbackBuffer is provided. Since shortReads is false by default, the user has to remember to call setFallbackBuffer before doing a zero-copy read. Not sure which case users expect more: shortReads allowed or disallowed.
4. DFSInputStream: remove the unused import and add a debug-level check around DFSClient.LOG.debug().
5. TestBlockReader: Assume.assumeTrue(SystemUtils.IS_OS_UNIX); guess you meant IS_OS_LINUX.
6. test_libhdfs_zerocopy.c: remove the repeated
   hdfsBuilderConfSetStr(bld, "dfs.block.size", TO_STR(TEST_ZEROCOPY_FULL_BLOCK_SIZE));
7. TestBlockReaderLocal.java: remove the unused import.
8. Please add javadoc to some classes, e.g. ClientMmap, ClientMmapManager.

Andrew Wang
added a comment - 14/Aug/13 03:00

Overall looks really solid, nice work. I have mostly nitty stuff, only a few potential bugs.

I haven't deduped this with Brandon's comments, apologies.

General

LOG.debug statements should be wrapped in LOG.isDebugEnabled checks, because of the limitations of log4j.

hdfs-default.xml

Has some extra lines of Java pasted in.
For the cache size, mention fds, virtual address space, and the application working set, as hints on how to size the cache properly.
The timeout javadoc says the timeout is a minimum, but unreferenced mmaps can be evicted before the timeout when under cache pressure.

ZeroCopyCursor

Let's beef up the class javadoc. We need some place that documents the big picture of ZCR and how to use it, e.g. how to get and use the cursor properly, the purpose of the fallback buffer and when it's used, and the implications of the skip-checksums and short-read options. Right now it's sort of there in the method javadocs, but it's hard to get a sense of how it all fits together. An example API usage snippet would be great.
read() javadoc: EOF here refers to an EOF when reading a block, not EOF of the HDFS file. Would prefer to see "end of block".

HdfsZeroCopyCursor

Would like to see allowShortReads explicitly set to false in the constructor, for clarity.

ZeroCopyUnavailableException

serialVersionUID should be private.

DFSClient

Comment on the lifecycle of MMAP_MANAGER_FACTORY; it's shared among multiple DFSClients, which is why the refcount is important.
Maybe rename put to unref or close? It's not actually "putting" in the data-structure sense, which is confusing.

ClientMmap

Let's not call people "bad programmers"; just say "accidentally leaked references".
unmap: add to the javadoc that it should only be called if the manager has been closed, or by the manager with the lock held.
MappedByteBuffer map= needs a space before the "=".

ClientMmapManager

In an offline discussion, I asked Colin about using Guava's cache instead of building up this reference-counting cache infrastructure. Colin explained that having a background CacheCleaner thread is preferable for more control and being able to do explicit cleanup, which is nice since each mmap uses up a file descriptor.
Let's add some javadoc on the lifecycle of cached mmaps, and mention why it's important to cache them (performance). IIUC, it only evicts unreferenced ClientMmaps, and does this either on cache pressure (when we do a fetch) or on a relatively long timescale in the background via the CacheCleaner (15 minutes).
I think fromConf-style factory methods are more commonly called get, e.g. FileSystem.get.
Why is the CacheCleaner executor using half the timeout for the delay and period? I'd think the delay can be the timeout (or timeout+period), since nothing will expire naturally before that. For the period, I guess we need some arbitrary staleness bound (and timeout/2 seems reasonable), but this might be worth mentioning in hdfs-default.xml.
The evictable javadoc mentions jittering by a nanosecond, but it's being keyed off of Time.monotonicNow, which is milliseconds. We might in fact want to key off of System.nanoTime for fewer collisions.
I think evictOne would be clearer if you used TreeSet#pollFirst rather than an iterator.

    Iterator<Entry<Long, ClientMmap>> iter =
              evictable.entrySet().iterator();

This has 10 spaces of continuation indent, where elsewhere in the file you use a double indent of 4.

BlockReaderLocal

There is a remaining TODO for blocks bigger than 2GB; want to file a follow-on JIRA for this?
readZeroCopy catches and re-sets the interrupted status; does something else check this later?
Is it worth re-trying the mmap after a CacheCleaner period, in case some space has been freed up in the cache?
The clientMmap from the manager can be null if the cache is full. I don't see a check for this case.

Tests

Would like to see some tests for cache-eviction behavior.
How about a Java test without a backing buffer?
The JNI test has some commented-out fprintfs.

Colin Patrick McCabe
added a comment - 15/Aug/13 02:35 brandon wrote: 1. DFSClient: looks like all the DFSClient instances share the same ClientMmapManager instance. If this is the case, why not have one static ClientMmapManager with a refcount to it, and remove ClientMmapManagerFactory class and variable mmapManager?
I think it's better to have the refcount and manager instance encapsulated as private data inside an object, rather than floating around in the DFSClient class, because it prevents errors where someone might access the field without updating the reference count properly.
2. HdfsZeroCopyCursor: might want to also initialize allowShortReads in the constructor.
Users can't create instances of HdfsZeroCopyCursor directly (it's package-private). DFSInputStream#createZeroCopyCursor creates them. We could start adding booleans to this function, but it seems clearer for people to just use setAllowShortReads. The kind of mess we have with FileSystem#create where there are a dozen different overloads and nobody can keep them straight is an antipattern.
... Not sure which case is more expected by the users, shortReads allowed or disallowed.
That's a good question. My experience has been that many developers don't handle short reads very well (sometimes including me). It's just another corner case that they have to remember to handle, and if they're not FS developers they often don't even realize that it can happen. So I have defaulted it to off unless it's explicitly requested.
4. DFSInputStream: remove unused import and add debug level check for DFSClient.LOG.debug().
OK
5. TestBlockReader: Assume.assumeTrue(SystemUtils.IS_OS_UNIX), guess you meant IS_OS_LINUX
mmap is present and supported on other UNIXes besides Linux
6. test_libhdfs_zerocopy.c: remove repeated
fixed
7. TestBlockReaderLocal.java: remove unused import
ok
8. please add javadoc to some classes, e.g., ClientMmap, ClientMmapManager
ok
andrew wrote: [hdfs-default.xml] Has some extra lines of java pasted in.
fixed
Let's beef up the [zerocopycursor] javadoc
I added an example.
read() javadoc: EOF here refers to an EOF when reading a block, not EOF of the HDFS file. Would prefer to see "end of block".
EOF is only thrown at end-of-file, as described in the JavaDoc.
Would like to see explicit setting of allowShortReads to false in the constructor for clarity.
done
serialVersionUID should be private
ok
Maybe rename put to unref or close? It's not actually "putting" in the data structure sense, which is confusing.
renamed to unref
let's not call people "bad programmers", just say "accidentally leaked references".
I changed this to "code which leaks references accidentally" to make it more boring
unmap: add to javadoc that it should only be called if the manager has been closed, or by the manager with the lock held.
I added "Should be called with the ClientMmapManager lock held"
Need a space before the "=".
ok
Let's add some javadoc on... why it's important to cache [mmaps]
added to ClientMmapManager
I think fromConf-style factory methods are more normally called get, e.g. FileSystem.get.
FileSystem#get uses a cache, whereas ClientMmapManager#fromConf does not. I think it would be confusing to name them similarly...
Why is the CacheCleaner executor using half the timeout for the delay and period?
Half the timeout period is the minimum period for which we can ensure that we time out mmaps on time. Think about if we used the timeout itself as the period. In that case, we might be 1 second away from the 15-minute (or whatever) expiration period when the cleaner thread runs. Then we have to wait another 15 minutes, effectively doubling the timeout.
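A back-of-the-envelope way to see this argument, under the simple assumption that an mmap becomes unreferenced just after a cleaner pass and is unmapped on the first pass after its timeout expires:

```java
// Illustration of the worst-case staleness: an mmap that goes unreferenced
// right after a CacheCleaner pass survives its full timeout plus up to one
// cleaner period before the next pass collects it.
class CleanerMath {
    static long worstCaseLifetimeMs(long timeoutMs, long periodMs) {
        // expires timeoutMs after going unreferenced, then waits up to one
        // more period for the next cleaner pass
        return timeoutMs + periodMs;
    }
}
```

With a 15-minute timeout, a period equal to the timeout allows a stale mmap to live nearly 30 minutes; a period of timeout/2 caps the overshoot at 50%.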
We might in fact want to key off of System.nanoTime for fewer collisions
Good point; changed.
I think evictOne would be clearer if you used TreeSet#pollFirst rather than an iterator.
yeah, changed
This has 10 spaces, where elsewhere in the file you use a double-indent of 4.
ok, I'll make it 4
Remaining TODO for blocks bigger than 2GB, want to file a follow-on JIRA for this?
filed
readZeroCopy catches and re-sets the interrupted status, does something else check this later?
No. It would only happen if some third-party software delivered an InterruptedException to us. In that case the client is responsible for checking and doing something with the InterruptedException (or not). This all happens in the client thread.
Is it worth re-trying the mmap after a CacheCleaner period in case some space has been freed up in the cache?
BlockReader objects get destroyed and re-created a lot. For example, a long seek will clear the current block reader. So will reading the rest of the block. So I doubt that re-trying will be useful, given that the block reader will probably be gone by that time anyway. Also, hopefully we are not seeing a lot of IOExceptions while trying to mmap, because that indicates some configuration issues.
The clientMmap from the manager can be null if the cache is full. I don't see a check for this case.
That's a very good catch! Fixed.
JNI test has some commented out fprintfs
fixed
Would like to see some tests for cache eviction behavior
It's tricky because of the singleton nature. I'll look at creating another junit test where the limits are set lower. Similar with no backing buffer.

Andrew Wang
added a comment - 15/Aug/13 23:58 Great, I think we're very close.
* comapre the genStamp here. That way, we will not return a shorter
typo
return new ClientMmapManager(conf.getInt(DFS_CLIENT_MMAP_CACHE_SIZE,
DFS_CLIENT_MMAP_CACHE_SIZE_DEFAULT),
conf.getLong(DFS_CLIENT_MMAP_CACHE_TIMEOUT_MS,
DFS_CLIENT_MMAP_CACHE_TIMEOUT_MS_DEFAULT));
weird spacing
I think the CacheCleaner delay and period could still be improved. An initial delay of timeout still makes sense to me, and maybe we could introduce a conf parameter that affects the period. I doubt that this will be twiddled all that much, so let's set good defaults. A more aggressive default period of timeout/4 might make more sense.
javac warnings still need to be addressed (looks like via the ignore file)
I'd like to see yet more class javadoc for ZeroCopyCursor (man page level). I bet many HDFS users will be eager to use this more efficient interface but won't necessarily understand e.g. ownership of the fallback buffer, and how to handle short vs. not short reads. It's also important since there isn't a design doc. We can defer this to a follow-on JIRA though if you prefer.

Colin Patrick McCabe
added a comment - 16/Aug/13 19:28 I'll add a config variable for the number of times per expiration period the cleaner thread should run.
javac warnings still need to be addressed (looks like via the ignore file)
There is no ignore file for javac warnings. test-patch.sh just has an awk script that compares the number after the patch to the number before. Unfortunately, we don't have any way to avoid these warnings in the munmap.
Since there's already a lot of javadoc, let's defer adding more documentation to another JIRA. An .apt.vm file might be the right place for this.

Andrew Wang
added a comment - 19/Aug/13 19:23 +1, great work Colin.
I'll commit this to the HDFS-4949 branch tomorrow morning unless there are further review comments. This could go into trunk, but much of the benefit comes from the DN performing checksumming and holding mlocks (which is hitting the HDFS-4949 branch first).

Owen O'Malley
added a comment - 05/Sep/13 21:39 This API seems overly complicated:
Users want a single interface to read their files, regardless of whether it is zero copy, local, or remote.
Because the base class FSDataInputStream implements the marker class, all filesystems will have the marker class regardless of whether they support zero copy.
The various set/get methods are confusing as to who is supposed to set and who is supposed to get.
Therefore, I'd propose a follow up jira where the API change is extend FSDataInputStream with:
ByteBuffer readByteBuffer(int length) throws IOException;
Relative to the current API:
It is always a partial read. Obviously, zero copy
It is supported for all filesystems and the default implementation fills a byte buffer from the underlying stream to the desired length.
It never returns a byte buffer with remaining == 0 except at end of file.
The returned ByteBuffer is only guaranteed to be valid until the next read on the same stream. It will be reused by the next readByteBuffer.
This lets us change the readers to readByteBuffer and take advantage of zero copy if it is available without making two completely different code paths and switching between them when we get an exception.
Thoughts?
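A rough sketch of what such a default, copying implementation might look like (a hypothetical helper, not actual Hadoop code): it fills a byte[]-backed buffer from the underlying stream and returns fewer than the requested bytes only at end of file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Default (copying) fallback for the proposed readByteBuffer API: loop until
// the requested length is filled or EOF is hit.
class ReadByteBufferSketch {
    static ByteBuffer readByteBuffer(InputStream in, int length)
            throws IOException {
        byte[] buf = new byte[length];
        int filled = 0;
        while (filled < length) {
            int n = in.read(buf, filled, length - filled);
            if (n < 0) break;        // EOF: return whatever was read
            filled += n;
        }
        // remaining() == 0 only happens at end of file, per the contract
        return ByteBuffer.wrap(buf, 0, filled);
    }
}
```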

Owen O'Malley
added a comment - 05/Sep/13 23:57 Ok, after a bit more thought, I realize that we can't assume the user is done with the ByteBuffer when they call read the next time. I'd like to propose changing the API in FSDataInputStream to the following:
/**
* Read the file from offset to offset+length-1 into a series of ByteBuffers. If there are fewer
* than length bytes available in the file, an IOException will be thrown. In all other cases, the
* combined length of all of the returned buffers will precisely match the requested length.
* The ByteBuffers may be either backed by mmapped memory or byte[] depending on whether the data is
* available locally.
* @param offset the first offset of the file to read
* @param length the total combined length of the file to read
* @return A list of ByteBuffers that precisely cover the requested region of the file.
*/
public ByteBuffer[] readByteBuffers(long offset, int length) throws IOException;
/**
* Add the provided buffers to the FSDataInputStream's pool of byte buffers that are
* available to be reused. The client should not hold references to buffers after they
* have been released.
* @param buffers buffers returned from previous calls to readByteBuffers
*/
public void releaseByteBuffers(ByteBuffer... buffers);
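The proposed read/release contract could be exercised roughly like this toy, in-memory stand-in (the names and pooling policy here are hypothetical, and a real DFS could return several buffers for a range that crosses a block boundary):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration of the proposed contract: buffers handed out by
// readByteBuffers are returned to a pool via releaseByteBuffers and reused.
class BufferPoolSketch {
    private final Deque<ByteBuffer> pool = new ArrayDeque<>();
    private final byte[] data;                 // stands in for file contents

    BufferPoolSketch(byte[] data) { this.data = data; }

    ByteBuffer[] readByteBuffers(long offset, int length) {
        if (offset + length > data.length) {
            throw new IllegalArgumentException("fewer than length bytes available");
        }
        ByteBuffer b = (pool.isEmpty() || pool.peek().capacity() < length)
            ? ByteBuffer.allocate(length) : pool.pop();
        b.clear();
        b.put(data, (int) offset, length).flip();
        // A real DFS may return several buffers when the range crosses a
        // block boundary; this in-memory stand-in always returns one.
        return new ByteBuffer[] { b };
    }

    void releaseByteBuffers(ByteBuffer... buffers) {
        for (ByteBuffer b : buffers) {
            pool.push(b);                      // available for reuse
        }
    }
}
```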

Andrew Wang
added a comment - 06/Sep/13 00:39 Hey Owen, thanks for taking a look,
The new zero-copy read API actually does support both normal and zero-copy reads, based on setting the fallback buffer. This allows flexibility for users:
An unsophisticated user just sets the fallback buffer once for the cursor, and then calls away at the new API. This will ZCR when possible, and falls back to a normal copying read. There isn't a need to switch back and forth. This will also, by default, not return a short read until EOF.
A sophisticated user might want to know if a read involves copying or not, so they can take different actions for fast vs. slow paths. This user would not set a fallback buffer and also would enable short reads. This would only return zero-copy data, and the user has to deal with switching read paths and so on.
This togglable behavior was a request from our Impala team, and Arun made a similar request for "is cached" on HDFS-4949 : https://issues.apache.org/jira/browse/HDFS-4949?focusedCommentId=13736019&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13736019
I'll also note that it isn't easy for apps to deal with multiple returned buffers, they'll probably basically end up doing the current fallback path of copying them all into a buffer. I think that being a DFS our situation is also different from traditional scatter/gather APIs, since each buffer being collected can have drastically different costs (local zero-copy vs. remote is something like 100x)
Let me know if this makes sense, I might have missed something in your proposal.
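The two patterns can be modeled with a toy stand-in (the method names echo the cursor API under discussion, but this class is purely illustrative and simulates an mmap with an in-memory array):

```java
import java.nio.ByteBuffer;

// Two usage patterns for zero-copy reads, modeled with an in-memory "mmap".
class ZcrPatterns {
    static final byte[] FILE = new byte[]{1, 2, 3, 4, 5, 6, 7, 8};

    // Unsophisticated user: a fallback buffer is in play, so the read always
    // succeeds, copying only when no mmap is available.
    static ByteBuffer readWithFallback(boolean mmapAvailable, int length) {
        if (mmapAvailable) {
            return ByteBuffer.wrap(FILE, 0, length); // zero-copy view
        }
        ByteBuffer fallback = ByteBuffer.allocate(length);
        fallback.put(FILE, 0, length).flip();        // copying fallback read
        return fallback;
    }

    // Sophisticated user: no fallback buffer; when zero copy is unavailable
    // the caller is told, and switches to its own slow path.
    static ByteBuffer readZeroCopyOnly(boolean mmapAvailable, int length) {
        if (!mmapAvailable) {
            throw new UnsupportedOperationException("zero copy unavailable");
        }
        return ByteBuffer.wrap(FILE, 0, length);
    }
}
```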

Owen O'Malley
added a comment - 06/Sep/13 02:17
An unsophisticated user just sets the fallback buffer once for the cursor, and then calls away at the new API.
But to the user, it isn't clear if or how to set the fallback buffer. What is the scope of the fallback buffer? How large does it need to be? Should it be direct or byte array? The user wants to read their HDFS data via ByteBuffer (especially if it is available via zero copy). Obviously, my concern is not abstract. I was thinking about how to use this for implementing a reader for the ORC file format and I absolutely need to manage the byte buffers and have a single path for reading both cached and non-cached files. It is much much better API design to have the filesystem create it as needed rather than making the application preemptively create "fallback" byte buffers for the file system to use.
Let's say I need to read 100 MB that may cross a block boundary, under the current API to safely read it, I need to do:
FSDataInputStream in = fs.open(path);
in.seek(offset);
List<ByteBuffer> result = new ArrayList<ByteBuffer>();
try {
ZeroCopyCursor cursor = in.createZeroCopyCursor();
// don't fail if we cross block boundaries
cursor.setAllowShortReads(true);
long done = 0;
while (done < len) {
// can't reuse previous buffers since they are still used in result
cursor.setFallbackBuffer(ByteBuffer.allocate(len));
cursor.read(len - done);
ByteBuffer buffer = cursor.getData();
done += buffer.remaining();
result.add(buffer);
}
} catch (ZeroCopyUnavailableException zcu) {
ByteBuffer buffer = ByteBuffer.allocate(len);
IOUtils.readFully(in, buffer.array(), buffer.arrayOffset(), len);
buffer.limit(len);
result.add(buffer);
}
compared to my proposed:
ByteBuffer[] result = in.readByteBuffers(offset, len);
...
in.releaseByteBuffers(result);
Am I missing something? This is a single read and of course real clients are going to do many of these throughout their code.
Using exceptions for nominal conditions like ZeroCopyUnavailable is bad practice and very expensive since it builds the stack trace.
By requiring allocation of the fallback buffer in all cases, extra allocations will be done.
A sophisticated user might want to know if a read involves copying or not,
At scheduling time they want to know how "local" the data is (cached, local, on-rack, off-rack), but at read time they just want to get the bytes. Making a distinction at read time just complicates the API. Furthermore, if the data crosses a block boundary
I'll also note that it isn't easy for apps to deal with multiple returned buffers
You'll need to return multiple buffers if the read request crosses a block boundary. Mmapped byte buffers can only read a single file. It is much better to have the entire request fulfilled in multiple byte buffers than have to loop externally. Of course in the vast majority of cases that don't cross block boundaries, it will be a single buffer.

Owen O'Malley
added a comment - 06/Sep/13 02:24 I'd also note that since ZeroCopyCursor is separate from the FSDataInputStream, it isn't clear that if I seek on the stream whether or not the cursor changes position also. Does read on the cursor move the stream? Creating the separate class of ZeroCopyCursor makes it appear they are separate. If they are in fact separate, you should probably add seek to the ZeroCopyCursor also.

Colin Patrick McCabe
added a comment - 06/Sep/13 02:50 The current API is generic and not HDFS-specific. You get a zero-copy cursor from FSDataInputStream#createZeroCopyCursor, and you read from it with ZeroCopyCursor#read. Then, when you're done, you close it with ZeroCopyCursor#close.
It also supports a fallback path. In order to have a fallback path, you must call ZeroCopyCursor#setFallbackBuffer. That provides the cursor with a fallback buffer which will be used when an mmap is unavailable.
The big problem with "well, just return a ByteBuffer, then!" is that ByteBuffer has no close method. So it's unclear how the mmap would ever be released. It is not adequate to rely on the GC, since we are talking about file descriptors here. Furthermore, there are a lot of applications where "valid until next read() call or close of stream" is not good enough. Sometimes people want to do multiple reads and look at the results for each, and we should accommodate them.
Many prospective users of zero-copy are not interested in dealing with many small buffers. They want to deal with either a single big contiguous mmap'ed memory area, or just do reads the standard way, performing many small reads that only access as much as they need. The ability to turn off "fallback mode" (where we fall back to copying to service your read) was very specifically added in response to these users.
I think any reasonable design will end up looking a lot like what we already did in this JIRA. I suppose instead of separating createZeroCopyCursor and read, we could have combined them, but that would have resulted in a function call with a lot more parameters. The current design also prevents the scenario where more and more function variants get added over time with more and more parameters-- the kind of function overload hell we landed in with FileSystem#create.
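The "no close method" problem can be illustrated with a minimal closeable handle (hypothetical names; a real implementation would decrement a reference count and let the manager munmap once it reaches zero):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

// A ByteBuffer cannot release its own backing mmap, so the resource must be
// owned by a separate, explicitly closeable handle.
class MmapHandle implements AutoCloseable {
    private final ByteBuffer buffer;      // would be a MappedByteBuffer
    private final AtomicBoolean released; // observable stand-in for munmap

    MmapHandle(ByteBuffer buffer, AtomicBoolean released) {
        this.buffer = buffer;
        this.released = released;
    }

    ByteBuffer getData() {
        return buffer.asReadOnlyBuffer(); // caller sees data, not the resource
    }

    @Override
    public void close() {
        // A real implementation would drop a reference count here and let the
        // manager munmap when it hits zero; relying on the GC is not enough,
        // because each mapping pins a file descriptor.
        released.set(true);
    }
}
```

With try-with-resources, release happens deterministically at the end of the block, which a bare ByteBuffer cannot offer.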

Owen O'Malley
added a comment - 06/Sep/13 22:57 Colin, please read my suggestion and my analysis of the difference before commenting.
The simplified API absolutely provides a means to releasing the ByteBuffer and yet it is 2 lines long instead of 20. Furthermore, I didn't even realize that I was supposed to close the zero copy cursor, since it just came in from closable.
My complaint stands. The API as currently in this branch is very error-prone and difficult to explain. Using it is difficult and requires complex handling including exception handlers to handle arbitrary file systems.

Your proposed API doesn't address one of the big asks we had when designing ZCR, which is to provide a mechanism for notifying the user that he cannot get an mmap. As I mentioned earlier, for performance reasons, many users who might like to have access to a 128 MB mmap segment do not want to copy into a 128MB backing buffer. Doing such a large copy would blow the L2 cache (and possibly the page cache), and rather than improving performance, might degrade it. Similarly, users don't want to get multiple byte buffers back-- the big advantage of mmap is getting a single buffer back (in the cases where that's possible).

What if the user wants to use a direct byte buffer as his fallback buffer? With the current code, that is easy-- I just call setFallbackBuffer(ByteBuffer.allocateDirect(...)). With your proposed API, there's no way to do this.

Creating a new ByteBuffer for each read is going to be slower than reusing the same ByteBuffer-- especially for direct ByteBuffers. Sure, we could have some kind of ByteBuffer cache inside the FSDataInputStream, but that's going to be very complicated. What if someone needs a ByteBuffer of size 100 but we only have ones of size 10 and 900 in the cache? Do we use the big one for the small read or leave it around? How long do we cache them? Do we prefer the direct ones? And so on. Really, the only design that makes sense is having the user pass in the fallback buffer. We do not want to be re-inventing malloc inside FSDataInputStream.

The design principles of the current API are:

some users want a fallback path, and some don't. We have to satisfy both.

we don't want to manage buffers inside FSDataInputStream. It's a messy and hard problem with no optimal solutions that fit all cases.

nobody wants to receive more than one buffer in response to a read.

most programmers don't correctly handle short reads, so there should be an option to disable them.

One thing that we could and should do is provide a generic fallback path that is independent of filesystem.

Colin Patrick McCabe
added a comment - 07/Sep/13 00:48

I've been thinking about this, and I think it might be possible to improve on the current API.

Maybe all we need is something like this:

in DFSInputStream:
ZeroBuffer readZero(ByteBuffer fallback, int maxLength);
ZeroBuffer:
implements Closeable (for close)
implements eof() (returns true if there are no more bytes to read)
implements all ByteBuffer methods by forwarding them to the enclosed ByteBuffer

This API would be implemented for every filesystem, not just HDFS.

The constraints here would be:

maxLength >= 0

you can't reuse a fallback buffer until you close the associated ZeroBuffer (we can enforce this by throwing an exception in this case)

ZeroBuffers are immutable once created-- until you call close on them.

This gets rid of a few of the awkward issues with the current API, which I think are:

the current API requires users to special-case HDFS (since other FSes throw ZeroCopyUnavailableException)

the current API shares the file position between the cursors and the stream, which is unintuitive.

the current API puts the read call inside the cursor object, which is different than the other read methods.

Colin Patrick McCabe
added a comment - 07/Sep/13 01:33
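As an illustration of the ZeroBuffer idea above, here is a minimal hypothetical sketch (this class is not in the HDFS-4949 branch): it forwards to the enclosed ByteBuffer, supports eof(), and becomes unusable once closed, which is one way the "you can't reuse a fallback buffer until you close the associated ZeroBuffer" constraint could be enforced.

```java
import java.io.Closeable;
import java.nio.ByteBuffer;

// Hypothetical sketch of the proposed ZeroBuffer: wraps the mmap'ed
// (or fallback) ByteBuffer, forwards accessors to it, and runs a
// cleanup action (e.g. munmap, or releasing the fallback) on close.
class ZeroBuffer implements Closeable {
    private ByteBuffer buf;         // null once closed
    private final Runnable cleanup; // what "releasing" means is FS-specific

    ZeroBuffer(ByteBuffer buf, Runnable cleanup) {
        this.buf = buf;
        this.cleanup = cleanup;
    }

    // A real version would forward the full ByteBuffer surface;
    // two representative methods are shown here.
    public int remaining() { return open().remaining(); }
    public byte get()      { return open().get(); }

    // true if there are no more bytes to read
    public boolean eof()   { return !open().hasRemaining(); }

    private ByteBuffer open() {
        if (buf == null) throw new IllegalStateException("ZeroBuffer is closed");
        return buf;
    }

    @Override
    public void close() {
        if (buf != null) {
            buf = null;   // further use throws, so the fallback can be reused
            cleanup.run();
        }
    }
}
```

Whether the reuse constraint can always be enforced this cheaply is an open question; the sketch only shows the shape of the object.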

If an application needs to determine whether zero copy is available, it should be able to do so without catching exceptions.

What happens if the user reads across a block boundary? Most applications don't care about block boundaries and shouldn't have to add special code to cut their requests to block boundaries. That will impose inefficiencies.

The cost of a second level of indirection (app -> ZeroCopy -> ByteBuffer) in the inner loop of the client seems prohibitive.

Requiring pre-allocation of a fallback buffer that hopefully is never needed is really problematic. I'd propose that we flip this around to a factory.

You either need to support short reads or return multiple bytebuffers. I don't see a way to avoid both unless applications are forced to never read across block boundaries. That would be much worse than either of the other options. I'd prefer to have multiple ByteBuffers returned, but if you hate that worse than short reads, I can handle that.

It isn't clear to me how you plan to release mmapped buffers, since Java doesn't provide an API to do that. If you have a mechanism to do that, we need a releaseByteBuffer(ByteBuffer buffer) to release it.

I'd propose that we add the following to FSDataInputStream:

/**
* Is the current location of the stream available via zero copy?
*/
public boolean isZeroCopyAvailable();
/**
* Read from the current location at least 1 and up to maxLength bytes. In most situations, the returned
* buffer will contain maxLength bytes unless either:
* * the read crosses a block boundary and zero copy is being used
* * the stream has fewer than maxLength bytes left
* The returned buffer will either be one that was created by the factory or a MappedByteBuffer.
*/
public ByteBuffer readByteBuffer(ByteBufferFactory factory, int maxLength) throws IOException;
/**
* Allow application to manage how ByteBuffers are created for fallback buffers.
*/
public interface ByteBufferFactory {
ByteBuffer createBuffer(int capacity);
}

Owen O'Malley
added a comment - 10/Sep/13 17:19

Thanks, Colin, for giving more details of the design. Your new API is much better, but a few issues remain.
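For illustration, one implementation of the ByteBufferFactory interface proposed above might look like the following (the class below is hypothetical; it only shows that the factory lets the application, not FSDataInputStream, decide whether fallback buffers are heap-backed or direct):

```java
import java.nio.ByteBuffer;

// The proposed interface, as in Owen's comment above.
interface ByteBufferFactory {
    ByteBuffer createBuffer(int capacity);
}

// Hypothetical implementation: the application picks the buffer kind.
class SimpleBufferFactory implements ByteBufferFactory {
    private final boolean direct;

    SimpleBufferFactory(boolean direct) { this.direct = direct; }

    @Override
    public ByteBuffer createBuffer(int capacity) {
        return direct ? ByteBuffer.allocateDirect(capacity)
                      : ByteBuffer.allocate(capacity);
    }
}
```

A caller wanting direct fallback buffers would simply pass new SimpleBufferFactory(true) to readByteBuffer.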

If an application needs to determine whether zero copy is available, it should be able to do so without catching exceptions.

This seems like a fairly arbitrary requirement. What's bad about exceptions for this purpose? Keep in mind, the application always has the choice of providing a fallback buffer, if it doesn't care about whether zero copy is used or not.

public boolean isZeroCopyAvailable();

Unfortunately, zero copy being available is not a binary thing. It might be available sometimes, but not other times. This would be the case if, for example, half of the file was local (and could be mmap'ed), and the other half was only available from remote replicas. We also might try to mmap a block and fail. mmap can fail like any other filesystem operation, because of resource limits or some other reason. So we can't have an API like this.

What happens if the user reads across a block boundary? Most applications don't care about block boundaries and shouldn't have to add special code to cut their requests to block boundaries. That will impose inefficiencies.

This is the point of the fallback buffer-- it will be used when a block boundary, non-local block, etc. prevents the request being fulfilled with an mmap. We've made this as efficient as it can be made.

The cost of a second level of indirection (app -> ZeroCopy -> ByteBuffer) in the inner loop of the client seems prohibitive.

Well, what are the alternatives? We can't subclass java.nio.ByteBuffer, since all of its constructors are package-private. There has to be a close method on whatever we return-- I think that should be clear from the discussion thus far. We could make the close method part of FSDataInputStream, which I think you've suggested in the past. That seems like a strange and non-object-oriented API to me. It also would prevent tools like FindBugs from warning about unclosed mmap objects, which they should do when we inherit from Closeable. It also forces us to maintain some kind of hash table or other lookup structure to find out what mmap information is associated with each byte buffer. The cost of such a lookup is surely higher than a single indirection, especially since after the first access, this will all be in the CPU cache anyway.

Requiring pre-allocation of a fallback buffer that hopefully is never needed is really problematic. I'd propose that we flip this around to a factory.

I'm sort of getting the impression that you plan on having super-huge buffers, which scares me. It's often faster to read into a small buffer many times than into a huge one once, because you blow the cache with such a big read. The factory API also gives me the impression that you plan on allocating a new buffer for each read, which would also be problematic. Please correct me if I'm wrong about these impressions.

If we have a "Factory" object, it needs to have not only a "get" method, but also a "put" method, where ByteBuffers are placed back when we're done with them. At that point, it becomes more like a cache. This might be a reasonable API, but I wonder if the additional complexity is worth it.

You either need to support short reads or return multiple bytebuffers. I don't see a way to avoid both unless applications are forced to never read across block boundaries. That would be much worse than either of the other options. I'd prefer to have multiple ByteBuffers returned, but if you hate that worse than short reads, I can handle that.

We should do short reads.

It isn't clear to me how you plan to release mmapped buffers, since Java doesn't provide an API to do that. If you have a mechanism to do that, we need a releaseByteBuffer(ByteBuffer buffer) to release it.

Colin Patrick McCabe
added a comment - 10/Sep/13 18:58

Check the current code in the HDFS-4949 branch to see how we did it.
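To make the "hash table or other lookup structure" trade-off concrete: a stream-side releaseByteBuffer(buffer) would need roughly the bookkeeping below. This is a hypothetical sketch, not code from the branch; "ReleaseTracker" and its methods are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch: an identity map from each buffer handed out by
// readByteBuffer to its cleanup action (munmap for zero-copy buffers,
// a factory/pool release for fallback buffers).
class ReleaseTracker {
    private final Map<ByteBuffer, Runnable> cleanups = new IdentityHashMap<>();

    void track(ByteBuffer buf, Runnable cleanup) {
        cleanups.put(buf, cleanup);
    }

    void releaseByteBuffer(ByteBuffer buf) {
        Runnable cleanup = cleanups.remove(buf);
        if (cleanup == null) {
            throw new IllegalArgumentException("buffer was not handed out by this stream");
        }
        cleanup.run();
    }
}
```

This is the lookup-per-release cost Colin is weighing against the single extra dereference of a wrapper object.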

If an application needs to determine whether zero copy is available, it should be able to do so without catching exceptions.

I thought about this a little more, and I think what you want is probably to look at the read statistics to see how many bytes were read via mmap. If you're using the fallback path, you shouldn't need to care about whether each individual read was mmap or not. Does that make sense for your use case?

I've been thinking a little more about the need to pre-allocate a ByteBuffer that might not be needed (in my proposed API). I guess to get an idea of whether this is a problem, we need to know how big these buffers will usually be, and how many threads there will be. I envisioned these buffers just being a meg or two. (That's why I made the "huge buffers scare me" comment). Anyone care to comment on whether this assumption was accurate?

Colin Patrick McCabe
added a comment - 10/Sep/13 20:06

I guess I should comment a little more about munmap in Java. There are basically two ways to do it: a sun-specific API, and using JNI. We opted for the former to avoid having a JNI dependency.

There are also some comments earlier in this JIRA: see https://issues.apache.org/jira/browse/HDFS-4953?focusedCommentId=13708905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13708905

I envisioned these buffers just being a meg or two. (That's why I made the "huge buffers scare me" comment). Anyone care to comment on whether this assumption was accurate?

I agree that this is best for performance. In benchmarks I've done of the existing SCR implementation, I found that the max throughput was achieved for buffers of 1MB, 2MB, and 4MB, and any larger started to see a performance hit of about 30-40% due to increased cache misses.

Todd Lipcon
added a comment - 10/Sep/13 20:32

The external ByteBuffer "factory/cache/pool" idea is worth exploring. It seems to me that a ByteBufferPool interface (e.g. with ByteBuffer getBuffer(int) and void recycle(ByteBuffer...) methods) would help the API (ByteBuffer readByteBuffer(ByteBufferPool, ...)), as a ByteBufferPool implementation can handle client-specific needs and keep the buffer cache logic out of FSDataInputStream.

Luke Lu
added a comment - 10/Sep/13 21:32
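A minimal sketch of the ByteBufferPool Luke describes might look like this (the pooling policy below is illustrative only, not from any patch): getBuffer(int) reuses the smallest cached buffer that fits, and recycle(...) returns buffers for later reads.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical ByteBufferPool: keeps the buffer-cache policy (which
// size to hand out, how long to keep buffers) out of FSDataInputStream.
class SimpleByteBufferPool {
    // capacity -> idle buffers of exactly that capacity
    private final TreeMap<Integer, Deque<ByteBuffer>> idle = new TreeMap<>();

    public synchronized ByteBuffer getBuffer(int size) {
        // smallest cached buffer with capacity >= size, if any
        Map.Entry<Integer, Deque<ByteBuffer>> e = idle.ceilingEntry(size);
        if (e != null) {
            ByteBuffer buf = e.getValue().pop();
            if (e.getValue().isEmpty()) {
                idle.remove(e.getKey());
            }
            buf.clear();
            return buf;
        }
        return ByteBuffer.allocate(size);
    }

    public synchronized void recycle(ByteBuffer... buffers) {
        for (ByteBuffer buf : buffers) {
            idle.computeIfAbsent(buf.capacity(), k -> new ArrayDeque<>()).push(buf);
        }
    }
}
```

A real pool would also need an eviction policy and a decision about direct vs heap buffers; the point here is only that this logic lives entirely in application code.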

Actually, it is a standard best practice. Building exception objects is expensive and handling exceptions is error-prone.

Unfortunately, zero copy being available is not a binary thing. It might be available sometimes, but not other times.

I clearly stated in the previous comment it was about whether the zero copy was available AT THE CURRENT POINT IN THE STREAM.

This is the point of the fallback buffer-- it will be used when a block boundary, non-local block, etc. prevents the request being fulfilled with an mmap. We've made this as efficient as it can be made.

No, you haven't made it efficient at all. If the code is going to do a buffer copy and use the fallback buffer if it crosses a block, the performance will be dreadful. With my interface, you get no extra allocation, and no buffer copies.

The two reasonable strategies are:

short read to cut it to a record boundary

return multiple buffers

You can't make it easier than that, because the whole point of this API is to avoid buffer copies. You can't violate the goal of the interface because you don't like those choices.

Well, what are the alternatives? We can't subclass java.nio.ByteBuffer, since all of its constructors are package-private.

Agreed that you can't extend ByteBuffer. But other than close, you don't really need anything.

There has to be a close method on whatever we return

That is NOT a requirement. Especially when it conflicts with fundamental performance. As I wrote, adding a return method on the stream is fine.

I'm sort of getting the impression that you plan on having super-huge buffers, which scares me.

For ORC, I need to read typically 200 MB at a time. Obviously, if I don't need the whole range, I won't get it, but getting it in large sets of bytes is much much more efficient than lots of little reads.

The factory API also gives me the impression that you plan on allocating a new buffer for each read, which would also be problematic.

No, that is not the case. The point of the API is to avoid allocating the buffers if they aren't needed. The current API requires a buffer whether it is needed or not. Obviously the application will need a cache of the buffers to reuse, but the factory lets them write efficient code.

If we have a "Factory" object, it needs to have not only a "get" method, but also a "put" method, where ByteBuffers are placed back when we're done with them. At that point, it becomes more like a cache. This might be a reasonable API, but I wonder if the additional complexity is worth it.

Of course the implementations of the factory will have a release method. The question was just whether the FSDataInputStream needed to access the release method. If we add releaseByteBuffer to the stream, then we'd also need to add the release method to the factory.

Based on this, I'd propose:

/**
* Is the current location of the stream available via zero copy?
*/
public boolean isZeroCopyAvailable();
/**
* Read from the current location at least 1 and up to maxLength bytes. In most situations, the returned
* buffer will contain maxLength bytes unless either:
* * the read crosses a block boundary and zero copy is being used
* * the stream has fewer than maxLength bytes left
* The returned buffer will either be one that was created by the factory or a MappedByteBuffer.
*/
public ByteBuffer readByteBuffer(ByteBufferFactory factory, int maxLength) throws IOException;
/**
* Release a buffer that was returned from readByteBuffer. If the buffer was created by the factory
* it will be returned to the factory.
*/
public void releaseByteBuffer(ByteBufferFactory factory, ByteBuffer buffer);
/**
* Allow application to manage how ByteBuffers are created for fallback buffers. Only buffers created by
* the factory will be released to it.
*/
public interface ByteBufferFactory {
ByteBuffer createBuffer(int capacity);
void releaseBuffer(ByteBuffer buffer);
}

This will allow applications to:

determine whether zero copy is available for the next read

use the same read interface for all filesystems and files, with zero copy used when available

no extra buffer copies

no bytebuffers are allocated if they are not needed

applications have to deal with short reads, but only get a single byte buffer

allow applications to create buffer managers that reuse buffers

allow applications to control whether direct or byte[] byte buffers are used

Owen O'Malley
added a comment - 10/Sep/13 22:17
The example code would look like:

FSDataInputStream in = fs.open(path);
in.seek(100*1024*1024);
List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();
long remaining = 100 * 1024 * 1024;
while (remaining > 0) {
  // ask for at most the bytes still outstanding
  ByteBuffer buf = in.readByteBuffer(fallbackFactory, (int) Math.min(remaining, 100*1024*1024));
  remaining -= buf.remaining();
  buffers.add(buf);
}
...
for (ByteBuffer buf: buffers) {
  in.releaseByteBuffer(fallbackFactory, buf);
}


Colin Patrick McCabe
added a comment - 10/Sep/13 23:53

Actually, it is a standard best practice. Building exception objects is expensive and handling exceptions is error-prone.
There's no bundling going on here. DFSClient throws the exception, not a remote daemon.
Can you be more clear about what the alternative is to an exception here? It seems like a null return would have the same issues. (unless I am missing something?)
I clearly stated in the previous comment [that isZeroCopyAvailable] was about whether the zero copy was available AT THE CURRENT POINT IN THE STREAM.
You can't know ahead of time whether your call to mmap will succeed. As I said, mmap can fail, for dozens of reasons. And of course blocks move over time. There is a fundamental "time of check, time of use" (TOCTOU) race condition in this kind of API.
[block boundary comments]
Unless I'm missing something, I think we're both in agreement here-- we return a short read when near the block boundary, to avoid using the fallback buffer.
[factory comments]
OK. I misinterpreted what you said earlier. I see now that you plan on reusing buffers.
Since I am not knowledgeable about ORC, I have to ask you, Owen: is it necessary to read 200 MB at a time to decode this file format? As Todd wrote, our benchmarks show that large reads may not be the most efficient due to cache effects.
[new API]
This proposal seems a lot better than the previous ones we've come up with.
There is already a method named FSDataInputStream#read(ByteBuffer buf) in FSDataInputStream. If we create a new method named FSDataInputStream#readByteBuffer, I would expect there to be some confusion between the two. That's why I proposed FSDataInputStream#readZero for the new name. Does that make sense?

public void releaseByteBuffer(ByteBufferFactory factory, ByteBuffer buffer);

If, instead of returning a ByteBuffer from the readByteBuffer call, we returned a ZeroBuffer object wrapping the ByteBuffer, we could simply call ZeroBuffer#close(). The ZeroBuffer itself could retain a reference to its factory (if it came from a factory), and the DFSInputStream (if it needed to be munmapped). This seems a lot more object-oriented to me. To make the client pass in this information seems error-prone and confusing.
I don't buy the argument that having a wrapper object imposes an "unacceptable performance hit." As I mentioned before, if you don't have a wrapper object, you have to do a hash table lookup in DFSInputStream#releaseByteBuffer to see if the buffer being released is one that we mmap'ed. A hash table lookup is surely slower than a single object dereference.
I'd like to get some other prospective zero-copy API users to comment on whether they like the wrapper object or the DFSInputStream#releaseByteBuffer approach better...
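The hash-table lookup Colin describes can be made concrete: without a wrapper object, releaseByteBuffer must consult an identity-based set to learn whether a buffer was mmap'ed (and needs munmap) or came from the fallback factory (and can be pooled). ReleaseTracker and its methods below are hypothetical names for illustration, not code from the patch.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Illustrative sketch of the dispatch cost under discussion: the
// stream records every buffer it mmap'ed, then does a hash lookup on
// each release -- the lookup a ZeroBuffer wrapper would replace with
// a single field dereference.
public class ReleaseTracker {
    // Identity semantics matter here: two distinct empty ByteBuffers
    // are equals(), so a plain HashSet would conflate them.
    private final Set<ByteBuffer> mmapped =
        Collections.newSetFromMap(new IdentityHashMap<ByteBuffer, Boolean>());

    /** Remember that this buffer is backed by an mmap. */
    public void recordMmapped(ByteBuffer buf) {
        mmapped.add(buf);
    }

    /** True if the released buffer must be munmapped rather than pooled. */
    public boolean isMmapped(ByteBuffer buf) {
        return mmapped.remove(buf);
    }
}
```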


Owen O'Malley
added a comment - 11/Sep/13 16:54
You can't know ahead of time whether your call to mmap will succeed. As I said, mmap can fail, for dozens of reasons. And of course blocks move over time. There is a fundamental "time of check, time of use" (TOCTOU) race condition in this kind of API.
Ok, I guess I'm fine with the exception assuming the user passed in a null factory. It will be expensive in terms of time, but it won't affect the vast majority of users.
is it necessary to read 200 MB at a time to decode the ORC file format?
Actually, yes. The set of rows that are written together is large (typically ~200MB) so that reading them is efficient. For a 100 column table, that means that you have all of the values for column 1 in the first ~2MB, followed by all of the values for column 2 in the next 2MB, etc. To read the first row, you need all 100 of the 2MB sections.
Obviously mmapping this is much more efficient, because the pages of the file can be brought in as needed.
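To make the arithmetic above concrete: with a ~200 MB stripe holding 100 column streams of ~2 MB laid end-to-end, decoding even the first row touches the head of every stream. The toy sketch below assumes equal-width column streams purely for illustration; real ORC stripes record per-stream lengths in a footer.

```java
// Toy offsets for the stripe layout described above: column i's values
// occupy [i * columnBytes, (i + 1) * columnBytes) within the stripe.
// Equal widths are an assumption for illustration only.
public class StripeOffsets {
    public static long columnStart(long stripeStart, long columnBytes, int column) {
        return stripeStart + column * columnBytes;
    }

    public static void main(String[] args) {
        long columnBytes = 2L * 1024 * 1024;   // ~2 MB per column stream
        int columns = 100;                     // ~200 MB stripe total
        // Reading row 0 touches the start of all 100 streams, spread
        // across the whole 200 MB -- which is why mmap (paging in only
        // the touched pages) beats materializing the stripe up front.
        for (int col = 0; col < columns; col++) {
            long start = columnStart(0L, columnBytes, col);
            assert start == (long) col * columnBytes;
        }
    }
}
```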
There is already a method named FSDataInputStream#read(ByteBuffer buf) in FSDataInputStream. If we create a new method named FSDataInputStream#readByteBuffer, I would expect there to be some confusion between the two. That's why I proposed FSDataInputStream#readZero for the new name. Does that make sense?
I see your point, but readZero, which sounds like it just fills zeros into a byte buffer, doesn't convey the right meaning. The fundamental action that the user is taking is in fact read. I'd propose that we overload it with the other read and comment it saying that this read supports zero copy while the other doesn't. How does this look?
/**
 * Read a byte buffer from the stream using zero copy if possible. Typically the read will return
 * maxLength bytes, but it may return fewer at the end of the file system block or the end of the
 * file.
 * @param factory a factory that creates ByteBuffers for the read if the region of the file can't be
 * mmapped.
 * @param maxLength the maximum number of bytes that will be returned
 * @return a ByteBuffer with between 1 and maxLength bytes from the file. The buffer should be released
 * to the stream when the user is done with it.
 */
public ByteBuffer read(ByteBufferFactory factory, int maxLength) throws IOException;
I'd like to get some other prospective zero-copy API users to comment on whether they like the wrapper object or the DFSInputStream#releaseByteBuffer approach better...
Uh, that is exactly what is happening. I'm a user who is trying to use this interface for a very typical use case of quickly reading bytes that may or may not be on the local machine. I also care a lot about APIs and have been working on Hadoop for 7.75 years.
If, instead of returning a ByteBuffer from the readByteBuffer call, we returned a ZeroBuffer object wrapping the ByteBuffer, we could simply call ZeroBuffer#close()
Users don't want to make interfaces for reading from some Hadoop type named ZeroBuffer. The user wants a ByteBuffer because it is a standard Java type. To make this concrete and crystal clear, I have to make Hive and ORC work with both Hadoop 1.x and Hadoop 2.x. Therefore, if you use a non-standard type I need to wrap it in a shim. That sucks. Especially, if it is in the inner loop, which this absolutely would be. I need a ByteBuffer because I can make a shim that always returns a ByteBuffer that works regardless of which version of Hadoop that the user is using.

Suresh Srinivas
added a comment - 12/Sep/13 04:46

There is general consensus that the existing APIs can be cleaned up. I like what Owen O'Malley is proposing. I am going to create a new JIRA where we can decide the final details of the API.

Colin Patrick McCabe
added a comment - 12/Sep/13 15:52

I agree. We should definitely do this cleanup.
Since ZCR is currently in the HDFS-4949 branch, I created a branch subtask for it at HDFS-5191 where we can continue the discussion.