Details

Description

Currently map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
The downsides are multiple:

Even with a few thousand files to process, the number of RPCs quickly becomes noticeable

The current implementation of getFileBlockLocations is too slow since each call results in 'search' in the namesystem. Assuming a few thousand input files it results in that many RPCs and 'searches'.

It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory, and return the block-locations for all files in that directory. We could eliminate both the per-file RPC and also the 'search' by a 'scan'.

When I tested this for terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...

Activity

Doug Cutting
added a comment - 08/May/09 18:27 An alternative to passing directories might be to pass a list of files. The request might get larger, but this is more precise, e.g., when only a subset of files in a directory will be used only that subset need be passed. Since globbing is client-side, this requires two round trips, one to list files and one to list their blocks, but that would still be a huge improvement over per-file RPC.

Doug Cutting
added a comment - 08/May/09 18:48 How about adding something like:
Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);
This would permit a glob-free job to get everything it needs in a single RPC, and a globbing job to do so with two RPCs.

Konstantin Shvachko
added a comment - 08/May/09 19:14 Currently getBlockLocations(src, offset, length) returns a class called LocatedBlocks, which contains a list of LocatedBlock belonging to the file.
public class LocatedBlocks implements Writable {
private long fileLength;
private List<LocatedBlock> blocks; // array of blocks with prioritized locations
}
The question is whether we should modify LocatedBlocks, which would include the map proposed by Doug, and extend the semantics of getBlockLocations() to handle directories, or whether we should introduce a new method (RPC) getBlockLocations(srcDir) returning a LocatedBlockMap.
Is there a reason to keep the current per-file getBlockLocations() if we had a more generic method?

dhruba borthakur
added a comment - 10/May/09 10:40 If we adopt the approach that Doug has suggested, then the namenode still has to search for each input path in the file system namespace. This approach still has the advantage that the number of RPC calls is reduced. If we adopt Arun's proposal that specifies a directory and the RPC call returns the splits of all the files in that directory, then it reduces the number of searches in the FS namespace as well as the number of RPC calls. I was kind-of leaning towards Arun's proposal, but Doug's approach is a little more flexible in nature, isn't it?

Arun C Murthy
added a comment - 11/May/09 18:50 Dhruba, I was thinking it was implicit in Doug's proposal that if one of the paths in the Path[] is a directory, then the new api would return block-locations of all its children (non-recursively?), which would satisfy the original requirement. Doug, can you please confirm?

Doug Cutting
added a comment - 11/May/09 23:39 > Doug, can you please confirm?
Yes, I had assumed that any directories in the request would be expanded. The goal is to have something we can call from FileInputFormat, which takes a list of patterns. When the patterns contain no wildcards, we should be able to create splits with a single RPC to the NameNode. So the semantics should match those of FileInputFormat in this case.

Konstantin Shvachko
added a comment - 12/May/09 02:16 > (non-recursively?)
I think the rpc call itself should not be recursive. It is like with ls: the getListing() call is non-recursive, but the client recursively calls getListing() on sub-directories.
The idea is to prevent people from mistakenly calling getBlockLocation("/") on large directory trees recursively, which may freeze the name-node for a long period of time.
A non-recursive variant should be sufficient to cover Arun's use case.

dhruba borthakur
added a comment - 12/May/09 14:35 Ok, so from what I can understand, here is the proposal:
Map<FileStatus, BlockLocation[]> listBlockLocations(Path[] inputPaths);
The "inputPaths" can be a set of files and/or directories. If one of the inputPaths is a directory, then all items inside that directory (only one level, not recursive) are listed and their block locations are returned by this call. If one of the inputPaths is a file, then its block locations are returned by this call. The FileStatus returned by this call should have the absolute path of the object being returned.
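These one-level expansion semantics can be sketched with a small self-contained model (hypothetical stand-in types, not HDFS code; following the later clarification in this thread, subdirectories are elided from the result):

```java
import java.util.*;

// Hypothetical sketch of the proposed listBlockLocations semantics:
// directories are expanded exactly one level, plain files contribute
// themselves, and returned paths are absolute.
public class ListBlockLocationsSketch {
    // Stand-in namespace: directory path -> child names; file path -> hosts.
    static Map<String, List<String>> dirs = new HashMap<>();
    static Map<String, String[]> fileBlocks = new HashMap<>();

    public static Map<String, String[]> listBlockLocations(List<String> inputPaths) {
        Map<String, String[]> result = new LinkedHashMap<>();
        for (String p : inputPaths) {
            if (dirs.containsKey(p)) {
                // one level only: children that are plain files
                for (String child : dirs.get(p)) {
                    String abs = p + "/" + child;   // absolute path, as proposed
                    if (fileBlocks.containsKey(abs)) {
                        result.put(abs, fileBlocks.get(abs));
                    }
                    // sub-directories are elided, not recursed into
                }
            } else if (fileBlocks.containsKey(p)) {
                result.put(p, fileBlocks.get(p));   // plain file input
            }
        }
        return result;
    }
}
```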

Doug Cutting
added a comment - 12/May/09 21:09 Dhruba: yes, that sounds right to me.
A further clarification: should subdirectories be included, with empty block lists, or elided? My hunch is to eliminate them, so that every FileStatus returned is for a plain file--no directories. Does that sound right to others?

Raghu Angadi
added a comment - 28/May/09 22:56
I see why the interface takes an array of paths. But I am not sure why it returns a map (not that there is anything wrong with it). This is probably the only RPC returning a map in Hadoop.
How does a user figure out which were valid and which were invalid/non-existent/empty paths? Maybe the user does not care?
'getBlockLocations()' returns the blocks (sort of) sorted w.r.t. the client. Should this interface do that too? The M/R use case does not need them sorted.
Is there any other interface that resembles this?

dhruba borthakur
added a comment - 29/May/09 01:37 > Is there any other interface that resembles this?
The only thing that comes relatively close to this one is the READDIRPLUS operation in NFS. This call is more like getFileStatusBulk() for HDFS.

Doug Cutting
added a comment - 02/Jun/09 00:02 > But not sure why it returns a map
It could perhaps instead return an array of two-element structs, each containing a FileStatus/BlockLocation[] pair, but a Map seems simpler.
> How does a user figure out which were valid and which were invalid/non-existent/empty paths?
Non-existent paths should be ignored. Paths whose URIs are for different filesystems or are somehow unparseable should cause an exception.
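A minimal sketch of the error semantics Doug describes (hypothetical helper; IllegalArgumentException stands in for Hadoop's wrong-filesystem error): missing paths are silently dropped, while paths on a different filesystem fail loudly.

```java
import java.net.URI;
import java.util.*;

// Hypothetical validation sketch, not Hadoop code: non-existent paths are
// ignored, paths whose URI scheme names a different filesystem throw.
public class PathValidationSketch {
    public static List<String> filterPaths(List<String> paths, Set<String> existing, String fsScheme) {
        List<String> valid = new ArrayList<>();
        for (String p : paths) {
            URI u = URI.create(p);
            String scheme = u.getScheme();
            if (scheme != null && !scheme.equals(fsScheme)) {
                // wrong filesystem: cause an exception, per the comment above
                throw new IllegalArgumentException("Wrong FS: " + p);
            }
            String path = u.getPath();
            if (existing.contains(path)) {
                valid.add(path);   // keep valid, existing paths
            }
            // non-existent paths are ignored, not an error
        }
        return valid;
    }
}
```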

Jakob Homan
added a comment - 16/Jun/09 01:52 The current API implementation of getBlockLocations includes parameters for the byte offset within the file and the number of bytes within the files for which to return blocks. These parameters aren't provided currently in the specification for the new API. Would it be better to pass in an array of BlockLocationRequests, each of which would consist of the path, start and length?
The other option would be to add the start and length specifications and for them to apply to each of the paths within the array, which doesn't seem particularly useful.

dhruba borthakur
added a comment - 16/Jun/09 06:09 > pass in an array of BlockLocationRequests, each of which would consist of the path, start and length
+1. This sounds better than assuming that we need to send back all blocks for the specified path(s).

Doug Cutting
added a comment - 16/Jun/09 17:48 > Would it be better to pass in an array of BlockLocationRequests, each of which would consist of the path, start and length?
That is more general, but what is the use case? The motivating use case for listBlockLocations() is map reduce split construction, which typically takes a list of files as input, not a list of sections of files. Adding a feature that won't be used will just make this new API harder to use. -1 without a compelling use case.

Jakob Homan
added a comment - 16/Jun/09 18:37 That is more general, but what is the use case?
The original motivation was Arun and Owen noticing during the terasort work that a large number of RPC calls were made during task scheduling and that a bulk method could ameliorate that. That seems reasonable to me. I'll let Arun lobby further.
One point that came up in discussions is that it would be a good idea to have a maximum number of files that can be returned at once in order to not overwhelm the namenode. Whether this is hard-coded or configurable was not decided.

Doug Cutting
added a comment - 16/Jun/09 20:05 I meant, what is the use case for passing in start/end positions per file? I support the idea of a bulk call, but don't see the need to pass start/end positions per file.

dhruba borthakur
added a comment - 16/Jun/09 20:56 I think the extended version of the API would help in doing incremental distcp when hdfs-append is supported. We use "distcp -update" to do an incremental copy of files that have changed in length, but having this proposed extended API (and more) allows distcp to copy only changed portions of a file.

Arun C Murthy
added a comment - 17/Jun/09 02:42 Quick note: making the length mandatory (i.e. part of the api) has the unfortunate side-effect of forcing a stat on each file prior to the call to listBlockLocations. So, from a Map-Reduce perspective it is important to have an api which does not force InputFormats to pass in the lengths.
OTOH if we really need the more general version of the api I'd like to pass in "-1" to imply the whole file.
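A possible shape for the request type under this convention (the class name follows Jakob's suggestion and the "-1 means whole file" rule follows Arun's; nothing here is committed API):

```java
// Hypothetical BlockLocationRequest sketch: a path plus a byte range,
// where length == -1 means "from start to the end of the file", so the
// client never needs to stat the file first.
public class BlockLocationRequest {
    final String path;
    final long start;
    final long length;   // -1 => whole remainder of the file

    public BlockLocationRequest(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    // Resolve the request against the real file length once it is known
    // server-side (e.g. inside the namenode handler).
    public long effectiveLength(long fileLength) {
        return length < 0 ? fileLength - start
                          : Math.min(length, fileLength - start);
    }
}
```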

Doug Cutting
added a comment - 17/Jun/09 20:16 > I think the extended version of the API would help in doing incremental distcp when hdfs-append is supported.
Thanks for the use case! An append-savvy incremental distcp might first use listStatus to get all file lengths and dates from both filesystems, then figure out which had grown longer but whose creation dates had not changed, indicating they'd been appended to. Then a batch call could be made to fetch block locations of just newly appended sections, and these would be used to construct splits that can be localized well. Does that sound right?
In this case we would not list directories, but rather always pass in a list of individual files. The mapping from inputs to outputs would be 1:1 so it could take the form:
BlockLocation[] getBlockLocations(BlockLocationRequest[])
A corollary is that it does not make sense to pass start/end positions for a directory, although these could be ignored.
Do we want to try to develop a single swiss-army-knife batch call, or add operation-optimized calls as we go?

Sanjay Radia
added a comment - 08/Sep/09 22:14 Is the optimization for sending only partial block reports really necessary? Most files have very few blocks ...
Also Arun's point about the extra call needed for getFileStatus() is valid.
Why not create a class called DetailedFileStatus which contains both the file status and block locations:
DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested.
We can add the range one later if we really need that optimization.

Doug Cutting
added a comment - 09/Sep/09 20:08 > Is the optimization for sending only partial block reports really necessary?
It may help the append-savvy distcp use case, but is not needed in the mapred job submission use case. Even in the append-savvy distcp use case, it's not clear that it's required. Maybe we should punt on that until someone develops an append-savvy distcp?
> Why not create a class called DetailedFileStatus which contains both the file status and block locations:
Why is DetailedFileStatus[] better than Map<FileStatus,BlockLocation[]>? The latter seems more transparent.
> DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested.
That was intended for the append-savvy distcp use case. The original use case was for mapred job submission, where we typically have a list of directories. With directories there is not a 1:1 mapping.

Sanjay Radia
added a comment - 15/Sep/09 17:45 > Maybe we should punt that until someone develops an append-savvy distcp?
+1
>Why is DetailedFileStatus[] better than Map<FileStatus,BlockLocation[]>? The latter seems more transparent.
I was holding out against a file system interface returning a map. But that is old school.
Fine, I am convinced.
I suspect you also want the rpc signature to return a map (that makes me more nervous because most rpcs do not support that, but ours does, I guess).
Wrt the new FileContext api, my proposal is that it provides a single getBlockLocations method:
Map<FileStatus,BlockLocation[]> getBlockLocations(Path[] path)
and abandons the BlockLocation[] getBlockLocations(path, start, end).
(Of course FileSystem will continue to support the old getBlockLocations.)

Hairong Kuang
added a comment - 13/Jul/10 00:58 I am quite bothered that the proposed API returns a map. Is the reason for returning a map because the API does one-level listPath? Is there a use case that needs only one-level expansion?
If we eventually need to get the block locations of all files recursively under the input paths, is the following API a better choice?
/**
 * @return the block locations of all files recursively under the input paths
 */
Iterator<BlockLocation> getBlockLocations(Path[] paths)
When implementing this in HDFS, we might need to issue multiple RPCs and be very careful to limit the size of each RPC request and response.

Hairong Kuang
added a comment - 13/Jul/10 01:26 I read FileInputFormat and understand the use case much better. So the client needs to know FileStatus for filtering, and there is a configuration parameter to specify whether the input paths need to be traversed recursively. In this case, how about the following revised API?
class FileStatusAndBlockLocations {
  FileStatus fileStatus;
  BlockLocation[] blocks;
}
Iterator<FileStatusAndBlockLocations> getBlockLocations(Path[] paths, boolean isRecursive);

Hairong Kuang
added a comment - 15/Jul/10 01:08 The above proposed method is an API in FileSystem.
Internally in HDFS, I plan to add two new client-to-namenode RPCs:
class HdfsFileStatusAndBlockLocations {
  HdfsFileStatus fileStatus;
  BlockLocation[] blocks;
}
/**
 * Given an array of input paths, return an array of file statuses and block locations.
 * The input array and output array have the same size: the ith item in the output
 * array is the file status and block locations of the ith path in the input array.
 * If an input path is a directory, its block locations are empty.
 */
HdfsFileStatusAndBlockLocations[] getFileStatusAndBlockLocations(Path[] paths);
/**
 * Given an input directory, return the file statuses and block locations of its children.
 */
HdfsFileStatusAndBlockLocations[] listFileStatusAndBlockLocations(Path path);
Suppose the subtrees that represent a job's input paths contain N directories; the two APIs allow a dfs client to issue N+1 RPCs to the NameNode to implement the file system API proposed above.
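The N+1 accounting can be sketched with mock RPCs (the two method names follow the proposal above, but the types and everything else are illustrative stand-ins, not HDFS code):

```java
import java.util.*;

// Hypothetical client-side composition of the two proposed RPCs: one bulk
// call for the input paths themselves, plus one listing call per directory
// encountered -- N+1 RPCs for N directories.
public class BulkLocationClient {
    static Map<String, List<String>> dirs = new HashMap<>(); // dir -> children
    static int rpcCount = 0;

    static boolean isDir(String p) { return dirs.containsKey(p); }

    // RPC #1: statuses (and block locations for plain files) of the input paths
    static List<String> getFileStatusAndBlockLocations(List<String> paths) {
        rpcCount++;
        return paths;
    }

    // RPC #2: one-level listing of a single directory
    static List<String> listFileStatusAndBlockLocations(String dir) {
        rpcCount++;
        return dirs.get(dir);
    }

    public static List<String> collectFiles(List<String> inputPaths) {
        List<String> files = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        for (String p : getFileStatusAndBlockLocations(inputPaths)) {
            if (isDir(p)) pending.add(p); else files.add(p);
        }
        while (!pending.isEmpty()) {
            for (String child : listFileStatusAndBlockLocations(pending.poll())) {
                if (isDir(child)) pending.add(child); else files.add(child);
            }
        }
        return files;
    }
}
```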

Hairong Kuang
added a comment - 15/Jul/10 01:27 I also plan to use the same idea of iterative listing (HDFS-985) to limit the size of the response when listing FileStatusAndBlockLocations of a directory.

Hairong Kuang
added a comment - 16/Jul/10 19:08 I want to explain the difference between my proposal and the previous proposal.
1. For the FileSystem API, the user can specify whether the input paths need to be recursively traversed or not. The returned result is an iterator, which allows the input files to be fetched from the server one batch at a time so as to avoid OOM exceptions when the input paths are huge.
2. The design of new RPCs allows us to return HdfsFileStatus (local file name) instead of FileStatus (full path name), saving CPU processing time. It also allows us to easily limit the response size.
If nobody is against it, I will go ahead with the implementation.

dhruba borthakur
added a comment - 17/Jul/10 00:34 +1 to this proposal.
> The return result is an iterator, which allows the input files to be fetched from
However, if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?

Hairong Kuang
added a comment - 17/Jul/10 01:21 > if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC
I will reuse DFS_LIST_LIMIT introduced in HDFS-985. Its default value is 1000. So by default, 500 files will be fetched in one RPC.
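The batching behaviour can be sketched as follows (self-contained mock, not HDFS code; LIST_LIMIT plays the role of DFS_LIST_LIMIT with its default of 1000):

```java
import java.util.*;

// Hypothetical iterative listing sketch: each "RPC" returns at most
// LIST_LIMIT entries, and the client keeps fetching until done. A directory
// smaller than the limit therefore costs a single RPC.
public class PagedListing {
    static final int LIST_LIMIT = 1000;  // models DFS_LIST_LIMIT's default
    static int rpcCount = 0;

    // Mock "namenode": return up to LIST_LIMIT entries starting at `from`.
    static List<Integer> getPartialListing(List<Integer> all, int from) {
        rpcCount++;
        return all.subList(from, Math.min(from + LIST_LIMIT, all.size()));
    }

    public static List<Integer> listAll(List<Integer> all) {
        List<Integer> result = new ArrayList<>();
        List<Integer> batch;
        do {
            batch = getPartialListing(all, result.size());
            result.addAll(batch);
            // a full batch may mean more entries remain; a short batch means done
        } while (batch.size() == LIST_LIMIT && result.size() < all.size());
        return result;
    }
}
```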

Hairong Kuang
added a comment - 21/Jul/10 22:42 Taking multiple paths as an input to a FileContext API and an HDFS client-NN RPC seems to be a bad idea. It adds quite a lot of complexity for grouping paths by file systems and for resolving symbolic links. It does not sound clean and I'd like to avoid it. So here is the revised proposal:
class LocatedFileStatus extends FileStatus {
  BlockLocation[] blocks;
}
FileSystem and FileContext will have a new API
public Iterator<LocatedFileStatus> listLocatedFileStatus(Path path, boolean isRecursive);
This new API is similar to FileContext#listStatus in many ways, except that the returned LocatedFileStatus contains its block locations and, if isRecursive is true, all the files in the subtree rooted at the input path will be returned.
Similarly in HDFS, we will have
class HdfsLocatedFileStatus extends HdfsFileStatus {
  BlockLocation[] blocks;
}
ClientProtocol will add one more parameter, "boolean withLocation", to the existing getListing RPC:
public DirectoryListing getListing(String src,
    byte[] startAfter,
    boolean withLocation)
  throws AccessControlException, FileNotFoundException,
         UnresolvedLinkException, IOException;
If withLocation is false, the semantics are the same as before. When withLocation is true, the DirectoryListing will contain LocatedFileStatus entries.
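A minimal model of that flag semantics (stand-in types, not the real ClientProtocol): the same listing call returns plain statuses unless locations are requested, so existing callers are unaffected.

```java
import java.util.*;

// Hypothetical sketch of getListing with a needLocation-style flag: block
// locations are piggybacked on the listing only when asked for.
public class GetListingSketch {
    static class FileStatus {
        final String path;
        FileStatus(String path) { this.path = path; }
    }
    static class LocatedFileStatus extends FileStatus {
        final String[] blocks;
        LocatedFileStatus(String path, String[] blocks) { super(path); this.blocks = blocks; }
    }

    static Map<String, String[]> namespace = new LinkedHashMap<>(); // file -> hosts

    public static List<FileStatus> getListing(String src, boolean needLocation) {
        List<FileStatus> listing = new ArrayList<>();
        for (Map.Entry<String, String[]> e : namespace.entrySet()) {
            if (!e.getKey().startsWith(src)) continue;             // crude child check
            listing.add(needLocation
                ? new LocatedFileStatus(e.getKey(), e.getValue())  // locations piggybacked
                : new FileStatus(e.getKey()));                     // old semantics intact
        }
        return listing;
    }
}
```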

Hairong Kuang
added a comment - 23/Jul/10 20:12 I am not sure what we should do if a child of the input directory is a symbolic link. Whether the symbolic link should be resolved or not is better decided by applications.
It seems cleaner if the new API changes to listLocatedFileStatus(Path path), so it does not traverse the subtree recursively and it returns all the content of the directory. BlockLocations are piggybacked if a child is a file. This design decision leaves questions like how to deal with a child that is a symbolic link or a directory to be answered by applications.

Hairong Kuang
added a comment - 23/Jul/10 22:42 Hi Doug, thanks for your review comments.
Yes, handling of symlinks should be addressed in FileContext in HADOOP-6870. HDFS-202 serves as the discussion board for this issue, so I posted the question here.
My question is whether this new API should handle recursive traversal and symbolic-link resolution. Is it cleaner if it does not do any of these and leaves the decisions to applications?

Doug Cutting
added a comment - 23/Jul/10 23:19 > My question is whether this new API should handle recursive traversal and symbolic resolution.
My intuition is that recursive file listings for open should follow symbolic links, since open follows symbolic links. Recursive traversal for remove should not follow symbolic links, but should just remove the symbolic link, like remove does on a symbolic link.

Hairong Kuang
added a comment - 27/Jul/10 05:42 hdfsListFiles1.patch supports the listFiles API in both DistributedFileSystem and Hdfs.
It has two unit tests. In particular, TestListFilesInFileContext includes a test on an input directory that contains two symbolic links.

Suresh Srinivas
added a comment - 29/Jul/10 22:33
HDFS.java
Not sure about "NB:" in the comment.
DirListingIterator
make f, src, needLocation final. Add javadoc to the class.
getNext() should call hasNext(). A caller calling next() without calling hasNext() could result in not fetching a new partial list and also an ArrayIndexOutOfBoundsException.
DistributedFileSystem
listLocatedStatus() make src final.
listFiles() make itor, curFile private and dirStats final.
HDFSFileLocatedStatus.java - missing banner.
FSNamesystem.java - make getBlockLocationsInternal() private.
NameNode.java - change variable hasLocation to needLocation. The getListing() variant with the boolean flag is just used by fsck? Do we need that variant?
General - as in the hadoop common implementation, on IOException, should next() and hasNext() throw RuntimeException instead of returning false? Please note the comment in HADOOP-6870 about whether this is the right way to handle FileNotFoundException.

Doug Cutting
added a comment - 30/Jul/10 01:28 By default I think exceptions should be thrown. This is like the return status of the unix 'ls' command, which is non-zero if, e.g., a directory is unreadable. But perhaps an option to force enumeration in light of exception would be useful.

Suresh Srinivas
added a comment - 30/Jul/10 01:42 Unix 'ls' returns all the results in one shot. However, when getting responses iteratively the behavior is different:
When listing a single directory, if some ls results have been returned and the directory is deleted, we should throw FileNotFoundException, to indicate the directory is no longer available.
When recursively listing under a directory, if a subdirectory is deleted, the more appropriate response is to ignore FileNotFound for that directory and return the remaining results. This would be consistent with the result if the command were repeated. Further, if an application is recursively listing a large directory whose state keeps changing, it may have to try many times to list it.
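Suresh's distinction can be sketched as follows (mock types; the unchecked NotFound exception stands in for FileNotFoundException, and this models his proposal only, which the next comment argues against):

```java
import java.util.*;

// Hypothetical sketch: a missing root propagates the error, while a
// subdirectory deleted mid-recursion is skipped, matching what a re-run
// of the listing would return.
public class RecursiveListSketch {
    static class NotFound extends RuntimeException {
        NotFound(String path) { super(path); }
    }

    static Set<String> isDirectory = new HashSet<>();            // known directories
    static Map<String, List<String>> listings = new HashMap<>(); // dir -> children

    static List<String> list(String dir) {
        List<String> children = listings.get(dir);
        if (children == null) throw new NotFound(dir);           // deleted or absent
        return children;
    }

    public static List<String> listRecursive(String root) {
        List<String> files = new ArrayList<>();
        for (String child : list(root)) {                        // missing root propagates
            if (!isDirectory.contains(child)) { files.add(child); continue; }
            try {
                files.addAll(listRecursive(child));
            } catch (NotFound deletedMeanwhile) {
                // subdirectory vanished between calls: ignore and keep going
            }
        }
        return files;
    }
}
```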

Hairong Kuang
added a comment - 30/Jul/10 17:41 As I commented in HADOOP-6890, I would prefer throwing exceptions when a file/directory is deleted during listing. This is because getFiles is used by the MapReduce job client to calculate splits, so the expectation is that the input directories remain unchanged during job execution. It is good to fail the job sooner rather than later.

Hadoop QA
added a comment - 04/Aug/10 05:54 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
against trunk revision 982091.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 15 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/228/console
This message is automatically generated.

Hairong Kuang
added a comment - 11/Aug/10 20:41 Not able to run ant test-patch because the trunk does not compile. But I checked that this patch does not introduce new Javadoc warnings and that it adds new tests. There were quite a few unit tests failing, but they seem not related to this patch.

Hairong Kuang
added a comment - 11/Aug/10 21:45 Konstantin, the hdfs trunk should be able to compile because I've committed this patch. HDFS-202 is the HDFS side of HADOOP-6900!
Thanks Suresh for reviewing this patch at full speed!