Details

Added HDFS file access times. By default, access times will be precise to the most recent hour boundary. A configuration parameter dfs.access.time.precision (milliseconds) is used to control this precision. Setting a value of 0 will disable persisting access times for HDFS files.
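
As a rough illustration of the behavior described above, the precision parameter could work like the following sketch. Only the parameter name dfs.access.time.precision comes from the change itself; the class and method names here are invented, not code from the HDFS patch:

```java
// Illustrative sketch only, not the actual HDFS implementation: models how
// a millisecond precision setting (dfs.access.time.precision) could govern
// when an access time is persisted, with 0 disabling persistence entirely.
public class AccessTimePolicy {
    private final long precisionMs; // e.g. 3600000L for one-hour precision

    public AccessTimePolicy(long precisionMs) {
        this.precisionMs = precisionMs;
    }

    /** Round an access time down to the most recent precision boundary. */
    public long round(long accessTimeMs) {
        return precisionMs == 0 ? accessTimeMs
                                : accessTimeMs - (accessTimeMs % precisionMs);
    }

    /** Whether a new access at nowMs should be persisted at all. */
    public boolean shouldPersist(long storedAtimeMs, long nowMs) {
        if (precisionMs == 0) {
            return false; // a value of 0 disables persisting access times
        }
        return round(nowMs) > storedAtimeMs; // at most once per boundary
    }
}
```

With an hourly precision, repeated reads within the same hour trigger at most one persisted update, which is the batching effect the description is after.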

Description

HDFS should support some type of statistics that allows an administrator to determine when a file was last accessed.

Since HDFS does not have quotas yet, it is likely that users keep on accumulating files in their home directories without much regard to the amount of space they are occupying. This causes memory-related problems with the namenode.

Access times are costly to maintain. AFS does not maintain access times. I think DCE-DFS does maintain access times, at a coarse granularity.

One proposal for HDFS would be to implement something like an "access bit".
1. This access-bit is set when a file is accessed. If the access bit is already set, then this call does not result in a transaction.
2. A FileSystem.clearAccessBits() indicates that the access bits of all files need to be cleared.

An administrator can effectively use the above mechanism (maybe a daily cron job) to determine files that are recently used.
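
The access-bit mechanism in the two steps above can be modeled as a toy simulation. The class below is illustrative only (it is not HDFS code), with a counter standing in for namenode transactions:

```java
// Toy model of the proposed "access bit": setting the bit on first access
// costs one transaction; repeated accesses are free until the bits are
// cleared (e.g. by a daily cron job). Names are illustrative, not HDFS.
import java.util.HashMap;
import java.util.Map;

public class AccessBits {
    private final Map<String, Boolean> bits = new HashMap<>();
    private long transactions = 0; // counts simulated edit-log writes

    /** Record an access; costs a transaction only if the bit was clear. */
    public void access(String path) {
        if (!Boolean.TRUE.equals(bits.get(path))) {
            bits.put(path, true);
            transactions++; // only the first access since the last clear
        }
    }

    /** FileSystem.clearAccessBits() analogue: reset all bits. */
    public void clearAccessBits() {
        bits.clear();
    }

    public boolean wasAccessed(String path) {
        return Boolean.TRUE.equals(bits.get(path));
    }

    public long transactionCount() {
        return transactions;
    }
}
```

The point of the design is visible in the counter: a file read a thousand times between two cron runs still costs a single transaction.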

Activity

Doug Cutting
added a comment - 10/Sep/07 20:30 Another approach might be to process the namenode logs (if they're kept) to find which files have been accessed when.
Also, in HDFS, would access times really be that expensive? We have relatively few files and relatively many blocks. So increasing the data structure size of a file shouldn't be that costly. The larger expense might be logging each time a file is opened. How bad would that be? Perhaps we could make it optional?
I'm just brainstorming...

Allen Wittenauer
added a comment - 10/Sep/07 20:31 Currently on our bigger grids, we have a significant number of files that we aren't sure whether anyone is actually using or not (e.g., /tmp). While I recognize that atime is a huge performance killer, whenever one deals with users who have free rein over their space, it is an incredibly important tool to help maintain the system.
This is especially important given the lack of ACLs. On our larger grids, there are many files that are just kind of scattered all over that we have no real insight as to their purpose, much less their usage pattern. All we know is that we didn't put them there.
Having an access log would at least tell us whether something is being used. Once users get added to the system, having the user information combined with whether a file was touched will be extremely handy.
Operationally, I see this being used by dumping the data on a regular interval into an RDBMS or perhaps even inside HDFS itself. It is then fairly trivial to create tools and form policies around data retention.

dhruba borthakur
added a comment - 10/Sep/07 22:16 I am not too worried about data structure size. But if we have to write a transaction for every file access, that could be a performance killer, do you agree? My proposal was to somehow batch access-time updates to the disk (using the access bit). Using the access bit and a cron job that runs every hour could effectively provide a coarse-grained access time.

Doug Cutting
added a comment - 10/Sep/07 22:30 > if we have to write a transaction for every file access, that could be a performance killer, do you agree?
I don't know. Logging file opens should be good enough, right? How much is transaction logging a bottleneck currently? How much worse would this make it? If files average ten or more blocks, and we're reading files not much more often than we're writing them, then the impact might be small.
Another option to consider is making this a separate log that's buffered, since its data is not as critical to filesystem function. We could flush the buffer every minute or so, so that when the namenode crashes we'd lose only the last minute of access time updates. Might that be acceptable?

Sameer Paranjpye
added a comment - 10/Sep/07 23:08 > Another option to consider is making this a separate log that's buffered
This is a pretty good option. It should cost almost nothing to write the access to a buffered log. I suspect that for the use case Allen describes it would be sufficient to discover the outliers, i.e. files and directories that haven't been accessed in months.

dhruba borthakur
added a comment - 10/Sep/07 23:40 I agree: we can record the timestamp when an "open" occurred in namenode memory, mark the inode as dirty, and then allow an async flush daemon to persist these dirty inodes to disk once every period. That should be optimal and lightweight.
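
A minimal sketch of that batching scheme, with invented names (a toy in-memory model, not namenode code): recordAccess is the cheap per-open path, and flush() is what the periodic daemon would call:

```java
// Toy model of batched access-time updates: every open updates memory and
// marks the entry dirty; a periodic flusher persists and clears the dirty
// set. Illustrative only; the "persist" step is a stub.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchedAtimeStore {
    private final Map<String, Long> atimes = new LinkedHashMap<>();
    private final Map<String, Long> dirty = new LinkedHashMap<>();

    /** Called on every open: cheap in-memory update, no disk I/O. */
    public void recordAccess(String path, long nowMs) {
        atimes.put(path, nowMs);
        dirty.put(path, nowMs);
    }

    /** Called periodically by the flush daemon: persist the dirty set. */
    public List<String> flush() {
        List<String> persisted = new ArrayList<>(dirty.keySet());
        // ... here a real system would write (path, atime) pairs to disk ...
        dirty.clear();
        return persisted;
    }

    public Long getAccessTime(String path) {
        return atimes.get(path);
    }

    public int dirtyCount() {
        return dirty.size();
    }
}
```

Note that repeated opens of the same file collapse into one dirty entry, so the flush cost is bounded by the number of distinct files accessed per period.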

Enis Soztutar
added a comment - 11/Sep/07 14:56 Working on some jobs, I have realized that nearly all MR jobs somehow create temporary directories/files. For various reasons (program bug, program crash, etc.), these may not be deleted properly. We could add a FileSystem#createTempFile() that attaches a timestamp, and after some period these could be deleted by batch tasks.

Allen Wittenauer
added a comment - 08/Jan/08 01:23 Now that we have permissions and ownerships, such an access log would also need to differentiate between reads, writes, ownership changes, and permission changes. This would be extremely helpful in case forensics need to be performed to track down someone doing Bad Things(tm).

dhruba borthakur
added a comment - 18/Aug/08 22:25 I plan on doing the following:
1. Add a 4-byte field to the in-memory inode to maintain access time.
2. Create a new transaction OP_SET_ACCESSTIME in the edits log.
3. Every access that reads data/metadata for a file will do the following:
– update the in-memory access time of the inode.
– write an OP_SET_ACCESSTIME entry for this inode to the edits log buffer; do not sync or flush the buffer.
4. Enhance the dfs shell/webUI to display access times of files/directories.
This should not adversely impact the transaction processing rate of the namenode. Other types of transactions (e.g. file creation) will cause the transaction-log buffer to get synced to disk pretty quickly anyway. This implementation will not distinguish between different kinds of metadata accesses and is primarily targeted at weeding out files that have not been used for a very long time.

Raghu Angadi
added a comment - 18/Aug/08 22:45 Most tests seem to indicate that the amount of data synced matters as well (along with the number of syncs). I will be surprised if a benchmark that tests a mixed load (say 10% writes, 90% reads) is not impacted.

Joydeep Sen Sarma
added a comment - 18/Aug/08 22:48 +1.
This looks good enough. I am wondering if you can also log the username somewhere (either the edit log or the namenode log). We are potentially interested in userids reading specific parts of the file system (and then maintaining last-reader information at the directory level), so this may allow us to tail the appropriate log and pick such information up.

dhruba borthakur
added a comment - 18/Aug/08 23:42 I agree that HADOOP-3336 does this from an auditing perspective.
I am interested in making some form of archival store in HDFS. Files that are not used for a long time can automatically be moved to slower and/or denser storage. Given the rate at which cluster sizes increase, and given the fact that the cost to store data for an infinitely long time is very low, it makes sense for the file system to make intelligent storage decisions based on how/when data was accessed. This argues for "access time" to be stored in the file system itself.
HADOOP-3336 can be used to accomplish this to some extent; the separate log that it generates can be periodically merged with the file system image. But I feel that design is a little awkward and not too elegant.

Konstantin Shvachko
added a comment - 19/Aug/08 00:19 I think this proposal is in the right direction.
According to HADOOP-3860, the name-node can currently perform 20-25 times more opens per second than creates. Which means that if we let every open / getBlockLocations be logged and flushed, we lose big.
Another observation is that map-reduce does a lot of ls operations, both for directories and individual files. I have seen 20,000 per second. This is done when the job starts and depends on the user input data and on how many tasks the job should run.
So maybe we should not log file access for ls, permission checking, etc. I think it would be sufficient to write OP_SET_ACCESSTIME only in the case of getBlockLocations(). Also, I think we should not support access time for directories, only for regular files.
Another alternative would be to keep the access time only in the name-node memory. Would that be sufficient to detect "malicious" behavior of some users? Name-nodes usually run for months, right? So before, say, upgrading the name-node, or simply every (other) week, administrators may look at files that have never been touched during that period and act accordingly.
My main concern is that even though with Dhruba's approach we will batch access operations and would not lose time on flushing them directly, the journaling traffic will double; that is, with each flush more bytes need to be flushed, meaning increased latency for each flush and bigger edits files.
It would be good to have some experimental data measuring throughput and latency for getBlockLocations with and without ACCESSTIME transactions. The easy way to test would be to use NNThroughputBenchmark.

dhruba borthakur
added a comment - 19/Aug/08 00:31 I agree with Konstantin on most counts. I will probably implement access times for files and directories for getblocklocations RPC only and then run NNThroughputBenchmark to determine its impact.

dhruba borthakur
added a comment - 19/Aug/08 06:55 Yes, Konstantin, I meant "access time for files for the getblocklocations RPC only".
Regarding Joydeep's requirement about recording the user-name of last access, I agree with Raghu that it is more likely a case for HADOOP-3336.

Raghu Angadi
added a comment - 19/Aug/08 19:05 If and only if the increase in EditsLog data noticeably affects performance (which I suspect it will), a couple of options:
Provide an accuracy setting. For example, if an admin is interested only in day-level accuracy, this setting could be 24 hours, and an access to a file within 24 hours would be recorded only once.
Maintain only the in-memory access times. Since the NameNode is not restarted very often, admins could always run a script every week or so to get a snapshot of access times. When the NameNode restarts, a similar script could update the access times from the previous snapshot.

dhruba borthakur
added a comment - 19/Aug/08 19:38 Hi Raghu, the idea you propose ("update access times once every 24 hours") sounds good. However, how will the namenode remember which inodes are dirty and need to be flushed? It can keep a data structure to record all inodes that have "dirty" access times, but that needs memory. Another option would be to "walk" the entire tree looking for "dirty" inodes. Neither approach is very appealing. Do you have any other options?

Raghu Angadi
added a comment - 19/Aug/08 19:53 > However, how will the namenode remember which inodes are dirty and need to be flushed?
It does not need any more structures than what you have proposed ("Add a 4 byte field to the in-memory inode to maintain access time."). So at each access, you check if this 4-byte field is older than the "accuracy setting"; if so, you add an EditLog entry. If you want to keep the in-memory time accurate (which I don't think is required), then you need to add 4 more bytes to record the "last logged time". Right?
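
That check can be sketched as follows; a toy model with invented names, where the counter stands in for OP_SET_ACCESSTIME edit-log writes:

```java
// Toy model of Raghu's suggestion: reuse the single in-memory atime field,
// and write an edit-log entry only when the stored value is older than the
// accuracy setting. No extra structures or tree walks are needed.
public class AccuracyCheck {
    private final long accuracyMs;
    private long storedAtimeMs;   // the one per-inode access-time field
    private long editLogEntries;  // counts simulated OP_SET_ACCESSTIME writes

    public AccuracyCheck(long accuracyMs, long initialAtimeMs) {
        this.accuracyMs = accuracyMs;
        this.storedAtimeMs = initialAtimeMs;
    }

    /** Called on each access (e.g. on getBlockLocations). */
    public void onAccess(long nowMs) {
        if (nowMs - storedAtimeMs >= accuracyMs) {
            storedAtimeMs = nowMs; // update the field and log it
            editLogEntries++;
        }
        // otherwise: still within the accuracy window, do nothing
    }

    public long entries() { return editLogEntries; }
    public long atime() { return storedAtimeMs; }
}
```

Because the entry is written at the moment of the first qualifying access, nothing needs to "remember" dirty inodes later: the log is already up to date within the accuracy bound.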

Allen Wittenauer
added a comment - 19/Aug/08 20:04 Until we get volumes or the equivalent, it would be good to have the accuracy setting take a path. For example, I might want to have a more accurate setting for data outside a user's home directory.

Raghu Angadi
added a comment - 19/Aug/08 20:26 Just to clarify once again, the first option above does not require any more memory or tree traversals. It just reduces the number of entries to the EditsLog. It's actually just a tweak.

dhruba borthakur
added a comment - 19/Aug/08 21:49 Raghu, just to make sure I understand right: let's say the accuracy setting is 24 hours. Now, suppose I read the contents of a file /tmp/foo now at 1PM. The in-memory inode is updated with the access time of 1PM. But it is not recorded in the transaction log. Let's assume that no other files are accessed in the file system for the entire next day.
When it is 1PM tomorrow, the system has to remember that /tmp/foo needs to be flushed. How does this occur? How does hdfs find out that the inode /tmp/foo is dirty and has to be flushed to the transaction log?

Raghu Angadi
added a comment - 19/Aug/08 22:01 > The in-memory inode is updated with the accesstime of 1PM. But it is not recorded in the transaction log.
It is recorded in the transaction log at this time (assuming it was not accessed in 24 hours prior to that).
> When it is 1 PM tomorrow, the system has to remember that /tmp/foo needs to be flushed. How does this occur? [...]
It does not need to remember, since the transaction was written at 1 PM previous day.
I am trying to see if I am missing something here. Note that the effect of not sync-ing the edits log for each access is the same as before.

Tsz Wo Nicholas Sze
added a comment - 19/Aug/08 22:51 > IOW, a last access time of 't' returned by NameNode for file implies "this file was last accessed during [t, t+24h)".
Basically, it is recording "access date" for the case above.
BTW, is it expensive to invoke System.currentTimeMillis()? I don't have any idea.

dhruba borthakur
added a comment - 22/Aug/08 07:46 This patch does the following:
1. Implements access time for files. Directories do not have access times.
2. The access times of files are precise up to an hour boundary. This means that all accesses to a file between 2PM and 2:59PM will have an access time of 2PM.
3. The transaction to write access time to edits is not synced. It is assumed that it will get synced as part of other transactions.
4. There is a configuration parameter to switch off access-times. By default, it is on.
5. There is a new FileSystem API setAccessTime() to set the access time on a file.
6.

Raghu Angadi
added a comment - 22/Aug/08 16:33 > 2. The access times of files is precise upto an hour boundary
Can this be made configurable? If not now, I am 100% certain it will be made configurable in near future. I don't see any advantage to not making this a config variable (but probably there is one). In fact we might have a feature request to make it runtime-configurable.. but not required in this jira.

Konstantin Shvachko
added a comment - 23/Aug/08 00:44 A few preliminary comments.
I agree; we should make INode.accessTimePrecision a configuration parameter. In any case, I did not expect it to be an extra field in INode (not even static!!).
If we introduce hdfs.access.time.precision as a config parameter, then you do not need dfs.support.accessTime, because you can either set hdfs.access.time.precision to a big (unreachable) number or pick a special value, say "0", to indicate that atime should not be supported.
FileSystem.setAccessTime() and related changes to client classes, including ClientProtocol, seem to be redundant. We already have such a method: just call getBlockLocations().
In FSDirectory.setAccessTime() I would modify atime in memory every time it's requested. This is free once you already have the INode. Only the call to logAccessTime() should be done once an hour.

dhruba borthakur
added a comment - 25/Aug/08 11:23 Incorporated most review comments. I do not update the in-memory access time every time. The in-memory access time is in sync with the value persisted on disk. Otherwise, the access time of a file could move back in time when a namenode restarts!
I also ran benchmarks with NNThroughputBenchmark. All benchmarks remain at practically the same performance. In particular, the "open benchmark with 300 threads and 100K files" is as follows:
patch   trunk
-----   -----
59916   59865 ops/sec
59171   59191 ops/sec

Konstantin Shvachko
added a comment - 27/Aug/08 02:18
Test data (-1-2%) looks good.
FSConstants
accessTimePrecision is not a constant and is not used anywhere.
White space added.
FileSystem
I strongly disagree with the introduction of FileSystem.setAccessTime().
File systems usually do not provide such an interface.
Access time is changed to current when you access a file (call getBlockLocations() in our case), but clients should not be able to set atime. It is not secure. Users can mistakenly set it to some time in the past, or maliciously to a future value, and screw the system up.
White space added.
FSNamesystem. It is better to replace the boolean member supportAccessTime by a method:
boolean isAccessTimeSupported() {
  return accessTimePrecision > 0;
}
INode. Please revert changes to this line:
+ /** * Get last modification time of inode.
FSDirectory. The name-node will print the message below every time you access the file:
NameNode.stateChangeLog.info( "File " + src +
    " already has an precise access time."
This is going to be a problem for log watchers, and will affect performance. Is it just a debug message? I'd remove it.
I thought that ls should print aTime. Did not find any references to FileStatus.getAccessTime() in the shell files.

dhruba borthakur
added a comment - 27/Aug/08 22:38 Incorporated all of Konstantin's review comments other than the one that said the setAccessTime() call should be removed.
The setAccessTime() call is a utility method that allows an application to set the access time of a file without having to "open" the file. The permission-access checks are precisely the same as those for "opening" a file, so there isn't any security concern IMO. Most file systems support setting access/modification times on a file; see http://linux.die.net/man/2/utimes.
I purposely did not add a command to the FsShell to display access times. An application can fetch the accessTime using a programmatic API FileSystem.getFileStatus().
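
For what it's worth, Java's own filesystem API exposes the utimes-style operation pointed to above. The sketch below touches a local file (not HDFS) and is only an analogy for the proposed FileSystem.setAccessTime():

```java
// Analogy only: sets just the access time of a local file via java.nio,
// leaving mtime and creation time untouched (nulls mean "keep unchanged").
// This is standard Java API, not HDFS code.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class TouchAtime {
    /** Set only the access time and return the value read back. */
    public static FileTime setAccessTime(Path p, long atimeMs) throws IOException {
        BasicFileAttributeView view =
            Files.getFileAttributeView(p, BasicFileAttributeView.class);
        view.setTimes(null, FileTime.fromMillis(atimeMs), null);
        return Files.readAttributes(p, BasicFileAttributes.class).lastAccessTime();
    }
}
```

This is the shape of interface the utimes(2) man page describes: atime and mtime are settable by the file owner, which is the precedent the comment above appeals to.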

Konstantin Shvachko
added a comment - 28/Aug/08 00:08 - edited
I don't understand. There must be some secret use case that you don't want to talk about or something. I have all these questions:
Why do we need to be able to setAccessTime() to Aug 31, 1991 or Jul 15, 2036?
Why do we need setAccessTime() but never needed setModificationTime()?
My understanding is that the main use case for access time is for ops to be able to recognize files that have not been used for the last 6 months and remove them on the basis that they are old. So a user can lose files if he mistakenly sets aTime to, e.g., 1 B.C. Alternatively, a user can set aTime on files 1 year in advance, and that will keep ops from removing them for the next 1.5 years.
More: the local file system, KFS, and S3 will all not support setAccessTime(), but HDFS will. Is it right to make it a generic FileSystem interface?
Pointing to utime() is the same as pointing to FSNamesystem.unprotectedSetAccessTime(). I still cannot change aTime or mTime using bash.
I guess I am saying I am ok with a touchAC() method (as in touch -ac), but it is already there, called getBlockLocations(), and I don't see why you need more.
The rest looks great.

Pete Wyckoff
added a comment - 28/Aug/08 00:24
I can see where the capability to set access time would be extremely useful for the FUSE case (and the NFSv4 proxy case, should that come to pass).
+1. It would be nice, as "touch /export/hdfs/foo" seems to be a common way of checking fuse-dfs, and the returned I/O error is confusing since the module is working but just can't implement touch. I guess it could be a no-op, but when access time is configured it would be nice to have.

Raghu Angadi
added a comment - 28/Aug/08 00:30 - edited
> it would be nice as "touch /export/hdfs/foo"
Isn't touch just 'append(f); close(f)'? That is what command-line touch seems to do.
Edit: I am just talking about touch; no strong opinion w.r.t. setModificationTime() and setAccessTime().

Pete Wyckoff
added a comment - 28/Aug/08 01:07
In fuse-dfs, this is the error I get:
touch /export/hdfs/user/pwyckoff/foo
touch: setting times of `/export/hdfs/user/pwyckoff/foo': Function not implemented
So, I assumed one needs to implement some attribute-setting function. But this is against 0.17, so appends also give me an I/O error.

Konstantin Shvachko
added a comment - 28/Aug/08 01:57
touch() is getBlockLocations() in terms of HDFS.
Preserving file times while copying or tar-ing is useful, no doubt about it, but it is rather different from being able to set them to an arbitrary value.
setAccessTime() gives more than what is needed.

dhruba borthakur
added a comment - 28/Aug/08 02:46
Hi Konstantin, I am not fixated on providing the setAccessTime method. However, I think it is a powerful feature that can be used by any hierarchical storage system. It even has precedent on all Unix and Windows systems:
http://linux.die.net/man/2/utimes
If more people feel strongly about not having the setAccessTime API, I will remove it from the patch.

Raghu Angadi
added a comment - 28/Aug/08 03:17 - edited
+1 for setAccessTime() in HDFS.
-1 for adding it to the FileSystem API. As we move towards stable APIs we really need to be more cautious about adding new APIs. Once there is enough utility demonstrated by this API in HDFS, it can move up (perhaps in a different form).
Regarding touch, its main function is to change the modification time (and create the file if it does not exist).
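The touch semantics being debated (plain touch vs. touch -ac) can be sketched with a small in-memory model. This is hypothetical illustration only — TouchModel and its map are invented, not part of any Hadoop API: plain touch creates the file if absent and otherwise updates both times, while touch -ac refreshes only atime and (per -c) never creates.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the touch semantics discussed above.
// Not Hadoop code: "files" maps a path to {mtime, atime}.
public class TouchModel {
    static Map<String, long[]> files = new HashMap<>();

    // plain touch: create if absent, else set both times to now
    static void touch(String path, long now) {
        files.put(path, new long[] { now, now });
    }

    // touch -ac: update atime only, and never create the file (-c)
    static void touchAC(String path, long now) {
        long[] t = files.get(path);
        if (t != null) t[1] = now;
    }

    public static void main(String[] args) {
        touch("/user/foo", 100L);    // creates with mtime=atime=100
        touchAC("/user/foo", 200L);  // atime -> 200, mtime stays 100
        touchAC("/user/bar", 300L);  // no-op: -c means do not create
        long[] t = files.get("/user/foo");
        System.out.println(t[0] + " " + t[1]);
        System.out.println(files.containsKey("/user/bar"));
    }
}
```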

Konstantin Shvachko
added a comment - 28/Aug/08 19:18 - edited
This is what I get from the discussion above:
- The touchAC() method is useful and can find immediate application in fuse.
- For archival and (remote) copy purposes we will need a flag that would create new files with time (and maybe other) attributes having THE SAME value as the source files.
- There is no reasonable use case for a setAccessTime method at the moment. I mean nobody needs to set aTime to yesterday or tomorrow unless it is the time of the file being copied or untarred.
In the spirit of minimizing impact on external APIs I propose to:
- introduce FileSystem.touchAC() and implement it in HDFS as a call to getBlockLocations();
- not introduce setAccessTime() anywhere in HDFS client code, including ClientProtocol.
We can always introduce setAccessTime() later when it is really necessary, because once introduced it's hard to take it back.

Pete Wyckoff
added a comment - 28/Aug/08 19:36
+1 to not unduly adding stuff to the API, but when the thing in mind is something needed to support a POSIX API, shouldn't there be an exception? We already know, as Allen points out, that setAccessTime is needed to implement tar properly, so isn't that an important enough application?

dhruba borthakur
added a comment - 28/Aug/08 19:57
I still vote for having FileSystem.setAccessTime(). It is very much like POSIX and I do not see a security issue anywhere at all. Also, most archiving/unarchiving utilities set the access time. For example, a restore utility will need to set the access time of a file.

Konstantin Shvachko
added a comment - 28/Aug/08 20:10
As I said before, setAccessTime has wider semantics than is required for tar.
setAccessTime lets you set an arbitrary value for aTime, while in tar you just need to replicate the time of an existing file.
On the other hand, setting only the access time is not sufficient for tar. You also want to keep the same modification time, owner, group, and permissions. So to me it makes more sense to introduce a new create method with the FileStatus of the existing file as a parameter, or a copy method with an option to replicate FileStatus fields.

Konstantin Shvachko
added a comment - 28/Aug/08 20:19
POSIX is not an argument in connection with HDFS.
I always find it hard to argue with my kind when, after all the pros and cons, they say "I still want it".
What restore utility?

Raghu Angadi
added a comment - 28/Aug/08 21:40
If we want to provide POSIX-like functionality whenever possible (which I think is a good idea), it would make more sense to keep the names/parameters etc. similar as well. Though not part of this jira, I am +1 for FileSystem.utimes() and -1 for FileSystem.setAccessTime().

Konstantin Shvachko
added a comment - 28/Aug/08 22:01
Sorry, I created some confusion. When I talked about touch() I meant changing the access time to current, but Raghu just corrected me that touch() in POSIX can also change the modification time. Just to clarify, I did not want to say anything about modification time; everything I talked about was access times. Precisely, I meant:
touch -ac foo
I'll replace touch() with touchAC() in my earlier posts to clarify that, if anybody is curious enough to read them.

Allen Wittenauer
added a comment - 28/Aug/08 22:37
Maybe this is a dumb question, but how will hadoop archives (htars?) interact with access times?
Also, at least to me, an API called touchAC() seems very non-obvious as to its purpose (especially if I'm doing this on Windows).
dhruba borthakur
added a comment - 28/Aug/08 23:54 - edited
I like Raghu's proposal that FileSystem.setAccessTime() can be renamed to FileSystem.utimes(FileStatus). But it creates some other issues:
1. The FileStatus object has the blockSize of the file. The blockSize cannot be changed. Similarly, the FileStatus object has a field called 'isdir'. What happens to this one?
2. Similarly, the FileStatus has the length of the file. Are we going to truncate the file (or create a sparse file with holes if the user sets a longer length)?
3. There are existing APIs FileSystem.setReplication(), FileSystem.setOwner(), setGroup(), setPermissions(), etc. Will these be deprecated or coexist with the new API?
I prefer adding a setAccessTime because it allows an application to set the access time to an arbitrary value. If we want to merge all the above APIs into FileSystem.utimes(), I can do it as part of a separate JIRA.
Raghu, Konstantin: does it sound ok?

Raghu Angadi
added a comment - 29/Aug/08 00:40 - edited
I don't know how FileStatus, blockSize, etc. matter. I would have thought it would be something like FileSystem.utimes(path, modTime, aTime), keeping the behaviour as close as possible to the POSIX man page, just like the FileSystem.setAccessTime() interface you added.
I suggested the name utimes() since you gave utimes() as the justification. If you think setAccessTime() should be the name, that's alright I guess.
The API stuff could be better done in a different jira IMHO.
Edit: minor

dhruba borthakur
added a comment - 29/Aug/08 00:44 Hi Raghu, Thanks. I like your proposal of having
FileSystem.utimes(path, modTime, aTime).
I can do this as part of this JIRA. Further cleanup of the API can be done as part of another JIRA. Do we have consensus now?

Raghu Angadi
added a comment - 29/Aug/08 01:08
+1 from me... in this jira or a different one does not matter. Sticking to POSIX-like when possible helps, since that interface has already gone through the arguments.
Actually I wanted to edit my comment to make it clear that I am not too opposed to setAccessTime()...

Konstantin Shvachko
added a comment - 29/Aug/08 01:11
> setAccessTime because it allows an application to set the access time to an arbitrary value.
This is exactly the reason why I am against introducing setAccessTime. No arbitrary values for access time. And therefore no utimes() for me.
I do not oppose setting aTime to the current time.

dhruba borthakur
added a comment - 29/Aug/08 01:22
Ok, from Raghu's and Konstantin's comments, I guess nobody is stuck on what the API looks like. Whether it is setAccessTime() or utimes(), everybody is ok with it. It appears that both Raghu and Konstantin are +1 on this one. Please let me know if this is not the case.
The point that is being discussed is whether utimes/setAccessTime allows setting the time to any user-specified value or whether it sets it to the current time on the namenode. I still vote for allowing a user to set any access time... this is what POSIX does, and it allows restore utilities to use a standard API. From Raghu's comments, it appears to me that he is +1 on it too.
> So to me it makes more sense to introduce a new create method with FileStatus of the existing file as a parameter or a
> copy method with an option to replicate FileStatus fields.
I do not like the idea of having a custom API as described above.

Konstantin Shvachko
added a comment - 29/Aug/08 01:49
Clarifying.
> setAccessTime() or utimes(), everybody is ok with it. It appears that both Raghu and Konstantin are +1 on this one.
-1 on both.
> I still vote for allowing an user to set any access time.
-1.
Seriously, it's like nobody is listening to others. Maybe we need a meeting.

Owen O'Malley
added a comment - 03/Sep/08 04:50 Just to add noise to the fire, I'm +1 to setAccessTime. I also think it is a very good idea to be able to configure it off at the namenode. My case for setAccessTime is that if you expand an archive or do distcp, it is really nice to be able to optionally set all of the times to match the copied files. That includes access time.

Allen Wittenauer
added a comment - 03/Sep/08 18:40
I think Konstantin's point is that the only "realistic" use case we can come up with for setting an arbitrary access time is during a file create (such as expansion of an archive or a distcp), therefore the API should reflect that use case since it keeps the number of routines small. Please correct me if I'm wrong.
To me, that sounds incredibly unclean. It would be better to allow for a separate utimes()-equivalent API so that a) if there is a use case later, it can be covered, and b) do we really want a call that is two operations-in-one?

Raghu Angadi
added a comment - 03/Sep/08 18:54 - edited
> a file create (such as expansion of an archive and a distcp), [...]
During create is not enough even for these use cases. Say distcp copies a 10GB file and sets the mod time at create time (to t - 1 month), and the last block is written 1 min later... then the mod time after distcp will be (t + 1 min) rather than (t - 1 month). -1 for extra options to create, or close, etc.
Why not just provide utimes(), since we are using POSIX as a tie breaker?
Another of Konstantin's points is that the FS should not allow setting a future time, which sounds ok... but it is just a file attribute to help users, not something the filesystem inherently depends upon. I don't see a need to police it that much, and since POSIX is the tie breaker we could just stick to its functionality. Note that in all the use cases we need to be able to set modtime too.

dhruba borthakur
added a comment - 03/Sep/08 22:32
Given that Raghu, Owen and Allen commented that it is better to follow the POSIX semantics of allowing a user to set either the access time or the modification time to any arbitrary value he/she likes, I changed my earlier patch slightly to add the following API:

/**
 * Set access time of a file.
 * @param p The path
 * @param mtime Set the modification time of this file,
 *              in milliseconds since Jan 1, 1970.
 *              A value of -1 means this call should not set modification time.
 * @param atime Set the access time of this file,
 *              in milliseconds since Jan 1, 1970.
 *              A value of -1 means this call should not set access time.
 */
public void setTimes(Path p, long mtime, long atime) throws IOException;

This mirrors the POSIX utimes call but follows the Hadoop naming pattern for method names. It allows setting the access time, the modification time, or both.
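The -1 convention in this API can be illustrated with a small self-contained model. The class and method below are hypothetical (not Hadoop code); only the "-1 means leave unchanged" semantics come from the proposed setTimes() signature.

```java
// Hypothetical model of the -1 convention in the proposed setTimes() API:
// a value of -1 leaves the corresponding time unchanged.
public class SetTimesModel {
    static long[] applyTimes(long curMtime, long curAtime, long mtime, long atime) {
        return new long[] {
            mtime == -1 ? curMtime : mtime,  // -1: keep modification time
            atime == -1 ? curAtime : atime   // -1: keep access time
        };
    }
    public static void main(String[] args) {
        long[] t = applyTimes(1000L, 2000L, -1L, 5000L); // set atime only
        System.out.println(t[0] + " " + t[1]);
    }
}
```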

dhruba borthakur
added a comment - 03/Sep/08 23:45
From my understanding (and this is what I have implemented in this patch), read access is required to be able to set the access time on a file, and write access is required to be able to set the modification time.

dhruba borthakur
added a comment - 04/Sep/08 06:28
Another option would be to allow changing access times and modification times only by the owner of the file and the superuser. But this patch does not do this. In this patch, "read access is required to be able to set access time on files; write access is required to be able to set modification time on files".
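The permission rule described in this patch can be sketched as a tiny self-contained check. This is a hypothetical model for illustration (the class and booleans are invented); it only encodes the stated rule: read access gates atime changes, write access gates mtime changes, and -1 means the field is not being set.

```java
// Hypothetical model of the permission rule in the patch:
// read access lets a caller set atime, write access lets a caller set mtime.
public class TimePermissionModel {
    static boolean maySetTimes(boolean canRead, boolean canWrite,
                               long mtime, long atime) {
        if (mtime != -1 && !canWrite) return false; // mtime needs write access
        if (atime != -1 && !canRead) return false;  // atime needs read access
        return true;
    }
    public static void main(String[] args) {
        // a reader may set atime but not mtime
        System.out.println(maySetTimes(true, false, -1L, 123L));
        System.out.println(maySetTimes(true, false, 456L, -1L));
    }
}
```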

Sanjay Radia
added a comment - 04/Sep/08 06:33
The main use cases are distcp and restore (or untar).
Konstantin raises 2 good points:
- Restrict to the create operation. In order to make this work, the time has to be applied at the close event; otherwise you run into the situation Raghu raises about the file taking a long time to write its last block.
- Restrict times to be <= the NN's current time. This could run into problems with distcp between two HDFS clusters with clocks out of sync.
While the extended create operation works for our use case, there are a few advantages to the utimes() approach:
- it handles other use cases we haven't thought of today;
- if we provide partial POSIX compatibility in the future, one could use POSIX restore/untar tools.
Hence I am in favour of:
FileSystem.utimes(path, modTime, aTime).
+1
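The distcp/restore ordering issue above can be sketched with a self-contained model (hypothetical class; not Hadoop code): closing a freshly written file stamps it with the current time, so a copier must call a setTimes-style operation after close to replicate the source file's times.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the distcp/restore use case discussed above:
// close stamps the file with the current time, so a copier must apply
// the source times AFTER close for them to stick.
public class CopyTimesModel {
    static Map<String, long[]> files = new HashMap<>(); // path -> {mtime, atime}

    static void writeAndClose(String path, long now) {
        files.put(path, new long[] { now, now }); // close stamps current time
    }

    static void setTimes(String path, long mtime, long atime) {
        long[] t = files.get(path);
        if (mtime != -1) t[0] = mtime;
        if (atime != -1) t[1] = atime;
    }

    public static void main(String[] args) {
        long srcMtime = 1000L, srcAtime = 2000L;   // times of the source file
        writeAndClose("/dst/foo", 999999L);        // copy finishes much later
        setTimes("/dst/foo", srcMtime, srcAtime);  // restore the source times
        long[] t = files.get("/dst/foo");
        System.out.println(t[0] + " " + t[1]);
    }
}
```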

Raghu Angadi
added a comment - 04/Sep/08 17:33 - edited
Edit: it's fine. getBlockLocations calls the internal dir.setTimes().
Regarding the setTimes() implementation: we should have a private setTimes that does not do security checks and audit logging, since the most common use is internal (as in getBlockLocations()). Security checks and logging are needed only when a user actively invokes setTimes(). Btw, should it be setUTimes()? I haven't looked at the rest of the patch thoroughly.

dhruba borthakur
added a comment - 04/Sep/08 18:56
Hi Raghu, thanks for reviewing this patch. The current patch does not do any additional security checks or audit logging while setting access times when invoked from getBlockLocations. When FileSystem.setTimes() is called, it checks access privileges and does audit logging. So, it behaves precisely the way you described in your comment. Please let me know if you have any additional comments.

dhruba borthakur
added a comment - 05/Sep/08 06:58
I created HADOOP-4077 to debate the access permissions that are required to invoke the setTimes API.
The failed test is TestKosmosFileSystem, and it has been failing for the last 5 builds. This failure is not caused by this patch.
The findbugs warnings are not introduced by this patch. I believe the test-patch process is getting confused while diffing the findbugs output on trunk with the findbugs output from this patch.
Konstantin has reviewed this patch earlier. Please let me know if somebody else wants to review it. I would like to get it committed by Friday, Sept 5, so that it can make it into the 0.19 release.