FSDirectory's fsync() is lenient

Details

Description

This method has a lot of problems:
1. it tracks 'stale files' as it writes (this seems pointless), and only actually fsyncs the intersection of that 'stale files' and the filenames passed as argument to sync(). So any bogus names passed to sync() are just silently ignored
2. if "something bad happens" (e.g. two indexwriters/dirs on the same path, or some other shenanigans), and the file is actually in stale files, but was say actually deleted on the filesystem, the underlying fsync() call will create a new 0-byte file and fsync that.

In my opinion we should do none of this. we should throw exceptions when this stuff is wrong.

Hi, because Simon Willnauer asked me: The javadocs of FileChannel#force(boolean) are not so nice, because they only garant that the changes made by this FileDescriptor are written to disk. In fact this is correct and also affects RandomAccessFile, although the javadocs are not so explicit (Javadocs of FileDescriptor class only talk about buffers owned by "this FD", but this is equivalent).

In fact, if you check the native C++ source code of OpenJDK, in fact, FileDescriptor#sync() and FileChannel#force(true) call exactly the same sycall: http://linux.die.net/man/2/fsync

fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

FileChannel#force(false) just calls fdatasync(), also explained on this man page. On Windows, it does the same like FileChannel#force(true).

Unless other JVMs implement this is a completely different way, both code paths are identical. In addition, the man page of fsync also states:

On some UNIX systems (but not Linux), fd must be a writable file descriptor.

This is why we need to open for write and because of that we can only create 0 byte files in Java 6, unless we check for existence before.

Please also note the following statement:

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

As far as I remember, in Java 7 we can also flush the directory, we should try this - if it works!

Uwe Schindler
added a comment - 09/Apr/14 21:16 - edited Hi, because Simon Willnauer asked me: The javadocs of FileChannel#force(boolean) are not so nice, because they only garant that the changes made by this FileDescriptor are written to disk. In fact this is correct and also affects RandomAccessFile, although the javadocs are not so explicit (Javadocs of FileDescriptor class only talk about buffers owned by "this FD", but this is equivalent).
In fact, if you check the native C++ source code of OpenJDK, in fact, FileDescriptor#sync() and FileChannel#force(true) call exactly the same sycall: http://linux.die.net/man/2/fsync
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
FileChannel#force(false) just calls fdatasync() , also explained on this man page. On Windows, it does the same like FileChannel#force(true) .
Unless other JVMs implement this is a completely different way, both code paths are identical. In addition, the man page of fsync also states:
On some UNIX systems (but not Linux), fd must be a writable file descriptor.
This is why we need to open for write and because of that we can only create 0 byte files in Java 6, unless we check for existence before.
Please also note the following statement:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
As far as I remember, in Java 7 we can also flush the directory, we should try this - if it works!
Addition: On Windows, all code in JNI impls for java.io and java.nio is finally delegated to FlushFileBuffers(HANDLE) : http://msdn.microsoft.com/en-us/library/windows/desktop/aa364439(v=vs.85).aspx . The windows kernel32.dll syscall also requires that the file is open for write.

I think this is worth fixing for 4.7.2; that esoteric case that could bring an empty file into existence won't matter in practice, and the confusion when IW unexpectedly brings 0 byte files into existence is really confusing.

Michael McCandless
added a comment - 04/Apr/14 23:57 I think this is worth fixing for 4.7.2; that esoteric case that could bring an empty file into existence won't matter in practice, and the confusion when IW unexpectedly brings 0 byte files into existence is really confusing.

Ok, I am fine with waiting for 4.8 . But this best effort patch is fine imo for the record. I don't want to do any sneakiness to meet some theoretical perfection for java 6. This is just a very practical thing, especially if you are the one scratching your head at said zero byte files.

Robert Muir
added a comment - 04/Apr/14 23:42 Ok, I am fine with waiting for 4.8 . But this best effort patch is fine imo for the record. I don't want to do any sneakiness to meet some theoretical perfection for java 6. This is just a very practical thing, especially if you are the one scratching your head at said zero byte files.
Lets leave 4.7 branch totally broken instead. Fine by me.

The whole method isnt atomic anyway, it takes Collection of files and must do them one by one

This was not meant to be atomic for al files. The issue here is: It is still possible to create an empty file: if the RAF readonly open is done successfully, then another thread deletes the file, before the RandomAccessFile is opened for write.

I am not sure if we should backport this hack to 4.7 branch. We should better release Lucene 4.8 on Java 7 and leave 4.7 as it is. The bug is not causing data corrumption, it just confuses if you are debugging strange things like issue LUCENE-5574.

Uwe Schindler
added a comment - 04/Apr/14 23:08 The whole method isnt atomic anyway, it takes Collection of files and must do them one by one
This was not meant to be atomic for al files. The issue here is: It is still possible to create an empty file: if the RAF readonly open is done successfully, then another thread deletes the file, before the RandomAccessFile is opened for write.
I am not sure if we should backport this hack to 4.7 branch. We should better release Lucene 4.8 on Java 7 and leave 4.7 as it is. The bug is not causing data corrumption, it just confuses if you are debugging strange things like issue LUCENE-5574 .

Robert Muir
added a comment - 04/Apr/14 19:23
I think the non-atomicness is fine here.
Ok ill make a 4.7.2 patch. The whole method isnt atomic anyway, it takes Collection of files and must do them one by one

Michael McCandless
added a comment - 04/Apr/14 19:22 In my opinion, the sync() should be done by the IndexOutput (e.g. before closing). But thats another issue to solve.
+1, we are on thin ice today. See LUCENE-3237 for some earlier discussion about this.
Also, we should sync the directory as well...

One more addition: I am not sure if the whole workflow, Lucene uses to force syncing are actually working on all operating systems like we have them. The javadocs of FleChannel and FileDescriptor only say that all changes made with this file descriptor resp. this FileChannel are written to disk:

If this channel's file resides on a local storage device then when this method returns it is guaranteed that all changes made to the file since this channel was created, or since this method was last invoked, will have been written to that device. This is useful for ensuring that critical information is not lost in the event of a system crash.

This method is only guaranteed to force changes that were made to this channel's file via the methods defined in this class. It may or may not force changes that were made by modifying the content of a mapped byte buffer obtained by invoking the map method. Invoking the force method of the mapped byte buffer will force changes made to the buffer's content to be written.

In my opinion, the sync() should be done by the IndexOutput (e.g. before closing). But thats another issue to solve.

Uwe Schindler
added a comment - 04/Apr/14 19:14 One more addition: I am not sure if the whole workflow, Lucene uses to force syncing are actually working on all operating systems like we have them. The javadocs of FleChannel and FileDescriptor only say that all changes made with this file descriptor resp. this FileChannel are written to disk:
If this channel's file resides on a local storage device then when this method returns it is guaranteed that all changes made to the file since this channel was created, or since this method was last invoked, will have been written to that device. This is useful for ensuring that critical information is not lost in the event of a system crash.
This method is only guaranteed to force changes that were made to this channel's file via the methods defined in this class. It may or may not force changes that were made by modifying the content of a mapped byte buffer obtained by invoking the map method. Invoking the force method of the mapped byte buffer will force changes made to the buffer's content to be written.
In my opinion, the sync() should be done by the IndexOutput (e.g. before closing). But thats another issue to solve.

Michael McCandless
added a comment - 04/Apr/14 19:14 Patch looks good.
Another approach for Java 6 (4.7.x) would be to use a "non-atomic" check to first try to open the file for read and if that works, close and reopen RW - both as RandomAccessFile).
+1
I think the non-atomicness is fine here.

WRITE does not zero the file, to empty it you need to use TRUNCATE_EXISTING or CREATE.
APPEND is just there to initially set the file pointer.

But in any case, +1 to commit.

We can only use this with Lucene 4.7.x if we use crazy reflection and detect Java 7. Another approach for Java 6 (4.7.x) would be to use a "non-atomic" check to first try to open the file for read and if that works, close and reopen RW - both as RandomAccessFile).

Uwe Schindler
added a comment - 04/Apr/14 19:05 Patch looks fine, only the APPEND is not needed:
FileChannel.open(fullFile.toPath(), StandardOpenOption.WRITE);
WRITE does not zero the file, to empty it you need to use TRUNCATE_EXISTING or CREATE.
APPEND is just there to initially set the file pointer.
But in any case, +1 to commit.
We can only use this with Lucene 4.7.x if we use crazy reflection and detect Java 7. Another approach for Java 6 (4.7.x) would be to use a "non-atomic" check to first try to open the file for read and if that works, close and reopen RW - both as RandomAccessFile).

Robert Muir
added a comment - 04/Apr/14 15:26 Upon researching, I think we should address Uwe's concern. So I think we should change the patch to do:
fFileChannel.open(f, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
But we cannot do this in 4.7.x (needs java7 apis). So if we want to backport we should do something else there.

Robert Muir
added a comment - 03/Apr/14 22:42
I am just a little bit sceptical if a sync() on a readonly file really does what the javadocs say (on all operating systems...).
I don't think thats an issue i need to test for: thats a jvm bug if it doesnt happen according to the javadocs.

after looking on the bug that lead to this issue I'd have appreciated to get a FNF exception rather than 0-byte files to begin with

Which bug?

About the patch: Looks fine. I am just a little bit sceptical if a sync() on a readonly file really does what the javadocs say (on all operating systems...). Can we test this somehow without using the power-switches of our computers? Maybe write a large file, sync it and while doing that check some /proc or /sys file to monitor how many unwritten buffers are there?

Uwe Schindler
added a comment - 03/Apr/14 22:38 after looking on the bug that lead to this issue I'd have appreciated to get a FNF exception rather than 0-byte files to begin with
Which bug?
About the patch: Looks fine. I am just a little bit sceptical if a sync() on a readonly file really does what the javadocs say (on all operating systems...). Can we test this somehow without using the power-switches of our computers? Maybe write a large file, sync it and while doing that check some /proc or /sys file to monitor how many unwritten buffers are there?

Robert Muir
added a comment - 03/Apr/14 22:37 I'll make a followup issue for the stale files map: I think its bogus for a number of reasons, and IW should track this efficiently. But its somewhat unrelated.

Robert Muir
added a comment - 03/Apr/14 22:28 Thats a good point Simon, i think its going to be a pain to deal with this stupid stale files map (to really fix the stupid leniency)
But as a start, we should fix fsync to not create new zero byte files under any condition. Here is a patch for that.

after looking on the bug that lead to this issue I'd have appreciated to get a FNF exception rather than 0-byte files to begin with. The trappieness of sync and the stale files map is one thing which we can fix in a different issue IMO. But the 0-byte files we should get fixed right away. I also thinkg that it might be ok to throw an exception if you wanna sync a file that was not written through this directory?

Simon Willnauer
added a comment - 03/Apr/14 22:04 after looking on the bug that lead to this issue I'd have appreciated to get a FNF exception rather than 0-byte files to begin with. The trappieness of sync and the stale files map is one thing which we can fix in a different issue IMO. But the 0-byte files we should get fixed right away. I also thinkg that it might be ok to throw an exception if you wanna sync a file that was not written through this directory?

Robert Muir
added a comment - 03/Apr/14 20:29 By the way i think we can go back to RAF actually, its fine to just open the file only for read to sync it. This is described in the javadocs for force().
I just want an exception if the file is bogus

This is because IW passes all files referenced by all segments when it commits(), i.e. we push the responsibility of remembering which files are written but not sync'd down to Directory. This used to be IW's responsibility, but we changed that in LUCENE-2328 I think.

I cant see this method being correct unless we also track syncedFiles, which implies an unbounded list!

Robert Muir
added a comment - 03/Apr/14 20:24
This is because IW passes all files referenced by all segments when it commits(), i.e. we push the responsibility of remembering which files are written but not sync'd down to Directory. This used to be IW's responsibility, but we changed that in LUCENE-2328 I think.
I cant see this method being correct unless we also track syncedFiles, which implies an unbounded list!
I think we should move this stuff back into IW...

if "something bad happens" (e.g. two indexwriters/dirs on the same path, or some other shenanigans), and the file is actually in stale files, but was say actually deleted on the filesystem, the underlying fsync() call will create a new 0-byte file and fsync that.

This is truly awful: sync of a non-existent file should not bring a 0-byte file into existence!

it tracks 'stale files' as it writes (this seems pointless), and only actually fsyncs the intersection of that 'stale files' and the filenames passed as argument to sync(). So any bogus names passed to sync() are just silently ignored

This is because IW passes all files referenced by all segments when it commits(), i.e. we push the responsibility of remembering which files are written but not sync'd down to Directory. This used to be IW's responsibility, but we changed that in LUCENE-2328 I think.

Michael McCandless
added a comment - 03/Apr/14 20:22 if "something bad happens" (e.g. two indexwriters/dirs on the same path, or some other shenanigans), and the file is actually in stale files, but was say actually deleted on the filesystem, the underlying fsync() call will create a new 0-byte file and fsync that.
This is truly awful: sync of a non-existent file should not bring a 0-byte file into existence!
it tracks 'stale files' as it writes (this seems pointless), and only actually fsyncs the intersection of that 'stale files' and the filenames passed as argument to sync(). So any bogus names passed to sync() are just silently ignored
This is because IW passes all files referenced by all segments when it commits(), i.e. we push the responsibility of remembering which files are written but not sync'd down to Directory. This used to be IW's responsibility, but we changed that in LUCENE-2328 I think.

Robert Muir
added a comment - 03/Apr/14 20:21 As far as this staleFiles, we can keep it if we need. but it should not be this silly retainAll() call.
we should record syncedFiles or something like that and be more careful.