Since both of those subsystems have journalled crash recovery, as far as I can see this should not have occurred if the ordinary assumptions are met by the host machine (e.g. it would not occur on real hardware). Is VirtualBox disrespecting the flush semantics required by ext3fs, InnoDB (and probably any transactional system, such as reiser3fs)?

I.e. I am hypothesising that commits/flushes are being requested, but VB isn't requiring them to be made synchronously; hence an unexpected OS X shutdown violated assumptions in the filesystem/RDBMS design that would hold on real hardware. (The MySQL corruption was inconsistent metadata between .FRM and ibdata(?), which indicates to me that an I/O promise is being violated.)

This seems a very fundamental bug in VB, and would affect any installation hoping for virtualised filesystem/database integrity. Can anyone comment?

At point 4, ZFS (or any host filesystem) cannot be guaranteed to have committed any part of 1 or 3, or even that 1 and 3 were written in order. If VB were issuing a synchronous HOST flush at point 2, the ordering would be what the client filesystem/DBMS expects in order to maintain integrity.

The value [x] that selects the disk is 0 for the master device on the first channel, 1 for the slave device on the first channel, 2 for the master device on the second channel, or 3 for the slave device on the second channel. Only disks support this configuration option; it must not be set for CD-ROM drives.
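As a sketch of how that [x] numbering is used in practice (the VM name "myvm" is hypothetical), the disk selection maps onto the CFGM key path like this:

```shell
# Disk selection in keys of the form
# "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/...":
#   [x] = 0  master device, first channel  (primary master)
#   [x] = 1  slave device,  first channel  (primary slave)
#   [x] = 2  master device, second channel (secondary master)
#   [x] = 3  slave device,  second channel (secondary slave)
# List all extradata keys currently set on a VM to check your values:
VBoxManage getextradata "myvm" enumerate
```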

citation service provided free of charge by aquarius

Last edited by aquarius on 14. Mar 2009, 11:14, edited 1 time in total.

This thread still serves a purpose: I would seriously question whether ignoring flushes should be the default, as it's almost guaranteed eventually to corrupt databases and filesystems on any client if the host suffers a sudden interruption (as hosts always do, eventually, even if it's only a power outage: a laptop battery dies, a kernel panics; even colos have power outages).

Furthermore, this corruption will be unexpected to people who assume choosing journaled systems (ext3, reiser3, InnoDB, etc) would prevent this kind of corruption.

VB with flushes disabled is much less safe than real hardware: a power loss on real hardware won't corrupt a journalled system, but a power loss on a VB host will corrupt its clients.

Right, ZFS would only help WRT HFS+ consistency, and only if VB were honoring flushes.

Thanks for the post on the preference. This would be a really great option to expose in the GUI, since the odds of corruption are high with it off in many scenarios. If I were setting up a machine and saw a checkbox 'Honor guest disk sync requests', I'd always turn it on. You're right that the manual is the place to look to figure out why flushes aren't working, but the user is only likely to go looking after having experienced corruption. I'm assuming most VB users don't read the manual front-to-back before jumping in. That may be unwise, but I suspect the characterization applies to 95%+ of users, so where corruption is likely, surfacing the option in the GUI might help a large class of users.

I fully agree with qu1j0t3 and bill_mcgonigle that reading manual chapter 11.1.3 in the troubleshooting section in a post-mortem situation must seem like a bad joke to the faithful user, especially if he or she has taken care to enable all sorts of filesystem journaling and DB transaction precautions.

You need to read the source for src/VBox/Devices/Storage/DevATA.cpp and DevBlock.cpp. I don't think that the IDE flush operation does what you think. According to this:

Support for periodically flushing the VDI to disk. This may prove useful for those nasty problems with the ultra-slow host filesystems. If this is enabled, it can be configured via the CFGM key "VBoxInternal/Devices/piix3ide/0/LUN#<x>/Config/FlushInterval". <x> must be replaced with the correct LUN number of the disk that should do the periodic flushes. The value of the key is the number of bytes written between flushes. A value of 0 (the default) denotes no flushes.
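A minimal sketch of configuring that periodic flush (the VM name "myvm" and the 10 MB interval are hypothetical; LUN#0 is the primary master per the numbering quoted above):

```shell
# Flush the VDI to the host disk after every 10 MB (10485760 bytes)
# written to the primary master. A value of 0 (the default) means
# no periodic flushes.
VBoxManage setextradata "myvm" \
  "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/FlushInterval" 10485760
```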

Enable support for ignoring VDI flush requests. This can be useful for filesystems that show bad guest IDE write performance (especially with Windows guests). NOTE that this does not disable the flushes caused by the periodic flush cache feature above. If this feature is enabled, it can be configured via the CFGM key "VBoxInternal/Devices/piix3ide/0/LUN#<x>/Config/IgnoreFlush". <x> must be replaced with the correct LUN number of the disk that should ignore flush requests. The value of the key is a boolean. The default is to ignore flushes, i.e. true.
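To stop ignoring guest flush requests, the boolean can be set to false with setextradata. A sketch, assuming a VM named "myvm" and the primary master disk (LUN#0):

```shell
# Honour guest ATA flush requests on the primary master instead of
# ignoring them (ignoring is the default, i.e. IgnoreFlush = true).
VBoxManage setextradata "myvm" \
  "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0
```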

If supported, the ATA flush command invokes the HDD interface's pfnFlush method, which for a VDI maps to the VDIcore.cpp function VDIFlushImage. That flushes the VDI block index to disk and calls the low-level RTFileFlush function, which in turn calls the corresponding routine in src/VBox/Runtime/r3/posix/fileio-posix.cpp or src/VBox/Runtime/r3/win/fileio-win.cpp. The former does an fsync and the latter a FlushFileBuffers.

This all relies on the guest OS initiating an ATA flush when its file systems receive a sync request. Note that under normal circumstances the VB VMM runtime simply calls a stdio write (in the case of *nix) or WriteFile in the case of Windows. Hence there is no guarantee that the block order assumed by the guest OS file system will be preserved by the host OS; in general the host will apply elevator optimisation algorithms when writing to disk and therefore frustrate any ordering assumptions.

I haven't physically traced the code paths in ext3, MySQL, etc., to see what happens all the way down the stack. What I have observed, though, is an apparent violation of barrier semantics that is not going to occur on real hardware (absent a hardware fault or kernel bug). We know that the client subsystem code (e.g. ext3 or InnoDB) uses flushes/barriers at the appropriate times (or these systems would not be reliable on real hardware either).

I infer from your message that there may be a second problem: even when a flush is seen by VB, it may not be indicated to the host O/S in a way that will force ordering either. Obviously each host has different methods of making that happen.

The bottom line is that data integrity in the virtual client is extremely fragile unless those issue(s) are addressed, preferably by default.

[Quote is a useful tool for picking out specific points in earlier posts. Please avoid using it to quote the entire body of the immediately previous post; that adds nothing and just makes it more difficult for others to follow the thread.]

You should have qualified your statement that "the periodic flush is of no interest at all" with "to you in this current dialogue". It is useful context for the wider audience, which is why I included it. You're also wrong, BTW, as I discuss later.

One of the benefits of the FLOSS approach is that the source is open, enabling everyone to see for themselves what is going on. You can either download it as a complete set via the Downloads page, or the sources are also web accessible through the svn browser, so the current version of fileio-posix.cpp is available here, etc. As far as I can see, the I/O RTL uses the internal flag RTFILE_O_WRITE_THROUGH. This is forced on by setting the environment variable VBOX_DISABLE_HOST_DISK_CACHE (see init.cpp). However, in the case of POSIX I/O this is mapped onto O_SYNC for write-through filesystems, and kernel support for this varies. Google "O_SYNC O_DIRECT" for more discussion.
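A sketch of using that environment variable (based on the init.cpp behaviour described above; whether your kernel honours the resulting O_SYNC is a separate question):

```shell
# Force write-through (RTFILE_O_WRITE_THROUGH, i.e. O_SYNC on POSIX
# hosts) for VirtualBox's file I/O before launching the GUI.
export VBOX_DISABLE_HOST_DISK_CACHE=1
VirtualBox
```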

The whole issue of assured integrity is very problematic on PC platforms. PC hardware and server hardware (esp. within a top class data-centre) are very different. Over half the cost of a top class data-centre is in the power and environmentals with N+1 or N+2 redundancy in every subsystem. These are designed so that the H/W subsystems preserve or safely flush data to permanent storage in a lossless fashion. If you are running your VM on Mac hardware then you are kidding yourself if you think that you have such end-to-end assurance.

The biggest source of failure is sudden loss of power, in which case, due to the nature of power-down on IDE and SATA devices, there is a reasonable chance of I/O corruption and even hard sector corruption. [PC-type] hardware does not detect loss of power and perform an orderly shut-down of CPU and devices in the milliseconds left before the capacitors discharge. So do you use a UPS, for example?

The second biggest source of failure is an OS or driver hang, causing either a kernel panic or loss of access to the system, in which case the normal recourse is the power switch (which is why the periodic flush is useful: it may still be running, so at least your system has flushed all data to the devices in the time you take to decide to hit that reset).

When the Linux kernel panics, it just terminates without passing control to the drivers and file systems to try to maximise integrity. This is why you should run any "production" VMs on a stable and clean server-based system, not one that you are messing around with.

So this issue of integrity on PC-class systems isn't an absolute; it is one of probabilities. Using a synchronised file system, within a VM or on bare metal, typically more than halves disk performance, so for most people the issue is a trade-off. If, because of the factored probabilities, the MTBF is 2-5 years, do they want to halve I/O performance to push this out to 5-10 years, or is it just easier to spend $50-200 on a UPS?

Still this is a debate that is worth having because there is no single correct answer here.

2) I see nothing in your response responsive to my main point: is a flush/barrier request from the client O/S mapped to the correct host flush, or not? (Combined, of course, with the flushes required by intermediate layers.) If not, that's fatal to data integrity and introduces a failure mode generally absent from real hardware.

3) Does the IgnoreFlush setting affect that behaviour or not? If so, I question the default being ignore. If not, how can we fix the problem that we observed? And what caused it?

4) All four of your points about "PC reliability" highlight the reason WHY the flush is done in the first place. It is well known (in particular by Sun engineers) that some disk firmware out there does not respect flushes, and obviously in that case integrity cannot be assured by software. But in all other cases, sudden interruptions will NOT compromise integrity, provided a) the subsystem is designed with features such as journaling that recover cleanly from sudden interruptions (InnoDB, ext3fs, reiser3fs, etc.) and b) the subsystem is able to achieve synchronous flushes as part of that implementation.

So we have come full circle to my hypothesis, that VB is breaking that assumption, possibly causing the data corruption we observed. This is NOT a failure that can be blamed on "PC hardware"; it's a software failure - or, in the case of my friend's corruption and if VB is indeed making a best effort to pass the flush to the host (far from certain), then perhaps it is faulty drive firmware.

(Minor corrections: a journaled filesystem does not halve performance; in fact it's typically faster (in the case of reiser3 versus ext2, substantially faster). Journaled filesystems (ext3, reiser3, journaled HFS+) have been the default on Linux and OS X for many years, and ZFS is now the default on OpenSolaris, so it's hard to argue that people are avoiding them for performance reasons! Also, the issue I raised affects the integrity of every transactional system, including RDBMSs. And the idea that a data centre does not have power outages is fantasy - *every* colo I've used, no matter how many backup generators and batteries they had, has had unscheduled power outages. It is always going to happen. Which is why ignoring flushes is really dumb...)

Sigh. Toby, I am sorry, but I post on this forum to help users. I am not a Sun employee, so I do this pro-bono. I am trying to help here, not only for you but for others who might be interested in the dialogue. So I would really appreciate it if you would adopt a less combative response, and stop implicitly demanding that I answer Qs where you could easily find out yourself by following the links and hints that I offer.

Aquarius has already given you the syntax of the VBoxManage setextradata command to set the IgnoreFlush parameter to false. Have you tried this yet? Have you bothered to look at the source code, as I have done and pointed you to? Have you bothered configuring statistics to track the flush operations?

BTW, I was talking about any filesystem mounted with the sync option, thereby ensuring that any client app (in this case the VMM), and therefore the guest OS, maintains I/O order.
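A sketch of what I mean (the device and mount point are hypothetical; expect a substantial write-performance penalty, per the trade-off discussed earlier):

```shell
# Mount the filesystem holding the VDI images with the sync option,
# so every write from the VMM reaches the device synchronously and
# in order, regardless of how the guest issues flushes.
mount -o sync /dev/sdb1 /mnt/vdi-store
```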