D'oh! Now I know that Samba has to implement CIFS's locking requirements, and it uses Unix advisory locking for that feat. But being advisory (and I double-checked the file system is not mounted with mandatory locking), this doesn't prevent anyone from reading the file directly from the Linux file system, as does the Disk Agent in question. Some strace(1)ing showed it opens regular files with O_NONBLOCK and apparently does other things to explicitly take part in the advisory locking, even though that's highly dubious.

Of course I understand that just backing up files that are written by another process may end up with less-than-crash-consistent files in the backup, but that's a property of Unix file system backups since day one, well known and accounted for by everyone using that OS family. Why is the Linux DA overshooting here? Mandatory locking would be a reason, but not advisory. In the typical case of office files, the chance to actually grab something inconsistent is vanishingly small, given it's only locked for coordination purposes and actual editing happens to a temporary file anyway. I'd rather live with that slight chance of a broken file than with the entire file missing and throwing errors everywhere.
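
To make the "advisory" part concrete, here is a minimal C sketch (not taken from the Disk Agent or from Samba). It uses flock(2) instead of the fcntl(2) byte-range locks Samba would use, purely so the demo fits in a single process; both kinds are advisory. The function name and any path passed to it are made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a second descriptor is refused the
 * lock but can still read the data, 0 if not, -1 if setup failed.
 */
int read_despite_advisory_lock(const char *path)
{
    int holder = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (holder < 0)
        return -1;

    /* Write some data and take an exclusive advisory lock on it. */
    if (write(holder, "data", 4) != 4 || flock(holder, LOCK_EX | LOCK_NB) != 0) {
        close(holder);
        unlink(path);
        return -1;
    }

    int result = 0;
    int reader = open(path, O_RDONLY | O_NONBLOCK);
    if (reader >= 0) {
        char buf[4];

        /* A cooperating reader asks for the lock first and is refused. */
        int refused = flock(reader, LOCK_SH | LOCK_NB) != 0
                      && errno == EWOULDBLOCK;

        /* A non-cooperating reader simply reads; nothing stops it. */
        if (refused && read(reader, buf, sizeof buf) == 4)
            result = 1;
        close(reader);
    }

    close(holder);
    unlink(path);
    return result;
}
```

On a local file system I would expect this to return 1: the lock only matters to a reader that chooses to ask for it, which is the behavior I would like to be able to switch off in the DA.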

Re: Defensive locking on Linux?

We have similar issues with numerous servers. That was not a big problem when we were running DP5 up to DP6.20, but we upgraded last November to DP8 (now on 8.12), and since then we have had major issues:

We run enhanced incremental backups with consolidation. If such an error (a minor error) is encountered on a server, the following consolidation completely 'loses' the whole server from the consolidated backup. No error, nothing, but it is gone.

This is a major issue in my eyes, because unless you look really closely at the details of each backup report every day, there is no way to know that your server is gone from the backup. Of course, if you do not pick that up, and have to restore, you are stuffed!

I have a case opened since last year with HP about it, but progress is very slow, if you see what I mean!

I am trying in parallel to remove the cause of the problem, but it is not easy as I cannot reproduce it.

I have noticed that the errors are always caused by files that are probably opened from a PC using Windows/Samba, but I am not sure whether the PC is active (i.e. a user in a different time zone really working on the file) or passive (i.e. a user who left the file open and went home, with the PC perhaps in sleep mode).

Any suggestions for a workaround, or a way to reproduce it in order to pass the info to HP, would be great.

Re: Defensive locking on Linux?

I never got any feedback on this question, either in the ITRC forum or off-list. I also didn't actively search for a workaround. Most of the Samba boxes slowly went away (specifically when backup admins discovered how nicely VSS on a Windows server allows for user self-service recoveries and reduces their workload), and those who are still running them have accepted it as a Windows thing (locking has been an issue there for ages; it's just no longer true due to VSS again).

The situation is weird and I never really understood why it behaves that way even though locking is clearly not mandatory. I wasn't even aware O_NONBLOCK open(2) would lead to EAGAIN on a non-mand-mounted file system. To me it appears this is a well meant feature specifically implemented to cooperate with Samba locking, but it badly needs a switch to disable it - the locking behavior is not god-given like on Windows, it's under admin control, and I don't like software that tries to take this control away from me.

As to your issue with 8.x consolidation that turns this locking problem from just a nuisance into a data loss nightmare, this sounds really bad. It sometimes feels as if certain feature sets that were implemented some day (and were pushed as the latest and greatest by the sales folks) later fall out of fashion and start to bitrot, and nobody really likes admins like us who come out of the woodwork, declare that they actually used this stuff and are not amused about it breaking. Note I phrased this in a generic way; it's by far not just DP that shows this developmental problem. I would go as far as seeing the foreshadowing of a coming software crisis hidden in there, as the story is repeating all over the place with a lot of different products, FLOSS and proprietary alike. In short, I was fed up with all the issues around the feature set of consolidation, DFMF file libraries, enhanced incrementals and virtual synthetic fulls in an incremental forever chain, and eliminated that completely wherever I had used it before. The file libraries are now StoreOnce software stores, the backups are back to classic staggered full/incremental/differential cadences, and the issues are - ahem - different ones. But there are fewer of them, at least with 7.03 (I have so far just toyed with 8.x, so I hold my breath here - I just read more about issues than ever before).

As for a reproducer, the best idea would probably be a small C program (or even a perl script) that does the minimum file opening and locking which would trigger the behavior as seen from Samba, but without all the complexity. Of course the developers should know why DP does the open(2) dance in this specific way and why it was implemented with these precautions (I assume in true mandatory locking setups, without these precautions, the DA itself would block, which of course is a no-go and has to be prevented). Another way to close in on the issue would be hacking the Linux kernel and preventing it from returning EAGAIN on O_NONBLOCK opens of regular files when they aren't actually on a mandatory locking file system. This could even be considered a bug in Linux, given an open on such a file system would not block (it can only block with mandatory locking engaged) and as such, it should not return EAGAIN in the O_NONBLOCK case either. Given mandatory locking is a traditional unloved stepchild in the Unix world, a corner case bug like that may have developed without anyone ever noticing or (given we did notice it) correctly attributing it to a kernel misbehavior.
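
A possible starting point for such a reproducer, as a hedged C sketch. It rests on an assumption that nobody in this thread has confirmed: that smbd backs its oplocks with Linux kernel leases (the `kernel oplocks` setting), so that what the DA trips over is a lease and not a byte-range lock. fcntl(2) documents that an open(2) with O_NONBLOCK that conflicts with a lease fails immediately with EWOULDBLOCK, which on Linux is the same value as EAGAIN. The function name is hypothetical; the child process stands in for smbd and the parent for the DA.

```c
#define _GNU_SOURCE            /* for F_SETLEASE */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical reproducer. Returns 1 if the non-blocking open was refused
 * with EWOULDBLOCK/EAGAIN, 0 if it succeeded, and -1 if the lease could
 * not be set up (e.g. the file system does not support leases).
 */
int lease_blocks_nonblocking_open(const char *path)
{
    int ready[2], done[2];
    if (pipe(ready) < 0 || pipe(done) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the lease holder (stand-in for smbd, by assumption). */
        char ok = 0, buf;
        close(ready[0]);
        close(done[1]);
        /* The lease-break notification is SIGIO; its default action
         * would terminate this process, so ignore it. */
        signal(SIGIO, SIG_IGN);
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd >= 0 && fcntl(fd, F_SETLEASE, F_WRLCK) == 0)
            ok = 1;
        if (write(ready[1], &ok, 1) != 1)
            _exit(1);
        /* Keep the lease until the parent closes its end of the pipe. */
        if (read(done[0], &buf, 1) < 0)
            _exit(1);
        _exit(0);
    }

    /* Parent: the reader (stand-in for the DA). */
    close(ready[1]);
    close(done[0]);
    char ok = 0;
    int result = -1;
    if (read(ready[0], &ok, 1) == 1 && ok) {
        int fd = open(path, O_RDONLY | O_NONBLOCK);
        if (fd >= 0) {
            result = 0;
            close(fd);
        } else if (errno == EWOULDBLOCK || errno == EAGAIN) {
            result = 1;
        }
    }
    close(done[1]);            /* lets the child exit and drop the lease */
    close(ready[0]);
    waitpid(pid, NULL, 0);
    unlink(path);
    return result;
}
```

Called on a scratch file on a local file system, a return value of 1 would mean a lease alone is enough to produce the error, with no mandatory locking involved. Whether that is really what happens between smbd and the DA would still have to be checked with strace(1) on a live box.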

Re: Defensive locking on Linux?

Yes, it does not look like an easy fix at the Linux level at this stage. Unfortunately, we cannot replace all the Linux boxes with Windows, as only specific areas are accessible via Samba (i.e. for dumping Crystal reports for web pages, etc.).

The synthetic incremental forever technology has worked for us for many years. It is a shame it looks abandoned now.

I probably need to look at modernising our solution as you mentioned.

We have requirements to keep 'online' backups on disk for up to 1 week for fast restore (but not 7 copies!), and also to be able to get full backup tapes out of the building every day for DR purposes. After the consolidation, a tape copy is performed. That was working so well all these years!

With the StoreOnce software, I imagine I need to purchase additional licenses and/or hardware? Would it replace (or be traded against) the File Library licenses?

In your experience, can we achieve a similar outcome with that new solution?

Re: Defensive locking on Linux?

With the StoreOnce software, I imagine I need to purchase additional licenses and/or hardware? Would it replace (or be traded against) the File Library licenses?

In fact, no, and there are no "File Library licenses". Actually, if you have Advanced Backup to Disk (AB2D) licenses that worked for a DFMF File Library in a virtual full backup scheme, you have more than enough to replace that with SO software. Same for the physical disk space. The dedup really does magic there. For example, I had a customer up their local disk space and AB2D licensing from 4TB to 7TB so their virtual full cycle would last at least 7 days. After converting that to SO, they are now using 1.5TB and I bumped their backlog from 30 days to 8 weeks so as to at least make some use of the space (we also removed a lot of measures we had in place to prevent stuff from being backed up that's not absolutely needed, to save space in the DFMF FL). So the problems aren't going to come from that vector. But be aware that rehydration performance will absolutely demand as much RAM and as many spindles as possible, as the copy to tape performance will be the new golden calf you have to dance around. That's not that new either (DFMF used to fragment a lot), but it's a new quality.

In your experience, can we achieve a similar outcome with that new solution?

The biggest issue will be how to deal with the requirement of daily fulls on tape for external vaulting or DR. That is really a hard problem, given synthetic fulls were invented exactly for this purpose. Virtual fulls and incremental forever were also invented to lighten the load on your source servers. If you can stand daily fulls at the source, that would be a solution - SO will happily digest them all and dedup them to mostly nothing (you will reach fantastic dedup ratios this way), but the overhead is enormous and the rehydration becomes even more of a factor. If you cannot do daily fulls at the source, you probably have to persist in your HP case until they provide you a true fix for consolidation. You can do consolidation on SO, I've been told (when I tried to configure it with 6.21 I ran into a bug and just dumped the entire concept without looking back much, but that should have changed by now), but it has to work at least as well as it used to before one would rely on it again...

Re: Defensive locking on Linux?

I will need to do more reading and probably some testing before switching over. Doing daily full backups is out of the question. It brings down the whole network (about 1 TB over a 10 Gb LAN). I have split the backups between Windows and Linux servers, so I rarely have to run them both on the same night and it is 'only' 500 GB, but this will carry on growing.

I will probably wait until our server refresh later this year to test without interfering with the live backups.

On a positive note, I might have found a workaround for the Linux issue that is the cause of the problem with DP:

By adding these 2 lines in the smb.conf for the share, there have not been any errors for a week:

    oplocks = no
    level2 oplocks = no

However it might be a coincidence (users on holidays, or closing files for a change!) and I should wait a few more weeks to find out for sure.

Now, with a bit of luck the synthetic solution might be reliable again, touch wood.

I still need the HP lab to fix the consolidation, which crashes after 10 to 12 days and requires a full backup to get us going for another 10 to 12 days, but at least that one, whilst annoying, is easy to spot and act upon.

Re: Defensive locking on Linux?

Switching off OpLocks may indeed help (AFAIR it means the SMB server claims not to support them, so clients don't use them, so there would be no locking on the backend FS either). But did you check the typical Office situation of a user trying to open a file already in use by another user (or even themselves, yay terminal services)? If that doesn't recognize the lock, it may wreak havoc sooner or later. ISTR that OpLocks are an additional locking feature above and beyond standard locking, so I might see a problem here where there is none, but I'd test this at least superficially before it goes kaboom on me ;)