August 22, 2011

Why is ACID important?

As the updated version of my last post stated, what Jonathan Lewis described as a “truly stunning observation” was uncovered: depending on your point of view, Oracle violates either the “I” in ACID or the “D”. I won’t repeat the technical description of the issue. This is already covered here and here.

The purpose of this blog entry is to explain why this is such a critical issue. There are three main areas I want to touch on (in increasing order of importance):

Let us begin with the human case. You call your broker late at night with instructions: “I hear that Oracle is not ACID. Please sell my 10 million pounds worth of Oracle stock!”. Your broker says, “OK, I have submitted your instructions for processing. It might take a few minutes”. You say…”I’ll hold. I am going on vacation tomorrow and want to make sure that this is done before I leave”. Your broker replies: “Let me query the system to check……oh yes your instructions are definitely in our system. Your transaction to sell Oracle at best will be actioned tomorrow morning”. You feel relieved. However, the system crashed after the broker was sent confirmation that your transaction was processed but before the transaction was on disk. Your request to sell Oracle stock is lost and you return from vacation to find that no record exists of your request.

This sort of scenario has probably happened a few times as a result of Oracle’s lack of ACID. However, nobody would have known why. On the other hand, it will be a very rare event. Even an entire call centre will only make at most a few queries per second and most of these will be for data that was committed long ago. It will be a rare event that sees queried data lost on a crash.

If this sort of situation has ever happened to you it is highly unlikely that Oracle’s lack of ACID was the cause. Bugs in application code, media recovery and human error amongst other causes would account for the vast majority of such mishaps.

However, let us turn our attention to computer to computer communication. Consider the following situation. You process some kind of incoming data and place the data in a staging table. Let us assume that the table is partitioned by hash into 256 segments. You have 256 producer threads that:

– read data from their own partition
– enrich it in a few ways (including adding a primary key generated from a sequence)
– send the enriched data to an AQ queue
– delete the data from the staging table
– commit.
– loop

Meanwhile, you have another group of sessions that read messages from the AQ queue and send data to a remote system that you have no control over. Your consumer threads:

– Destructively read a message from the AQ queue
– Insert the data into the remote system (using the remote systems transaction system)
– Commit the remote transaction
– Commit the local transaction.
– loop

You are a smart guy or gal and realise that there is an issue with your code: the system may crash after the remote transaction has committed and before the local one does. This means that after a recovery the message will be back in AQ and will be sent again. However, there is a unique primary key in the AQ message and the remote system is built in such a way as to detect these “possible duplicates” and discards the message when it is resent.

Wonderful, but because of the failure of ACID it is possible that not only the uncommitted consumers transaction is lost on a crash but also the “committed” producers transaction! Now on recovery, the message would not be in the AQ queue but back in the staging table. Now when the message is resent it will have a brand new primary key and the remote system will not detect the duplicate.

There are lots of systems where the log writer is the bottleneck, perhaps because of replication to a DR site or for other reasons. In these sorts of heavily loaded systems it is virtually guaranteed that at the time of a crash there will be some transactions that are sitting the log buffer waiting to be written out to disk. When automated tasks are processing transactions immediately after that have been committed recovery will create a problem nearly all the time.

What is so bad about this situation is that there is no easy way, short of a complete redesign, to fix the application to workaround this problem as Oracle provides no way for a session (other than the committing one) to wait for the data it is seeing to be on disk. Sure, the consumer threads could wait a few seconds before sending the data to the remote host but this would kill performance.

It is the potentially high rate that consumer threads can read “undurable data” that makes the probability of an error high in these situations.

The final, and most important, area is the potential legal and reputation issues that this topic raises. It will be very difficult for Oracle to claim that this is a documentation error and amend same to say that Oracle is NOT ACID. On the other hand, it is very difficult to see how they can treat this as a simple “bug” as it is a fundamental part of the design of the product and any “bug fix” would seem to have massive performance implications.

a) Is there a special guard in the AQ code to ensure that the commit has been made durable before it can be propagated – we don’t yet know.

b) For the problem you describe, the destination has to go through a number of steps of network traffic in the interval between the commit and the crash on the source system. Although (based on what we know so far) the inconsistency is still logically possible I think you would have to have a very sluggish source system for this to happen. This doesn’t reduce the significance of the problem, but it does have an impact on the probability of its occurrence.

Regarding the potential of a special guard in AQ, you are right of course. On the other hand, AQ is one of a large set of ways producer threads can communicate with consumer threads. They won’t all have guards.

I was definitely not suggesting two-phase-commit. 2PC guarantees that both sides of the transaction commit or neither do. The issue wouldn’t arise (unless Oracle sends a second phase commit message from the transaction coordinator to the other system before the coordinating commit is on disk….surely inconceivable!). What I was suggesting was two independent transactions (because the published API had no support for 2PC) with application logic to deal with the potential discrepancy.

Yes, you are right that network round trips may render the risk of data loss statistically improbable. However, many message queuing systems include a local agent with store-and-forward capabilities. You also need to consider what the cause of the supposed system crash might be. If it is related to the I/O subsystem then the LGWR might hang in the few seconds before the crash whilst network traffic continues unabated.

I once supported a transaction manager and have actually analysed a crash dump showing such a scenario in real-life.

Nevertheless, even I have to admit that these issues don’t arise every day. If they did, the issue would have before been brought to public attention long ago.

On probability: How much this is a real-world problem depends on what kind of issues can trigger this behaviour.

The testcase suspended the LGWR thread/process. What issues would suspend (not kill) the LGWR thread (I don’t know any except malicious code running on the server, but well, that could do anything inlcuding changing the contents of the REDO files and crashing the instance) or what other issues would result in LGWR not being able to write out to the redo file _after a substantial long delay_ (say long enough that another session can get the data in-doubt).
Everything I can think of (hardware I/O error, error on the OS layer) would result in LGWR failing and the instance would be be terminated very quickly.

Marcus, there are many many I/O failures that result in an I/O hanging or being delayed by a substantial amount of time, such as 1 second. In fact, the catalyst for this entire series was inspired by one recent experience I had with just such a faulty I/O subsystem. Cascading SCSI bus resets, single threading at the HBA, network issues across SAN DR links, the list goes on and on. And as I say, some of these issues substantially increase the probability of a crash.

> I was definitely not suggesting two-phase-commit. 2PC guarantees that both sides of the transaction commit or neither do.

You are, of course, right. I was so busy thinking about a “commit issued” transaction being seen from the outside while still being unrecoverable that I lost track of the fact that 2PC would return ONLY after the “log file sync”.

But then, an Oracle response to your scenario would be that you should be using 2PC. SInce the delete queue couldn’t commit until the “log file sync” the insert into queue/delete from stage would have been made durable by (at least) the same write.

“But then, an Oracle response to your scenario would be that you should be using 2PC.”

Well, yes if I was an Oracle PR man that might indeed cross my mind! And if the remote resource manager/API doesn’t support 2PC? Well, replace it with something from Oracle that does of course!

But apart from the likely impracticality of such a change in real life, it is often not possible anyway. Supoose I wrote my code in PL/SQL invoking the non-PL/SQL API through wrapper functions. How would I use 2PC? Even if the remote system was an Oracle database, the only way to get 2PC from PL/SQL is to use database links. DB links have numerous restrictions, such as the inability to call pipelined functions or to use non-default roles.

In the end, this type of response is probably not likely as it would be a tacit admission that 1PC doesn’t work.

However, we are drifting off the point somewhat.

The purpose of this blog is to demonstrate that this is not purely a hypothetical issue. It is a real technical issue and a very real political issue.

Of course, I may be being a bit emotional. When I wrote my log writer, Andreas Reuter personally reviewed the design (well, we discussed it for 5 minutes) and Phil Bernstein (who coined the term serializability) reviewed my code. The rules of ACID were drumbed into me rather religiously, which is why all this has come as a bit of a shock to me!

I wouldn’t want to down-play the significance of the fact that transactions are not guaranteed to be durable – but the example you have given is of an incomplete design anyway as it doesn’t answer the question “what if the failure of the first database corrupts the tail end of the online redo log file ?”

Your design doesn’t allow for incomplete recovery – that is a problem in its own right. Oracle’s offering to this situation is that it can fool you into thinking that complete recovery has occurred when it hasn’t.

Two thoughts have crossed my mind about the political issue. I first raised the issue of the pl/sql commit in loop not making transactions durable more than 10 years ago – and nobody seemed to care very much. Secondly, I find it hard to believe that none of Oracle’s competitors have spent time looking for details like this – something that you found by accident (sort of), but which a dedicated researcher should probably spot very rapidly – why hasn’t the competition found the problem and publicised it before ? (Perhaps they’ve found it, but don’t want to publicise it because there’s a conspiracy of silence on the point – and I never thought I’d become a conspiracy theorist!)

Incomplete recovery in a distributed system will always create inconsistencies no matter what. Even if I use 2PC and then perform incomplete recovery on one or both systems distributed consistency will be lost and there is no way to design an application to prevent the need for reconciliation in these circumstances. On the othe hand, handling “possible duplicates” is a long established way to ensure distributed consistency in the face of a straightfoward database crash.

In passing, I’ve spend years asking the (notionally rhetorical) question – why does Oracle advise against “shutdown abort” as a standard practice if the database is supposed to be able to recover from instance failure. My stock (rhetorical) answer was that the recovery code is the least frequently tested code and therefore a piece of code you don’t want to run frequently.

The correct answer appears to be that recovery won’t necessarily be complete even though the instance will claim that it has recovered completely.

This is my personal opinion and not official Oracle POV in any way, but I think the reason to “advise” against shutdown abort is that it leaves the database in a less stable (ie – inconsistent) state. I can even imagine some legal eagle getting involved in this discussion over some perceived liability issues. Though, to be honest, I’ve never seen a document that says that Oracle advises against shutdown abort….. do you know of such a document Jonathan or might that just be some bit of urban legend that actually does not exist? We all know in the real world shutdown abort is probably the most frequently used way to shutdown the database (I say this with absolutely no facts to back it up with of course).

I would argue your point that recovery code is the least frequently tested code. Some of it is tested with almost every database/instance startup (instance recovery, crash recovery) and if my personal experience is of any measure the recovery code itself (recover database or the RMAN restore/recover commands) get exercised by the user community alone with such frequency that I think this part of the code is probably well tested.

Now…. when it comes to people and understanding how backup and recovery actually works…. THAT is a whole other story. And the stories I could tell.

I believe that the concept of “durability” is a fallacy and perhaps even a dangerous concept that leaves us with our britches down sometimes. Conceive of the most robust, backed up, distributed, stand-byish, hardened, protected, secured infrastructure and I am sure that somewhere there is a single point of failure that can occur and destroy it all. It might be a fat fingered administrator, a script gone mad, a gopher who has a hunger for power lines or an irate employee. Durability is a myth. Oh, you can reduce the exposure all right, but at the end of the day, when the sun decides to expand into a red-giant… durability will be a moot point.🙂

You may not be able to believe this, but I have never shutdown an Oracle instance in any way other than shutdown abort except in cases where I had logging disabled. I always aimed to discover a scenario where issuing a shutdown abort would leave me with a corrupt database. That would be countless shutdowns since early 1990 pre-release Oracle 6. I too never understood the implicit double-speak in the warnings against shutdown abort. But then, anyone who still reads my blog knows that double-speak makes me crabby.