Oracle Blog

When you buy an Exadata machine, you will probably use Oracle Cloud Control to manage the system and all the databases you are going to have inside.

In order to use it, you need to discover the Exadata Rack in Cloud Control.

The best approach is to run a prerequisites validation first, using the exadataDiscoveryPreCheck.pl script.

I have a customer trying to discover an Exadata Rack in Cloud Control 13c. The process was failing because of an issue validating the InfiniBand switches.

When executing the prerequisites validation with the exadataDiscoveryPreCheck.pl script, we found this issue:

Verifying Infiniband Switch version...
--------------------------------------
Verifying version for enk-iba1.us infiniband switch...
Could not invoke command version using SSH ===> Not ok
* Please check the password and host status.
Additionally please check that SSH is not blocked by a firewall.
Verifying version for enk-ibb1.us infiniband switch...
Could not invoke command version using SSH ===> Not ok
* Please check the password and host status.
Additionally please check that SSH is not blocked by a firewall.

When we tried to execute the version command manually from this host, it seemed to work correctly.

Parallel execution is an Oracle feature that allows one process to split the job it needs to do into several subprocesses. This job involves both I/O and CPU operations.
This feature is controlled by several parameters at instance level and the DEGREE value at segments definition level.
By default (12.1.0 included), Oracle defines the parameter PARALLEL_DEGREE_POLICY = MANUAL, and all segments are created with a Degree of Parallelism (DoP) of 1.
This means that, by default, Oracle executes all SQL serially.

When Parallel Execution is needed, a specific DoP can be manually set using three different ways:

1.- Set a fixed DoP at segment level:

ALTER TABLE SALES PARALLEL 6;

In this case, queries accessing the SALES table via full scan use a requested DoP of 6.
2.- Use the PARALLEL hint in the SQL statement.
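As an illustrative sketch (the table and column names here are assumptions, not taken from a real schema), a statement-level hint looks like this:

```sql
-- Request a DoP of 6 for this statement only; the alias in the hint
-- must match the table alias used in the query (SALES is hypothetical)
SELECT /*+ PARALLEL(s 6) */ s.prod_id, SUM(s.amount_sold)
FROM   sales s
GROUP  BY s.prod_id;
```

The advantage over ALTER TABLE is that the requested DoP applies only to this statement, leaving the segment's DEGREE in the dictionary untouched.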

But the question is: when do I need to execute a statement in parallel?
The answer always depends on the environment you are playing in.

Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do parts of the work at the same time. This work can be broken down into two types: I/O work and CPU work.

Here is where Exadata environments are different. When a full scan is used on Exadata, Smart Scan comes into action, using all cell nodes and doing all I/O in parallel.
So, does it make sense to use parallelism when the work for a SQL statement happens at the I/O level?
The answer is no, because there is a cost associated with coordinating the parallel execution servers, and the cost of this coordination may outweigh the benefits of parallelism.
So, under this premise, on Exadata parallelism makes sense when the amount of CPU used by the SQL statement is high enough to benefit from it.
This implies that the best option for enabling parallelism is to use the PARALLEL hint in your SQL, and only when it is really needed.

One last point: Tanel Poder explains here that the options above do not FORCE parallel execution, but rather just reduce optimizer cost estimates for full scans.

The name of this parameter always makes me think that, if it is set, the system will be protected against lost writes, i.e. that it will prevent them from occurring.

A data block lost write occurs when an I/O subsystem acknowledges the completion of a block write while, in fact, the write never reached persistent storage. Knowing that the I/O subsystem is not part of the Oracle Database code, you may wonder: "how is Oracle able to protect against a lost write when it is not responsible for the writing?" Well, in my opinion, the name of this parameter is misleading, because it does not prevent lost writes.

So why use it? When you have a lost write, the sooner you detect it, the better, and this parameter is very useful for detecting a lost write as soon as possible. Under some circumstances it will help you repair the lost write and, in the end, avoid losing any data. But it will not protect your system against lost writes.

In order to understand how this parameter works, we need to understand first how redo works.

When a database block change is generated by any DML operation, Oracle makes this change in two ways:

Logical: Oracle writes a description of the change that is going to be performed, including which block version is changing (the original SCN), when the change happens (the SCN of the change) and what changes are performed (the changes themselves). This information is stored in the redo log files and allows Oracle to re-execute the change when a recovery process takes place.

Physical: The changes are performed in the block itself.

Using the logical description of the change stored in the redo log files, during the recovery process and before applying the change to the block, Oracle verifies that the block about to be modified is the right one or, to be more precise, that it is the right version.

Now, what happens if it is not? Let's see an example in a Data Guard configuration: I'm going to simulate a lost write so we can see what happens when a new change for the block that suffered the lost write is applied on the standby database.

First, we identify a row and take a copy of the block before applying any change.
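The identification step can be sketched like this; the table and predicate are hypothetical, but DBMS_ROWID is the standard package for mapping a ROWID to its datafile and block:

```sql
-- Locate the datafile and block that hold the row (SCOTT.EMP is hypothetical)
SELECT dbms_rowid.rowid_relative_fno(rowid) AS file_no,
       dbms_rowid.rowid_block_number(rowid) AS block_no
FROM   scott.emp
WHERE  empno = 7839;
```

With the file and block numbers, a copy of the block can be taken (with dd, for example) before the change, so both versions can be compared afterwards.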

This means that, because of the way the software is designed, lost write detection is implicit in the code, but it only happens when a new change is performed on the block that suffered the issue.

So, what is the difference when DB_LOST_WRITE_PROTECT is set?

Well, the difference is that, when it is set, every SELECT statement generates a small redo record saying "I've read this block and it is this version" and propagates this information to the standby databases, allowing them to detect the lost write when the block is read. When this happens, you can restore and recover the block in the primary database, and this will fix the lost write.

Let's see it in action. We repeat exactly the same steps as before, but now after setting the parameter to TYPICAL.
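Setting the parameter is a one-liner; ideally it is set on both the primary and the standby:

```sql
-- Enable lost write detection; physical reads now generate extra redo
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL SCOPE = BOTH;
```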

DBAs are always concerned about the performance impact of features like db_block_checksum or db_block_checking. Cary Millsap told me a story: several years ago, he met a lot of people who had turned off redo logging with an underscore parameter. The performance they'd get from doing this would set the expectation level in their minds, which would cause them to resist (strenuously!) any notion of switching this [now horribly expensive] logging back on. Of course, it makes you wish that it weren't even a parameter.

Cary believes that the right analysis is to think clearly about risk. Risk is a "soft" word in most people's minds, but in finance they teach that risk is quantifiable as a probability distribution. For example, you know every day that it's possible for a disk drive in your system to go bad. The risk of that happening can actually be calculated. For disks, it's not too difficult, because vendors do those calculations (MTTF) for us. The probability that you'll wish you had db_block_checksum enabled today is probably more difficult to compute.

From a psychology perspective, DBAs would be happier if their systems were set to full or typical values to begin with. Then in response to the question, “Would you like us to remove your safety net in exchange for going between 1% and 10% faster? Here’s the horror you might face if we do it…” I’d wager that most people would say no, thank you. They will react emotionally to the idea of their safety net being taken away.

But with the baseline of its being turned off to begin with, the question is "Would you like us to install a safety net in exchange for slowing your system down between 1% and 10%? Here's the horror you might face if we don't…" I'd wager that most people would answer no, thank you (the opposite verdict!), because they will react emotionally to the idea of their performance being taken away. Anyone who has been burned by block corruptions before (which is my case, because I worked in the corruption team at Oracle Support) is going to favor the safety net argument.

But now the question is: "If I've never faced a corruption, my disks are mirrored and I have a good database backup strategy, why do I need to set these kinds of parameters that will impact my performance?" The answer is: "Because otherwise we cannot be 100% sure that Oracle will detect the corruption; it could provide you with wrong information and (the worst part) you won't realise it!"

By default, Oracle has two internal mechanisms to validate the health of a block from a physical structure point of view:

At the block tail, Oracle stores information that must match information stored in the block header. If these two pieces do not match, the block is reported as FRACTURED when it is validated, and an ORA-1578 error is raised if the problem is found. This validation cannot be disabled. You can find a good explanation of fractured blocks in MOS.

The other mechanism enabled by default is DB_BLOCK_CHECKSUM. In this case, DBWR and the direct loader calculate a checksum (a number computed from all the bytes stored in the block) and store it in the cache header of every data block when writing it to disk. Checksums are recalculated when blocks are read and compared with the value in the header. If they do not match, the block is reported as CHECKSUM corrupted and an ORA-1578 error is raised on the client side.

This DB_BLOCK_CHECKSUM functionality can be disabled to improve performance by between 1% and 5% for everything except the SYSTEM tablespace. But are you willing to assume the risk?
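Checking and changing the setting is straightforward (SHOW PARAMETER is a SQL*Plus command):

```sql
-- Check the current value
SHOW PARAMETER db_block_checksum

-- TYPICAL is the default; OFF disables the check for non-SYSTEM tablespaces
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = TYPICAL SCOPE = BOTH;
```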

Let's see an example and simulate a hardware failure in the middle of a block while DB_BLOCK_CHECKSUM is disabled.

We identify the rows stored in a block, in this case file 5, block 206. This is a tablespace with an 8k block size.

We have lost 7 rows and we have 1 row with no salary… and there is no error anywhere!

If DB_BLOCK_CHECKSUM had been enabled, we would have seen an ORA-1578 error and we could have restored the block from a backup.
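With the file and block numbers from the example, a single corrupted block can be repaired with RMAN block media recovery, without restoring the whole datafile (a valid backup of the datafile must be available):

```sql
-- From the RMAN prompt
RECOVER DATAFILE 5 BLOCK 206;
```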

In summary, as my colleague Gabriel Alonso (a.k.a. Gaby) says, disabling DB_BLOCK_CHECKSUM is like trying to land a plane at night, with no instruments, in the middle of a hurricane: risky, to say the least.