Parallel Execution Challenge – It’s About Time

8072012

July 8, 2012

As I was reading a book, I saw a comment essentially stating that with multiple CPUs/cores, parallel execution will be faster than serial execution of the same query. The book provided a test case – I decided to change around the test case a bit so that the results were a bit more fair, and then I performed a bit of testing using the sample database data that Oracle Corporation provides.

My testing was performed in Oracle Database 11.2.0.1 on 64 bit Windows with a quad core CPU that supports hyperthreading, and two SSD drives in a RAID 0 array (with all scripts executed on the server). So, is parallel query always faster than serial execution when the server has multiple CPUs/cores? But wait, that is not the interesting question… not the point of this parallel execution challenge, but still an interesting question. Here is the test script, which will be executed three times, with the results of the first execution thrown out so that we may eliminate the side effects of the hard parse:

The above calculation is easy to understand, and is explained in this article. The SORT ORDER BY operation in the parallel query execution plan added 31 to the calculated cost (308 – 277), while the same operation in the serial execution added 6,138 to the calculated cost (6637 – 499).

The first challenge is to describe why the calculated cost of the parallel execution plan is so small compared to the serial execution.

—

The timing output indicated that the serial execution required 3.13 fewer seconds than the parallel execution. Let’s look at the results of the third execution of this test case script:

The above shows the results of the parallel execution. 918,843 rows selected (not displayed in SQL*Plus due to the AUTOTRACE setting), 1,647 blocks read from disk, 1,802 consistent gets, an estimated cost of 308, and completed in 10.66 seconds. Notice that with the exception of the elapsed time, the statistics are identical to what we saw during the previous execution of the test script.

The above shows the results of the serial execution. 918,843 rows selected (not displayed in SQL*Plus due to the AUTOTRACE setting), 1,647 blocks read from disk, 1,718 consistent gets, an estimated cost of 6,637, and completed in 8.48 seconds. Notice that with the exception of the elapsed time, the statistics are identical to what we saw during the previous execution of the test script.

The second challenge is to describe why the serial execution completed at least 25% faster than the parallel execution of the same query – this result is counter to what the calculated cost of the execution plans suggest, and counter to what the book claimed.

—

Maybe we need to look at the result from a different angle. The book did not use SET TIMING ON as my test case above did, so let’s change the test case again. This time, we will use the GATHER_PLAN_STATISTICS hint so that the runtime engine will collect the timing and various other statistics, and then use DBMS_XPLAN to display the actual execution plan. To eliminate the effects of hard parsing, the script will be executed three times with the first execution thrown out. The new script:

The last of the above SQL statements simply retrieves the SQL_ID and CHILD_NUMBER for the two previous queries. For my results, I received the following output from the last of the above SQL statements:

Notice that the execution plan claims that the query completed in 0.70 seconds, that the calculated cost of the execution plan is (still) 308, that 28 blocks were read from disk, and 116 consistent gets were performed.

Notice that the execution plan claims that the query completed in 0.93 seconds, that the calculated cost of the execution plan is (still) 6,637, that 1,647 blocks were read from disk, and 1,718 consistent gets were performed.

The third challenge is to describe why DBMS_XPLAN claims that the parallel execution completed roughly 32.9% faster than the serial execution, when the timing output in SQL*Plus indicated that the reverse was true.Bonus: Why do the Buffers and Reads statistics in the top row of the execution plan for the serial execution plan exactly match the autotrace generated consistent gets and physical reads statistics, yet the same is not true for the execution plan for the parallel execution? Double Bonus: the execution plan claims that the query completed in 0.93 seconds, yet the timing information in SQL*Plus indicated that the query required an additional 7.55 seconds, why is there so much difference between the reported times (keep in mind that all testing was performed on the server, and the output was discarded without formatting and without displaying on screen)?

—

If cost is time, and time is money, what is the exchange rate for parallel execution? 😉

—

I have not fully worked the answers to the above challenge questions, and I will wait a couple of days before providing any feedback regarding proposed answers supplied by readers. I strongly suspect that some readers know without a doubt the answers to all of the above questions.

Some of my review notes from that book related to parallel execution follow:

* Various details of parallel processing (query, DML, and DDL) are described. (pages 489-513)
* “Parallel DML statements are disabled by default (be careful, this is the opposite of parallel queries).” Use a command such as “ALTER SESSION ENABLE PARALLEL DML” or “ALTER SESSION FORCE PARALLEL DML PARALLEL 4” to specify the default degree of DML parallelism. (page 503)
* The book states: “Parallel DDL statements are enabled by default.” (page 505)
* Disabling parallel execution at the session level does not disable parallel execution for recursive SQL statements. (page 509)
* Parallel execution should not be enabled unless there is sufficient CPU/memory/IO bandwidth available and the query requires at least a minute to execute. “It is important to stress that if these two conditions are not met, the performance could decrease instead of increase.” (page 509)
* Regarding the various parallel related hints, the book states: “It is very important to understand that the hints parallel and parallel_index do not force the query optimizer to use parallel processing. Instead, they override the degree of parallelism defined at the table or index level.” (page 509)

You should be able to find the “Troubleshooting Oracle Performance” book in the online Google books site, and read a couple of pages to see if the book it is worth your money to buy the book.

The parallel execution hints (and some of the parallel execution behavior) changed a bit in Oracle Database 11.2, so you might want to seek out articles that are specific to 11.2.0.1 and above, if that is the release version of Oracle Database that you are using.

The book that I am in the process of finalizing the second half of the review for only has a couple of pages on the topic of parallel execution – that book used a table level parallel hint, while my test case above used a statement level parallel hint.

Thanks for all these links Charles, much appreciated. I also enjoy very much your chalenges because they are simply very helpfull to learn again and again. regarding above challenges, I think i got the answers, but i prefer to think about it twice before posting something that would simply be stupide 🙂

Interesting? Remember that the PARALLEL hint does not force parallel execution, it only changes plan costing. Silly thought – could simply adding an ORDER BY clause cause a query to switch from serial execution to parallel execution? 😉

—

Let’s process the trace files with TKPROF. First, the large trace file containing “serial” in its name, just the section related to the test SQL statement:

The above “query” value of 116 matches the “Buffers” value found in the DBMS_XPLAN output for the parallel execution, and the “disk” value matches the “Reads” value found in the DBMS_XPLAN output. If you take another look at the DBMS_XPLAN output, you will see that the “Starts” column value for plan line IDs 2 through 7 is 0 (indicating that line in the plan was never executed), and the “A-Rows” column for those same plan line IDs is 0, just as the above execution plan shows. The “A-Time” value in the DBMS_XPLAN output does not exactly match the “time=” value shown above – of course enabling a trace is sufficient to increase the execution time, but the above indicates that the plan required 0.54 seconds, while the DBMS_XPLAN output suggests 0.70 seconds. I wonder if there is a clue here?

You might have noticed that I stated “the large trace file containing ‘parallel’ in its name” above. Let’s process the other trace files containing “parallel” in their names.
p003:

I do not know if these MOS (Metalink) documents apply, but I thought that the documents appeared to be interesting:
Bug 13040301: “CBO CHOOSE PARALLEL EXECUTION DUE CHEAPER COST BUT SERIAL IS RUNNING FASTER” – costing problem, related to views and partitioned tables
Bug 12339245: “OPTIMIZER IMPROPERLY USING PARALLELISM FOR QUERIES” – probably does not apply, makes no mention of cost

First problem I see with this query is fact that data distribution between parallel servers is totally incorrect. p001 receives almost whole dataset from both parallel servers performing table scans (900K rows), and almost whole sorting is performed by this server, while second sorting PX server (p000) sorts only 17K rows. This parallel (not-so-parallel?) sort operation, which is the most time-consuming operation of this query, won’t be performed faster than sort in serial execution. I suspect that PX coordination overhead is not so significant to make whole parallel query slower, so probably there are more issues here.

I was thinking something similar. I thought that people running other Oracle Database release versions might determine if they see the same behavior (a table level PARALLEL hint will be required if using 11.1.0.7 or below). The fish have not had any bites yet, so I tested Oracle Database 11.2.0.3 with patch 7 (released less than a month ago) with a fresh copy of the sample database schema.

In the sample database, I hardcoded the system statistics and left all parameters at the default (SGA_TARGET at 6000M, PGA_AGGREGATE_TARGET at 2000M):

the trace files claim not to use the same sorting mechanism v1 vs v2. but as I have forgoten most
of what I read about this on Jonthan Lewis blog I cannot jude if this makes a difference.

pure speculation, not tested anywhere, two theories about optimizer costing

1) There is something I never really understood in these and similar plans (maybe I should start reading some serious books)
If parallel slaves are (pre)sorting the data, is there not another sort or merge needed once they different sources are combined?
Is maybe only the merge/additional sort costed in the parallel case?

or

2) could it be that the optimizer thinks, the lower number of rows to sort per slave fits into memory,
whereas serial execution needs to spill to temp table space? hence more IO higher cost

I also wondered the same thing about the GATHER_PLAN_STATISTICS hint – maybe that is why the actual runtime statistics are not updated in the execution plan. I then recalled that the GATHER_PLAN_STATISTICS hint is roughly equivalent to setting STATISTICS_LEVEL=ALL in regard to execution actual runtime statistics, but the hint has a much lower performance impacting overhead than does changing STATISTICS_LEVEL from the default of TYPICAL to ALL.

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('5p47nbddfn0k7',0,'ALLSTATS LAST +COST'));
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('8k6bdqr09xc8u',0,'ALLSTATS LAST +COST'));
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('1vzjwgsacny45',0,'ALLSTATS LAST +COST'));

The execution plan for the parallel execution still does not include the runtime statistics for the parallel slaves:

Still to be determined is why the serial execution was costed with a sort area size of 1MB, while the parallel execution was costed with a sort area size of 40MB (the value of the _PGA_MAX_SIZE parameter as it appeared in the trace files) – that would be the likely explanation for the costing difference of the sort operation.

The comments in the article that you linked to above cover the third challenge of this blog article very well; your summarization of the comments from the article is well stated.

I did not have a good explanation for the behavior – I assumed that the query was executed once, not once per parallel server process. Thinking out loud, by what value will the EXECUTIONS and PX_SERVERS_EXECUTIONS columns of V$SQL (http://docs.oracle.com/cd/E14072_01/server.112/e10820/dynviews_3042.htm ) and V$SQLAREA (http://docs.oracle.com/cd/E11882_01/server.112/e17110/dynviews_3064.htm ) be increased when a query is executed in parallel: 1) The values in those columns will be increased by a value of 1 for each execution by the query coordinator; 2) The values in those columns will be increased by a value of 1 per parallel server process (and quey coordinator) involved in the query. Time for another test case script…

Hints for Posting Code Sections in Comments

********************
When the spacing of text in a comment section is important for readability (execution plans, PL/SQL blocks, SQL, SQL*Plus output, etc.) please use a <pre> tag before the code section and a </pre> tag after the code section:
<pre>

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;
TODAY
---------
01-MAR-12

</pre>
********************
When posting test case samples, it is much easier for people to reproduce the test case when the SQL*Plus line prefixes are not included - if possible, please remove those line prefixes. This:

SELECT
SYSDATE TODAY
FROM
DUAL;

Is easier to execute in a test case script than this:

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;

********************
Greater than and Less than signs in code sections are often interpretted as HTML formatting commands. Please replace these characters in the code sections with the HTML equivalents for these characters: