Oracle Data Warehouse Design and Architecture


Testing a No-statistics Environment.

I am absolutely sick and tired of dealing with issues where misleading partition statistics (for example) cause ETL queries to go crazy and try to sort-merge tables with tens of millions of rows each.

Since we’re about to run a test load on production data I’m going to drop all table and index statistics and lock them. OK, actually I’m going to back the statistics up first, and then I’m going to drop and lock them.
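The back-up, drop, and lock sequence can be sketched with standard DBMS_STATS calls; the `DW` schema and `STATS_BACKUP` table names here are hypothetical placeholders:

```sql
BEGIN
  -- Create a staging table and back up the existing statistics into it.
  DBMS_STATS.CREATE_STAT_TABLE(ownname => 'DW', stattab => 'STATS_BACKUP');
  DBMS_STATS.EXPORT_SCHEMA_STATS(ownname => 'DW', stattab => 'STATS_BACKUP');

  -- Drop the statistics, then lock them so nothing regathers them mid-test.
  DBMS_STATS.DELETE_SCHEMA_STATS(ownname => 'DW');
  DBMS_STATS.LOCK_SCHEMA_STATS(ownname => 'DW');
END;
/
```

If the test goes badly, IMPORT_SCHEMA_STATS from the same stattab (after unlocking) puts everything back.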


16 thoughts on “Testing a No-statistics Environment.”

I agree, big ETL with “wrong” statistics can be a real pain! On an old 9.2 DW I spent longer maintaining statistics than I did physically loading the thing!

But if you’re using the partition statistics, doesn’t that imply you are working with just one partition? So perhaps good old partition exchange is a possibility: trivial for an insert into an empty partition, slightly more challenging for updates (e.g. inserting the whole of the target partition into the exchange table before updating it and exchanging back into the partitioned table…)
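The exchange step itself is a single DDL statement; the table and partition names below are hypothetical:

```sql
-- Load and index SALES_STAGE, then swap it into the partitioned table.
-- The exchange is a dictionary operation, so it is near-instant regardless
-- of data volume.
ALTER TABLE sales
  EXCHANGE PARTITION p_20080101 WITH TABLE sales_stage
  INCLUDING INDEXES WITHOUT VALIDATION;
```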

There are some interesting issues surrounding partitioning in this system. The system is actually completely reloaded every day, rather than adding new data on a periodic cycle, and that made sense when it was a smaller temporary solution, less sense now that it’s loading hundreds of millions of rows every day. The partitioning is arranged so that multiple updates and inserts can run against the tables at the same time, so it’s really an enabler of manual parallelism.

It’s a bit odd, frankly. One of those systems with many issues that you’d refactor out of the ETL if the business was willing to dedicate development days to it.

I’ll be interested in your results. I had a quick squizz at some of the internals of dynamic sampling. On 10.2.0.3 it samples 32 blocks ( referred to as the default block size ), although I’ve yet to find the time to investigate whether this is the same for all dynamic sampling settings, or to confirm via 10046 how the block sampling occurs ( multiblock I/O? ).

Regardless, I’m interested to see whether this works well in an ETL environment. I’d expect that this could be a big performance hit, if for example your ETL uses literals in SQL ( it might do this to make use of histograms for example ), and therefore has to parse lots.

According to documented behaviour we’d be sampling 64 blocks for unanalyzed tables at level 4, which ought to take on the order of two tenths of a second. Based on that I’m very willing to go higher, to level 5 or 6, as the overwhelming majority of ETL queries take multiple tens of seconds to execute. I doubt that it’s multiblock I/O; in fact I hope it isn’t, as that would call the randomness into question on large tables, particularly date-partitioned ones.

I guess that those defaults are really estimates: the sampling is implemented with a SAMPLE clause, which is not really deterministic I think. I wouldn’t be surprised if in fact the 64 blocks was in some cases 58 and in others 69, though I haven’t tested that.
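For an ETL session the level can be raised without touching the SQL at all; this is a sketch assuming the load runs in its own dedicated session:

```sql
-- Raise the dynamic sampling level for this session only.
-- Higher levels sample more blocks (and apply in more situations),
-- trading a little parse-time work for better cardinality estimates.
ALTER SESSION SET optimizer_dynamic_sampling = 4;
```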

I have the same issue. We partition by month. When the first-of-the-month load happens, Oracle thinks there are “0” rows in the table. After it is loaded, when the batch process kicks off, it does a nested loop whereas it used to do a hash join the previous month. The hash join with hundreds of millions of rows finishes in an hour or so; the nested loop join takes 19 hours! I tried deleting and locking stats on the partition and it didn’t work. I tried importing stats from another partition, it didn’t work. I tried setting the partition stats by numrows and numblocks and it didn’t work. I tried increasing the high value, and it didn’t work. I am just going to force the ETL developer to gather partition stats on the first of the month after he loads the table. I don’t know what else to try.
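The “setting the partition stats by numrows and numblocks” attempt presumably looked something like this sketch (owner, table, partition, and the figures are all hypothetical):

```sql
BEGIN
  -- Manually assert row and block counts for the freshly loaded partition
  -- so the optimizer doesn't treat it as empty.
  DBMS_STATS.SET_TABLE_STATS(
    ownname  => 'DW',
    tabname  => 'FACT_SALES',
    partname => 'P_200801',
    numrows  => 100000000,
    numblks  => 1000000);
END;
/
```

As the thread goes on to discuss, even with plausible partition-level counts the plan can still go wrong if the global column high values tell the optimizer the predicate is out of range.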

One other interesting issue to look at is the optimizer mode — I found in my testing that an optimizer mode of “CHOOSE” gave me in some cases an RBO-based execution plan even in 10g, hence a bias towards nested loop joins. Now in the context of a partitioned table that caveat may not apply — you’re going to get CBO no matter what you do.

Did you try dynamic sampling for that? I honestly cannot recall a case where it failed to produce a good plan.

I could “HINT” dynamic sampling but I was trying to avoid it in order to let the CBO make the best decision. I deleted and locked stats on the partition and dynamic sampling was not used. Could it be the high value that causes the CBO to think this won’t return any rows, since the date they want is greater than what it “thinks” the high value is?
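For reference, the statement-level form of the hint looks like this; `fact_sales`, its alias, and the predicate are hypothetical, and the alias in the hint must match the one in the FROM clause:

```sql
-- Ask the optimizer to dynamically sample this one table at level 4
-- for this statement only, regardless of the session/system setting.
SELECT /*+ dynamic_sampling(f 4) */ COUNT(*)
  FROM fact_sales f
 WHERE load_date = DATE '2008-01-01';
```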

This suggests that the SAMPLE clause can be asked to look at a specific number of blocks, I’m guessing based on the setting of the param to which I referred. The doco also has some notes on the restriction of access paths that might be used ( includes FTS and IFFS as a possibility).

It also describes a mechanism to attempt to allow the DB to use the same sample from one execution to the next ( not clear what this means when the data may have changed, maybe it just accesses the same blocks, if available? )

All of which just adds some background, and the detail may just be academic.

Curious to find out what the behaviour is on your system, and any insights you might uncover as to how this all works.

I see the dynamic sampling hint differently to many others that you might indeed leave alone in order to let the optimizer make its own choice. By specifying a higher level of sampling than the system default you’re giving the optimizer permission to spend more time and use more resources specifically in order to make a better decision. That’s not a trade-off the optimizer is going to make on its own.

Whether it is the high value in the global statistics that causes the incorrect plan, or the partition statistics that say “no rows in here”, depends on whether the optimizer knows at parse time which single partition the rows are going to be in.

If you’re referring to the BLOCK clause, I believe that just changes the mechanism of identifying which data to sample — you are specifying the percentage of blocks to sample instead of the percentage of rows (or the probability of an individual row or block being sampled, depending on your POV), which may be appropriate and more efficient under some circumstances.
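The two forms side by side, with a hypothetical `big_table`; both take a percentage, but BLOCK applies it per block rather than per row, so it touches far fewer blocks on a large table:

```sql
-- Row sampling: each row has ~1% probability of inclusion.
SELECT COUNT(*) * 100 AS est_rows FROM big_table SAMPLE (1);

-- Block sampling: each block has ~1% probability of inclusion;
-- cheaper, but more sensitive to rows clustering within blocks.
SELECT COUNT(*) * 100 AS est_rows FROM big_table SAMPLE BLOCK (1);
```

That clustering sensitivity is the same randomness concern raised earlier for date-partitioned tables, where physically adjacent blocks tend to hold rows from the same load.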