Collecting full statistics involves scanning the base table and performing a sort to compute the number of occurrences for each distinct value. For some users, the time and resources required to adequately collect statistics and keep them up-to-date are a challenge, particularly on large tables. Collecting statistics on a sample of the data reduces the resources and time required for statistics collection. This blog gives you some tips on using sampled stats.

Sampled stats, like full stats, are only collected when an explicit command is issued. That COLLECT STATISTICS command must include the phrase USING SAMPLE. The output of this sampling is kept in the data dictionary, and histograms are built, in the same manner as when full collection is done. The optimizer uses the sampled stats just as though they were collected in the non-sampled manner. So it is important that you validate that the output of USING SAMPLE produces plans just as good as full statistics do.
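For reference, here is a sketch of what such commands look like on Teradata 12/13; the database, table, and column names are hypothetical:

```sql
-- Collect sampled statistics on a single column.
-- Object names below are illustrative only.
COLLECT STATISTICS USING SAMPLE
    ON Sales_DB.Order_Detail
    COLUMN (Order_Id);

-- The same option applies to an index.
COLLECT STATISTICS USING SAMPLE
    ON Sales_DB.Order_Detail
    INDEX (Order_Id, Line_Nbr);
```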

When USING SAMPLE is specified, the system determines the percentage of the data to scan from a DBS Control parameter called CollectStatsSample, which defaults to 2% of the rows. The setting can be increased, and once it is, all statements that collect sampled statistics will use the new percentage value going forward.

For most columns, USING SAMPLE only reads the percent of the table that is specified in that DBS Control field, and skips the rest. It then calculates how many distinct values the full table has. Teradata 12 uses the original approach to sampling that came out in Teradata V2R5, while Teradata 13 uses a different approach that I have found to be more accurate. If you decided not to use sampled stats in the earlier releases because it was less accurate than you liked, give it another try when you get to Teradata 13.

Using the Teradata 12 sampling functionality, the best choices for USING SAMPLE are columns or indexes that are unique or nearly-unique. Nearly-unique refers to columns or combinations of columns which are over 95% unique.

If statistics were collected with USING SAMPLE, you will be able to validate that by looking at the histogram that is produced. Below is a subset of the fields contained within interval zero of a statistics histogram from Teradata 12, first with full statistics collected and then with USING SAMPLE collected at 2%. Notice that the “Number of Uniques” field is similar in both, making USING SAMPLE a good choice for this column. Also notice that the field “Sampled” contains a “1” when sampling was used.

You might want to use USING SAMPLE at 2% for your nearly-unique columns, but collect at a somewhat higher percentage, 20% or 30% for example, for columns with fewer distinct values. If you are not happy having all sampled statistics collected at the same percentage, there is a diagnostic command that allows you to change the sampling percent at the session level. This command temporarily overrides what is in DBS Control. Be aware, however, that if you recollect stats at the table level at a later time, the sampling percentage carried in DBS Control will be used automatically.

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=n" ON FOR SESSION;

Whatever number you place where 'n' is, that's your percent sample size. After stats are collected, I suggest you check the histogram in the "sampled percent" field for verification that this percent was used.

Here are a couple of examples I’ve used:

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=20" ON FOR SESSION;

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=50" ON FOR SESSION;

Whatever percent you collect sampled stats at, it’s a good idea to validate that the number of distinct values that sampling produced is approximately the same as the number of distinct values with full statistics collected. You can do this by first collecting full statistics, followed by a HELP STATISTICS command to see the number of distinct values that was produced. Remember that number. Then do your sampled stats collection, look at the HELP STATISTICS output again, and compare the two.
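As a sketch of that comparison workflow (the table and column names are hypothetical):

```sql
-- 1. Collect full statistics and note the number of distinct values.
COLLECT STATISTICS ON Sales_DB.Order_Detail COLUMN (Customer_Id);
HELP STATISTICS Sales_DB.Order_Detail COLUMN (Customer_Id);

-- 2. Drop and recollect with sampling, then compare the
--    "Unique Values" figure in the two HELP STATISTICS outputs.
DROP STATISTICS ON Sales_DB.Order_Detail COLUMN (Customer_Id);
COLLECT STATISTICS USING SAMPLE
    ON Sales_DB.Order_Detail COLUMN (Customer_Id);
HELP STATISTICS Sales_DB.Order_Detail COLUMN (Customer_Id);
```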

And as with everything having to do with statistics, getting good query plans is the best validation when you make a change.

A unique PI controls data distribution, prevents duplicates, and allows you to access a table directly and get a single row back. PPI serves a different purpose. It allows you to do a partial table scan instead of a full table scan when accessing a subset of a table. This can save significant CPU and I/O. You can read more about the advantages of PPI in the PPI orange book.

HOW DOES PPI CAUSE PERFORMANCE DEGRADATION WITH NUPI?

It is only in the case of access by NUPI that a PPI table may contribute to slower response time, if there are a large number of partitions defined. This is due to each partition needing to be probed for a potential row with that value. I have answered this question in detail in the December 2009 Teradata Magazine column that I wrote. Here is the link to that column:

Amazing article Carrie, thanks for sharing. I have recently started to use "Diagnostic helpstats on for session;" to get column recommendations for collecting stats. How useful is that? Also, I have seen people using column-level stats collection on top of index-level collection:

COLLECT STATS tbl_nm INDEX (clm_nm1, clm_nm2);
COLLECT STATS tbl_nm COLUMN (clm_nm1, clm_nm2);

Does it really make a difference? I thought that once stats are collected on the INDEX, one doesn't need to collect at the COLUMN level.

Diagnostic HELPSTATS results in a list of all possible statistics for a given request. Often it provides you with more choices than you will benefit from. It is usually not recommended that you apply all those suggestions, but rather that you selectively choose from the list the ones that you believe will help you the most.

In terms of collecting statistics at the column level on index columns, it really does not matter.

Collecting statistics on an INDEX versus the COLUMN(s) that make up the index is just a syntax difference. The collection and storage of the stats information is the same, and the impact on optimizer cost estimation is the same.
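To illustrate the equivalence (the object names here are hypothetical), these two statements produce the same stored statistics when (Store_Id, Item_Id) is defined as an index on the table:

```sql
-- Equivalent in effect: same histogram, same dictionary storage,
-- same influence on optimizer cost estimation.
COLLECT STATISTICS ON Retail_DB.Daily_Sales INDEX (Store_Id, Item_Id);
COLLECT STATISTICS ON Retail_DB.Daily_Sales COLUMN (Store_Id, Item_Id);
```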

If you collect while the index exists, Teradata may choose to scan the index rather than the table to do the COLLECT, regardless of whether you specify COLUMN or INDEX. If you collect stats on a set of columns and there is no matching index at the time, dummy rows (with IndexType M) are inserted in DBC.Indexes. But if a matching index exists when you first collect the stats, the information is stored in the existing DBC.Indexes rows, even if you used the COLUMN keyword in the COLLECT.
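If you want to see those dummy rows yourself, a query along these lines against DBC.Indexes should show them; the database and table names are hypothetical, and 'M' is the IndexType used for multi-column statistics placeholders:

```sql
-- Multi-column statistics with no matching index appear in
-- DBC.Indexes as dummy rows with IndexType 'M'.
SELECT  TableName, ColumnName, IndexNumber, IndexType
FROM    DBC.Indexes
WHERE   DatabaseName = 'Retail_DB'
AND     TableName    = 'Daily_Sales'
AND     IndexType    = 'M'
ORDER BY IndexNumber, ColumnPosition;
```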

The article on USING SAMPLE stats is very useful. BTW, I want to know the difference between FULL STATS and SAMPLE STATS. What is the advantage of FULL STATS when SAMPLE STATS alone is enough and beneficial? Thanks - Uma

When full statistics are collected, all rows in the table are read and are input to the stats collection process. When USING SAMPLE is specified, then only a percent of the table's rows are read. The default is 2%. If there is skew in the table, you may get inaccurate results from USING SAMPLE.

In Teradata 12.0, we recommended that USING SAMPLE be used only for unique or nearly unique columns or indexes, because accuracy was better in those cases.

But in Teradata 13.0 there are enhancements to the sampling algorithm that make it more accurate, so it can be used more widely. However, you are still better off using full statistics if skew exists in the column or index.

As I said in the original posting above, "it’s a good idea to validate that the number of distinct values that sampling produced is approximately the same as the number of distinct values with full statistics collected." If USING SAMPLE is providing reasonably good statistics, then it makes sense to use it. This will depend on the nature of your data, so you have to do some comparisons, and also consider whether the plans that rely on those sampled stats are good.

Carrie - I am just doing some testing on the costs of sample stats versus full stats for a given table. Full stats cost is 5500 CPU secs and sample stats cost only 616 CPU secs, so on the face of it a significant saving. However, full stats has a parallel efficiency of 64% whereas sample stats is only 9%, meaning Impact CPU time (i.e., CPU of busiest AMP x number of AMPs) is 10,000 CPU secs (full stats) versus 6,000 CPU secs (sample stats), which is far less of a saving. The relative parallel inefficiency of the sample stats produces less significant performance benefits than you might expect from the Total CPU Times. Is this what you expect, and can you explain why that relative inefficiency exists? It's the only table I have tested thus far, and I wondered if this would be consistent across the board or could be specific to the data in this particular table. Many thanks.

Most of the work of collecting stats is done in parallel across all AMPs. However, parts of it are on a single AMP. For example, the histogram is built by a single AMP. That is the case for both FULL and SAMPLE stats. In addition, with sampled stats the database has scale up logic to scale the sample stats to full stats, but this is a very lightweight function and shouldn’t cause much CPU overhead. If the sample percent is very low and the table is small, it is possible that the scaling and histogram building is causing a slight imbalance across AMPs, but I would expect that to be noise.

The bulk of the work collecting stats should be pretty even across all AMPs, unless your data rows are skewed. And the process should be more or less the same for FULL and SAMPLE stats. Check to see if the original table you are discussing has an uneven distribution of rows across AMPs. Also compare it with tables that are evenly distributed. You could also take sampled stats at a higher percent than the default of 2%, and see if the same phenomenon still takes place.

We are planning to implement sample stats on some of our large tables (10+ billion rows) to free up CPU used by doing full stats. The tables have a UPI on (location, item, date) and PPI on date. We will only do the sample stats on the UPI and keep the full stats on PARTITION and the date column. We've tested this on one table and were able to get 99.92% accuracy on the sample stats when compared to the full stats. We've also analyzed queries that access the table before (with full stats) and after (with sample stats), and performance is pretty much the same, which is what we were expecting.

We are planning to roll this out to 8 other tables with similar characteristics when we came across this text from the "SQL Reference: Data Definition Statements" V2R6.1 manual that says "Note that sampled statistics do not work well with PPI tables and should not be used to collect statistics for that table type." Can you please explain this further? We don't understand how the table being PPI has any relevance to sampled vs full stats. Doing this on 8 other tables with very similar characteristics would help free up a lot of CPU secs (at least 1 million) that we use to collect full stats on a weekly basis, so we'd really appreciate any input you could give. We are on V2R6.1, by the way.

It may be that you are reading less-than-current documentation. It used to be (not sure which release, but I think this is likely to be the case in the release you are on) that the database allowed you to collect sampled stats on the partitioning column of a PPI table. This caused problems because rows in a PPI table are ordered on disk first by partitioning column value. So if you have, for example, 100 partitions and 100,000,000 rows in the PPI table and you are sampling 2% of your data, the sampling process will read only the first 2% of the data based on how it is stored in the database. Since all rows with the same partitioning column value are stored together, and 2% of the rows is 2,000,000 rows while each partition holds about 1,000,000 rows, the sample covers only the first partition or two. So you would end up with stats that say there are only one or two distinct values in the table, instead of 100.

So in later releases (not sure which) you are not allowed to do that operation. You will receive an error message if you try to collect using sampling on the partitioning column of a PPI table. Also you must always collect full stats on PARTITION. All other columns are fine for sampling.

Just to repeat, if you are on V2R6.2, then DO NOT collect using sampled stats on PARTITION or on the partitioning column of a PPI table.

Thanks for the quick response. We are on V2R6.1. It looks like the caution is really on the PARTITION or partitioning column and not really on the table type. We should be ok then since we will only be doing sampled stats on the UPI (location, item, date). We will collect full stats on PARTITION and partitioning column.

What is the best way to extract the output of 'HELP STATISTICS DbName.TblName COLUMN PARTITION;' for the purpose of capturing the value of the 'Number of Rows' field? If you have attempted this through SQL, please share your thoughts. What I already have in place is the following methodology: run 'COLLECT STATISTICS ON DbName.TblName COLUMN (PARTITION);', then run 'HELP STATISTICS DbName.TblName COLUMN (PARTITION);' through BTEQ with sidetitles on, then extract the first 'Number of Rows' and capture the value for the field.

While that's one way to do it, I am looking for the most optimal way to do this. (We are currently on 12.x, not on 13.x yet, where SELECT COUNT(*) would provide the same speed.)

TeraAbe, unfortunately I have not attempted what you are asking about, either with SQL or by other means. You could put your question out to some of the other Teradata forums and see what others who need similar information are doing.

There are two reasons you might want to collect stats on a unique index.

1. You have queries that use the unique index as a predicate with a range constraint.

2. Under some conditions, statistics collected on the PI of a table can support more accurate statistics extrapolation processes. There is an orange book by Rama Krishna Korlapati that discusses extrapolation in detail and includes an explanation of this. It might be helpful for you to look there.

Highly unique columns are quite different from unique columns. You do need stats collected on highly unique columns if they are used as predicates or as join constraints, otherwise the optimizer will not know that they are highly unique and could produce a less than optimal plan.

The default for USING SAMPLE is 2%, it can be modified globally using a dbscontrol field, but it's one of the internal parameters, as you have noticed. Internal settings are designed to be changed only by Customer Support or the Global Support Center.

As an alternative, you can adjust the sample size at the session level using this diagnostic, replacing the "30" with any value under 100 that you choose:

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=30" ON FOR SESSION;

However, if you use the diagnostic to collect statistics with a non-default sampling percent, then if you recollect at the table level, the recollection will use the DBS Control setting (the 2% sampling default), not the specific sampling percent you specified when you originally collected stats.

To preserve the sampling percent used in the original collection based on a diagnostic command, specify the same diagnostic command within the session that does table-level recollections. The statistics collection process will apply that percentage, overriding the DBS Control default sampling setting, but only to those stats that were originally collected as sample stats, not all stats.

Your blog is really helpful. I have a question: do sample stats work with aggregate functions as well? If not, will full stats collection be required? Also, if a table has almost 5 million rows and a NUPI that I know is slightly non-unique (in other words, mostly unique with a small amount of non-uniqueness), then will increasing the sample stats percentage be helpful? Thanks, Santanu

If you are asking whether you can use sampling when you collect statistics on an expression (whether aggregate or not), the answer is yes; sampling for stats on expressions is supported. Statistics collection on expressions is covered in the 14.10 orange book on Statistics Enhancements.

In terms of NUPIs that are slightly non-unique, I would refer you to the blog posting on Statistics Collection Recommendations for 14.0, where it makes the point that if NUPIs are NOT USED FOR JOINING, and their distribution of values is fairly even, you could rely on random AMP samples. You could also probably try relying on sampling. I would guess that slightly non-unique NUPIs would be OK with sampling.

That blog posting does differentiate NUPIs that are USED FOR JOINING, and says this:

NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, then random AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:
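The original posting's SQL statement is not reproduced here, but a query of this general form measures NUPI uniqueness; the table and column names are hypothetical, and the CAST avoids integer division:

```sql
-- Approximate NUPI uniqueness: distinct PI values divided by total rows.
-- A result near 0.75 suggests random-AMP samples are adequate;
-- a value far above or below that favors collected statistics.
SELECT  CAST(COUNT(DISTINCT Cust_Id) AS DECIMAL(18,4))
        / COUNT(*)  AS PI_Uniqueness
FROM    Sales_DB.Order_Hdr;
```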

Thanks for your reply. Sorry for my late response. Your explanation is really helpful.

However, I have another question. For example, I have collected sampled stats on a table and am now going to perform an aggregation on a column from that table. Will the sampled stats play any part in the operation, or do sampled stats not work in such a situation?

You cannot actually apply USING SAMPLE to summary stats, which are the table-level statistics new in 14.0. I'm guessing you meant you were collecting sampled stats on a column or index of the table.

When you attach USING SAMPLE to a collect stats statement and then collect stats on that column, the resulting histogram will be treated the same way as if full stats had been collected. The sampling option only has an impact during the collection of statistics, not on how those stats are used. If sample or full stats have been collected on the GROUP BY column(s), the optimizer will use the number of distinct values in the histogram to estimate the row count that will come out of an aggregation step.

Whether or not sampled stats are as beneficial as collecting full stats in your case will depend on the degree of skew in the GROUP BY column. You could run an explain and look at the query plan row count estimate with sampled stats, then collect full stats and see if the estimate has changed very much.
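A sketch of that before-and-after comparison (object names hypothetical); the estimated row count for the aggregation step appears in each plan:

```sql
-- With sampled stats in place, note the estimated row count
-- for the aggregation step in the plan:
EXPLAIN
SELECT  Product_Id, SUM(Sales_Amt)
FROM    Retail_DB.Daily_Sales
GROUP BY Product_Id;

-- Then recollect full stats and re-run the same EXPLAIN
-- to see whether the estimate changes very much:
COLLECT STATISTICS ON Retail_DB.Daily_Sales COLUMN (Product_Id);
EXPLAIN
SELECT  Product_Id, SUM(Sales_Amt)
FROM    Retail_DB.Daily_Sales
GROUP BY Product_Id;
```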

If you are on 14.10, then this blog posting is no longer relevant. This posting is based on Teradata 13.10. Please read the orange book titled Teradata 14.0 Statistics Enhancements, as it explains the new way of requesting sampled statistics starting in 14.0. There is also a blog posting that I wrote that is more current than this one that discusses sampling, titled: New Opportunities for Statistics Collection in 14.0.

Don't use the diagnostic command. Instead add a USING SAMPLE n PERCENT on the collect statistics command itself. For example, do this:
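The original example is not reproduced here, but the 14.0-style statement looks along these lines; the object names and the 10 percent figure are illustrative:

```sql
-- Teradata 14.0+ syntax: the sampling percent is part of the
-- statement itself, so it is remembered for later recollections.
COLLECT STATISTICS
    USING SAMPLE 10 PERCENT
    COLUMN (Order_Id)
    ON Sales_DB.Order_Detail;
```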

The meaning of the "SAMPLESIZE" diagnostic was changed in 14.10. It is no longer used to set an exact sample percent, so you should consider that diagnostic obsolete. As I said in my last response to you: don't use the diagnostic command. Instead, add a USING SAMPLE n PERCENT on the collect statistics command itself.

Starting in 14.10, this diagnostic is used as an internal tunable for Teradata support personnel who are trying to debug a particular situation. It should not be used by customers.

It is possible for some stats on a table to have been collected with the USING NO SAMPLE, and other stats not. When SHOW STATS at the table level displays the phrase “-- default SYSTEM SAMPLE PERCENT” it indicates that the SYSTEM SAMPLE option may be considered for some of the statistics that have been collected on that table. I would suggest you issue a SHOW STATS COLUMN statement instead of the table level SHOW STATS and see if it displays the NO SAMPLE text for the columns with USING NO SAMPLE. Double-check all the collect stats statements for that table and the SHOW STATS output at the column and the table level, and if you still think there is something not right, open an incident with the support center.

Once you specify USING NO SAMPLE on a column, NO SAMPLE should be displayed in the output of SHOW STATS COLUMN (column_name). Here is how it should look:
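The original screenshot is not reproduced here, but the output should run along these lines; the object names are hypothetical:

```sql
SHOW STATISTICS COLUMN (Order_Id) ON Sales_DB.Order_Detail;

/* The output echoes the collect statement, including the option:

   COLLECT STATISTICS
             USING NO SAMPLE
             COLUMN ( Order_Id )
         ON Sales_DB.Order_Detail;
*/
```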

I am going to comment out the "DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=20" ON FOR SESSION" and modify the table's stats with USING NO SAMPLE and USING SAMPLE N PERCENT.

I have also verified the SHOW STATS at the table level and the column level. Both reflect the same thing.

Until now, I believe SAMPLESIZE didn't work for us, as we are using "DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=20" ON FOR SESSION", and I tried to understand what percentage value "-- default SYSTEM SAMPLE PERCENT" used.

I queried the DBC.StatsV view in the way below; for the columns with "-- default SYSTEM SAMPLE PERCENT", the SampleSizePct is NULL ==> "Not Sampled".

So we are getting full stats every day for those columns when we run full stats on the table. Is my understanding correct?

It is very possible to have "-- default SYSTEM SAMPLE PERCENT" and a SampleSizePct of NULL ==> "Not Sampled". All that "-- default SYSTEM SAMPLE PERCENT" tells you is that the stat will be CONSIDERED for sampling; the optimizer decides whether that happens, and if it does, at what percent.

So it is an erroneous expectation that sampling will take place on a given statistic when relying on system sampling. If you want sampling to take place, specify USING SAMPLE N PERCENT on the collect statistics statement.

In order for sampling to take place when system sampling is being relied upon, several conditions must all be present:

System threshold must be off

DBQL usecount logging must be on

There must be adequate history records

The optimizer must be able to detect predictable patterns in growth over time when comparing the history records

And only the SummaryInfo part of the histogram must have been used by the optimizer when it accesses this statistic. That is, the usagetype in the histogram's SummaryInfo section must be 'S' (for SummaryInfo use only), not 'D' (for using the detailed histograms).

I am sorry, but I am having difficulty understanding your second question, when you say:

================

So we are getting full stats every day for those columns when we run full stats on the table. Is my understanding correct?

================

If you are saying that you wanted sampling to happen but got full statistics instead, the solution is to specify USING SAMPLE N PERCENT as mentioned above.

Thanks, -Carrie
