3
What Are These Column Level Statistics Used For?
>The Histogram Values
>Describes the distribution of values in the column
>Belongs to a column, not an index
>Used in costing SARGs
>A step is the point in the column where a value is read to obtain a boundary value
>A cell represents the rows that fall between two steps
>Each cell has a weight, which is the fraction of rows in the column it represents (read it as a percentage of all rows)
>There are approximately the same number of rows in each cell, except Frequency count cells

6
Statistics On Inner Columns of Composite Indexes
Think of a composite index as a 3D object: columns with statistics are transparent, those without statistics are opaque
>Columns with statistics give the optimizer a clearer picture of an index, which is sometimes good and sometimes not
>This is a fairly common practice
>Does add maintenance
>update index statistics is the command most commonly used to do this
update index statistics tab_name [ind_name]

10
Statistics On Non-Indexed Columns and Joins
Statistics on non-indexed columns can't help with index selection, but they can affect join ordering
>Columns with statistics give the optimizer a clearer picture of the column, so no hard-coded assumptions have to be used
>When costing joins of non-indexed columns, having statistics may result in better plans than using the default values
>Without statistics there is no Total density or histogram that the optimizer can use to cost the column in the join
>In some circumstances histograms can be used in costing joins: if there is a SARG on the joining column and that column is also in the join table, the SARG from the joining table can be used to filter the join table
>If there is no SARG on the join column or on the joining column, the Total density value (with statistics) or the default value (without statistics) will be used

12
Statistics On Non-Indexed Columns and Joins - Example
select * from TW1,TW2 where TW1.A=TW2.A and TW1.A =805975090
A simple join with a SARG on the join column of one table
Table TW2 column A has no statistics, TW1 column A does
Selecting best index for the JOIN CLAUSE: (for TW2.A)
TW2.A = TW1.A
TW2.A = 805975090
(Inherited from the SARG on TW1, but it can't help: no stats)
Estimated selectivity for A, selectivity = 0.100000.
The best qualifying access is a table scan, costing 13384 pages, with an estimate of 50000 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement
Join selectivity is 0.100000.
The inherited SARG from the other table doesn't help in this case

13
Statistics On Non-Indexed Columns and Joins – Example cont.
Without statistics on TW2.A the plan includes a reformat with TW1 as the outer table
FINAL PLAN (total cost = 2855774):
varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION
varno=1 (TW2) indexid=0 () path=0xfbd0bb10 pathtype=join method=REFORMATTING
>Not the best plan, but the optimizer had little to go on

14
Statistics On Non-Indexed Columns and Joins – Example cont.
Table TW2 column A now has statistics. The inherited SARG on TW1.A can now be used to help filter the join on TW2.A
Selecting best index for the JOIN CLAUSE:
TW2.A = TW1.A
TW2.A = 805975090
Estimated selectivity for A, selectivity = 0.001447, upper limit = 0.052948.
The best qualifying access is a table scan, costing 13384 pages, with an estimate of 724 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement
Join selectivity is 0.001447.
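The row estimates in the two traces can be sanity-checked with simple arithmetic. The sketch below assumes the estimate is selectivity times the table's row count. TW2's row count is not stated on the slides, so it is inferred from the no-statistics trace (50000 rows at the default selectivity of 0.1), and the rounding rule is also an assumption.

```python
import math

# Assumption: estimated rows = selectivity * total rows in the table.
# TW2's row count is not given on the slides; it is inferred here from the
# no-statistics trace (50000 estimated rows at the default selectivity 0.1).
DEFAULT_SELECTIVITY = 0.100000
ROWS_WITHOUT_STATS = 50000
inferred_table_rows = round(ROWS_WITHOUT_STATS / DEFAULT_SELECTIVITY)


def estimated_rows(selectivity, table_rows):
    """Rows expected per scan, rounded up (the rounding rule is an assumption)."""
    return math.ceil(selectivity * table_rows)


# Selectivity taken from the trace produced once TW2.A has statistics.
rows_with_stats = estimated_rows(0.001447, inferred_table_rows)
print(inferred_table_rows, rows_with_stats)
```

Under these assumptions the result agrees with the 724 rows reported in the trace, which suggests the lower selectivity is what shrinks the estimate even though the access method is still a table scan.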

16
The Effects of Changing the Number of Steps (Cells)
The number of cells (steps) affects SARG costing: as the number of steps changes, costing does too
Cell weights and Range cell density are used in costing SARGs
>Cell weight is used as the column's upper limit
>Range cell density is used as the selectivity for equi-SARGs, as seen in traceon 302 output
>The result of interpolation is used as the column selectivity for range SARGs
>Increasing the number of steps narrows the average cell width, so the weight of Range cells decreases
>It can also result in more Frequency count cells and thus change the Range cell density value
>More cells means more granular cells

17
The Effects of Changing the Number of Steps (Cells) cont.
Average cell width = # of rows/(# of requested steps - 1)
>Table has 1 million rows
>Requested 20 steps: 1,000,000/19 = 52,632 rows per cell
>Requested 200 steps: 1,000,000/199 = 5,025 rows per cell
>What does this mean?
>As you increase the number of steps (cells) they become narrower, representing fewer values
>We'll see that this has an effect on how the optimizer estimates the cost of a SARG
>update statistics ……. using X values
>create index ….. using X values
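The average cell width formula is easy to check directly. This is a minimal sketch of the arithmetic only; the function name is made up for illustration and is not part of ASE.

```python
def average_cell_width(total_rows, requested_steps):
    """Average rows per cell: rows / (requested steps - 1), per the formula above."""
    if requested_steps < 2:
        raise ValueError("at least 2 steps are needed to form one cell")
    return total_rows / (requested_steps - 1)


# The 1 million row table from the slide, at 20 and at 200 requested steps.
for steps in (20, 200):
    print(steps, round(average_cell_width(1_000_000, steps)))
```

Going from 20 to 200 requested steps cuts the average cell to roughly a tenth of its former width, which is why each cell then represents far fewer values.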

19
The Effects of Changing the Number of Steps (Cells) cont.
With 200 cells (steps) in the histogram
Range cell density: 0.0002303825911991
77 0.00507200 <= 839463989
78 0.00506000 <= 842019895
>SARG value falls into cell 78
Estimated selectivity for B, selectivity = 0.000230, upper limit = 0.005060.
In this case more cells result in a lower estimated selectivity
>Increasing the number of steps has decreased the average cell width and lowered the Range cell density and the average cell weight
>Range cell density decreased because Frequency count cells appeared in the histogram
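How the two numbers in the trace relate to the histogram can be sketched as a lookup: find the cell whose upper boundary covers the SARG value, then report the Range cell density as the selectivity and that cell's weight as the upper limit. The SARG value used here (840000000) is hypothetical, picked only because it lands in cell 78, and the function is an illustration rather than the optimizer's actual code.

```python
from bisect import bisect_left

# Two adjacent cells copied from the 200-step histogram above:
# (step number, weight, upper boundary value). This is only an excerpt,
# so values below cell 77's lower bound are not handled correctly.
CELLS = [
    (77, 0.00507200, 839463989),
    (78, 0.00506000, 842019895),
]
RANGE_CELL_DENSITY = 0.0002303825911991


def cost_equi_sarg(value):
    """Return (step, selectivity, upper_limit) for an equality SARG."""
    uppers = [upper for _, _, upper in CELLS]
    idx = bisect_left(uppers, value)  # first cell whose upper bound >= value
    if idx == len(CELLS):
        raise ValueError("value is above the cells listed here")
    step, weight, _ = CELLS[idx]
    # Selectivity comes from Range cell density, upper limit from cell weight.
    return step, RANGE_CELL_DENSITY, weight


# 840000000 is a hypothetical SARG value chosen to land in cell 78.
step, selectivity, upper_limit = cost_equi_sarg(840000000)
print(step, f"{selectivity:.6f}", f"{upper_limit:.6f}")
```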

20
The Effects of Changing the Number of Steps (Cells) cont.
Changing the number of steps: effects on Range SARGs
select * from TW2 where B between 825570000 and 830000000
With 20 cells (steps) in the histogram
Range cell density: 0.0012829768785739
9 0.05263200 <= 825569337
10 0.05264200 <= 842084405
>SARG values fall into cell 10
Estimated selectivity for B, selectivity = 0.014121, upper limit = 0.052642.
>Here selectivity is the product of interpolation; upper limit is the weight of the qualifying cell
>Interpolation estimates how much of the cell will qualify for the range SARG
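The slides do not give the interpolation formula. As an illustration, the sketch below assumes plain linear interpolation: the share of the cell's value range covered by the SARG, multiplied by the cell's weight. With the numbers from this slide that assumption reproduces the selectivity shown in the trace, but the optimizer's real calculation may differ in detail.

```python
def interpolate_range_selectivity(low, high, cell_lower, cell_upper, cell_weight):
    """Fraction of one cell's weight covered by the range [low, high].

    Assumes values are spread evenly across the cell (linear interpolation).
    """
    # Clamp the range to the cell so the covered fraction stays in [0, 1].
    low = max(low, cell_lower)
    high = min(high, cell_upper)
    if high <= low:
        return 0.0
    covered_fraction = (high - low) / (cell_upper - cell_lower)
    return covered_fraction * cell_weight


# Cell 10 from the 20-step histogram: bounded below by step 9's value.
selectivity = interpolate_range_selectivity(
    low=825570000, high=830000000,
    cell_lower=825569337, cell_upper=842084405,
    cell_weight=0.05264200,
)
print(f"{selectivity:.6f}")
```

The result can never exceed the cell weight, which matches the role of the upper limit in the trace.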

22
Some Statistics Related FAQs cont.
How many steps should I request?
>It will depend on your data and your queries
>Increase requested steps to get Frequency count cells when there are highly duplicated values
>A Frequency count (FC) cell represents only one value, so its weight is very accurate
>Range SARGs will estimate what portion of a cell qualifies for the SARG
>More cells means narrower cells (each represents fewer values)
>Narrower cells mean more accurate estimates
>Can have an effect on equi-SARGs: lower selectivity

23
Removing Statistics Can Affect Query Plans
Sometimes no statistics are better than having them
This will usually be an issue when very dense columns are involved
Histogram for column: "E"
Step Weight Value
1 0.00000000 < "no"
2 0.47256401 = "no"
3 0.00000000 < "yes"
4 0.52743602 = "yes"
This can also show up when you have spikes (Frequency count cells) in the distribution

26
Maintaining Tuned Statistics
Tuned statistics will add to your maintenance
Any statistical value you write to sysstatistics, either via optdiag or sp_modifystats, will be overwritten by update statistics
>Keep optdiag input files for reuse
>If needed, get an optdiag output file, edit it and read it in
>Keep scripts that run sp_modifystats
>Rewrite tuned statistics after running an update statistics that affects the column with the modified statistics

27
Monitoring Table/Index Level Fragmentation Using The Statistics
Can be both an optimizer and a space management concern
The more fragmentation, the less efficient page reads are
>Deleted rows: fewer rows per page, affects costing
>Forwarded rows: 2 I/Os each, optimizer adds to costing
>Empty data/leaf pages: more reads may be necessary
>Clustering can get worse
>Watch the DPCR (data page cluster ratio) of the table or APL clustered index
>In general the Cluster Ratios are not a good indicator of fragmentation since they are often normally low
>Use optdiag outputs to monitor these values

29
Maintaining the Statistics
When data changes the statistics become out of date
In general, up-to-date statistics are needed to get the best query plans
>Statistics are usually updated using update statistics commands
>The more statistics you have, the more maintenance
>It's a trade-off between the gain in query performance and the increased statistics maintenance
>There's no point in updating statistics if the table is static

30
Update Statistics
>Update statistics has been extended to allow for placement of statistics on columns
>update statistics table_name (col_name)
>update index statistics table_name [ind_name]
>update all statistics table_name
>Specify the requested number of steps (cells) to use when building the column's histogram
>update statistics table_name (col_name) using 200 values

31
How Update Statistics Works
Column and table/index values have to be read in order to gather the statistics
>What does it do?
>Reads the column to gather information for density and histogram, writes the column level statistics
>While reading the column it gathers index/table level statistics: row and page counts, forwarded rows, deleted rows, the cluster ratios, etc.
>Takes a sample value every X rows for a histogram boundary value (X is based on the number of rows and requested steps)
>If the same value is sampled for multiple steps, it is saved to make a Frequency count (FC) cell
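The boundary-sampling step can be pictured with a toy model. The sketch below is a simplification and not ASE's implementation: it sorts the column, takes the value at the end of every stride of rows, and flags any value that was picked for more than one step as a Frequency count candidate. The function name and the way the stride is derived are assumptions made for illustration.

```python
from collections import Counter


def sample_boundaries(values, requested_steps):
    """Toy model: pick a boundary value every `stride` rows of the sorted column.

    Returns (boundaries, frequency_values), where frequency_values are the
    values that showed up as the boundary for more than one step.
    """
    if requested_steps < 2:
        raise ValueError("at least 2 steps are needed")
    ordered = sorted(values)
    # Assumed stride: rows per cell, as in rows / (requested steps - 1).
    stride = max(1, len(ordered) // (requested_steps - 1))
    boundaries = [ordered[i] for i in range(stride - 1, len(ordered), stride)]
    counts = Counter(boundaries)
    frequency_values = sorted(v for v, c in counts.items() if c > 1)
    return boundaries, frequency_values


# A column of unique values never repeats a boundary.
unique_bounds, unique_fc = sample_boundaries(list(range(100)), 11)

# A two-valued column repeats both values across many steps.
flag_bounds, flag_fc = sample_boundaries(["no"] * 47 + ["yes"] * 53, 11)
print(unique_fc, flag_fc)
```

In this model a heavily duplicated value is hit by several consecutive samples, which is the situation the last bullet describes.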

32
How Update Statistics Works cont.
>Values have to be in sorted order for statistics gathering
>If it's the leading column of an index, no sort is necessary
>Just scan the index leaf for statistics
>If it's not the leading column of an index: create a worktable, read the values in, sort, and scan for statistics
update statistics tab_name (col_name): a table scan will be done to read the column
update index statistics tab_name ind_name: only an index scan (with a sort of the inner columns)
>The sort is done in a worktable in tempdb. update index statistics and update all statistics will use a lot of tempdb space unless sampling is used

33
Some Statistics Related Myths & Legends
Update statistics will result in improved performance
>Only guarantees up-to-date statistics
>Due to the distribution, the statistics may not give a pretty picture of the column
Always use update all statistics
>Rarely need statistics on all columns of a table
>Can take a VERY long time to run, makes maintenance difficult at best
>Should consider adding stats to composite index columns

35
Sampling for Update Statistics
A new feature in 12.5.0.3
Designed to dramatically reduce the time it takes to run update statistics
>Opens your maintenance window
>Decreases the cost of using ASE
Randomly selected pages are read, instead of all pages, to gather the column level statistics: less I/O
>The percentage of pages to be sampled can be specified
update statistics tab_name with sampling = X percent
>X is the percentage of pages you want to sample
>Can be between 1 and 100

36
Definitions
>Column Level Statistics: those statistics that describe the values in the column to the optimizer; an attribute of a column (i.e., the histogram and density values)
>Sampling: randomly reading rows from a specified percentage (subset) of pages, rather than all pages of the table, in order to gather column level statistics
>Sampling Rate: the specified percentage of pages to read
>Full Scan: to gather statistics by reading all pages of the object (table or index)
>Major Attribute of an Index: the leading column of an index as listed in the create index command

37
Sampling for Update Statistics cont.
>Unofficial tests show that a sampling rate of 10% on a 1 million row numeric column reduces the time for update statistics to run from 9 minutes to 30 seconds

38
Sampling for Update Statistics cont.
>The resulting histogram will be based on the values that are sampled
>It will differ from a histogram obtained from a full scan update statistics
>The lower the specified sampling percentage, the more the histogram will differ from a full scan histogram
>Test your queries against sampled statistics. In most cases you won't see any major changes
>Density values are not updated by sampling
>In most cases this won't be an issue

39
Why Sampling for Update Statistics?
>As datasets have grown, the time it takes to run update statistics has also grown, dramatically
>This became more of an issue with update index statistics, introduced in 11.9.x, due to the extra sort in a worktable
>TCO and auto-tuning/admin efforts require a faster way to run update statistics
>Without a faster update statistics neither effort would succeed
>Speeding up update statistics is a long-standing customer feature request
>Random page sampling is the most I/O efficient method
>It dramatically decreased the time to run update statistics

40
Why Sampling for Update Statistics? cont.
>Some time test results: not official, not for general release
>Your mileage may vary
>Timings are from tests run by Sybase QA
>1 million row int column; timings based on elapsed time
20% sampling rate
Full scan time: 2465850
Sampling time: 398783
Percentage of savings time (elapsed time): 83%
10% sampling rate
Full scan time: 2139013
Sampling time: 153130
Percentage of savings time (elapsed time): 92%
>Variations in full scan time are taken into account
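The savings percentages follow from the two timings in each test. The sketch below assumes the figures were truncated to whole percentages rather than rounded, since rounding would give 84% and 93%; that is an inference from the numbers, not something the slides state. The timing units are not given either, so the values are treated as plain counts.

```python
def percent_time_saved(full_scan_time, sampling_time):
    """Elapsed-time savings as a whole percentage, truncated (assumed, not rounded)."""
    return (full_scan_time - sampling_time) * 100 // full_scan_time


# Timings copied from the two tests above (units not stated in the source).
saved_at_20_percent = percent_time_saved(2465850, 398783)
saved_at_10_percent = percent_time_saved(2139013, 153130)
print(saved_at_20_percent, saved_at_10_percent)
```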

41
How Does It Work?
>Specify the percentage of pages to read via update statistics
>with sampling = X percent
>The percent value can be between 1 and 100
>The with extensions (with sampling = X percent and/or with consumers = X) must follow using X values
update statistics authors(auth_id) using 40 values with sampling = 10 percent
>Sampling reads all rows from each page read
>Row values are moved to the worktable to be sorted and the statistics gathered
>This saves tempdb space since the sampled set of values is smaller than if the whole column was read into the worktable

42
How Does It Work? cont.
Specific update statistics syntax and its effects
update statistics table_name [index_name] with sampling = X percent
>Will full scan index pages to update/create statistics on the major attribute(s) of the specified index, or of all indexes on the table, ignoring the specified sampling rate; sampling will not be done

43
How Does It Work? cont.
update index statistics tab_name [ind_name] with sampling = X percent
>Will full scan index pages to update/create statistics for the major attribute(s) of the specified index, or of all indexes on the table, ignoring sampling
>For minor index attributes, will use sampling to scan the requested percentage of pages, read those values into a worktable, sort, and gather statistics from there
>The space used in tempdb will decrease as the sampling rate decreases
update statistics tab_name (col_name) with sampling = X percent
>Will use sampling to update/create statistics for the specified column using the specified sampling rate. This applies to all columns, whether major attributes of an index or not
>Will not affect multi-column density values

44
How Does It Work? cont.
update all statistics table_name with sampling = X percent
>Will full scan index pages to gather statistics for the major attribute of all indexes; sampling will not be used on these columns
>Will use sampling to gather statistics for all columns that are not the major attribute of an index
>The space used in tempdb will decrease as the sampling rate decreases
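The rules for the four command forms with sampling can be condensed into one small decision function. This is only a restatement of the behaviour described in the "How Does It Work?" slides; the command labels and the function name are hypothetical shorthand, not ASE syntax.

```python
FULL_SCAN = "full scan"
SAMPLED = "sampled"


def scan_method(command, is_major_attribute):
    """How a column's statistics are gathered when 'with sampling' is given.

    command is one of "table_or_index", "index_statistics", "column", "all".
    These labels are hypothetical shorthand for the four command forms.
    """
    if command == "table_or_index":
        # update statistics table_name [index_name]: only major attributes
        # are processed, always by full scan; the sampling rate is ignored.
        if not is_major_attribute:
            raise ValueError("this form only gathers statistics on major attributes")
        return FULL_SCAN
    if command == "column":
        # update statistics tab_name (col_name): always sampled,
        # whether or not the column is a major attribute.
        return SAMPLED
    if command in ("index_statistics", "all"):
        # update index statistics / update all statistics:
        # major attributes are full scanned, other columns are sampled.
        return FULL_SCAN if is_major_attribute else SAMPLED
    raise ValueError(f"unknown command form: {command}")


print(scan_method("index_statistics", is_major_attribute=False))
```

The practical point is that only the column form samples a leading index column; every other form still pays for a full index scan on major attributes.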

45
How Does It Work? cont.
>Sampling is not used for create index
>Since a full scan is required to build an index, there is no additional cost for building the statistics

46
Trade Offs
>A sampled set of anything is not as accurate as examining the whole set
>Testing is needed to find the most effective sampling rate for a given dataset
>A histogram created with sampling is not likely to match a histogram created via a full scan
>Histogram boundary values will vary
>Cell weights will vary
>Minimum and maximum histogram boundary values will vary
>Since cell weights and Range cell density are used to cost all SARGs, a histogram from a sampled set will have an effect on SARG costing
>Variations in the upper and lower histogram values may result in out-of-bounds costing by the optimizer
>The smaller the sampling rate, the greater the variance is likely to be

47
Trade Offs cont.
>If there are existing density values they will not be overwritten. If there are no density values, a default value of 0.100000 will be used for both the Range cell and Total density values
>There is currently no information saved about the use of sampling (whether or not it was used, and the sampling rate)
>Different cell types may appear
>As the sampling rate decreases, it is possible that Frequency count and/or Range cells may appear where they didn't exist prior to sampling
>The same pages will be resampled if the dataset is static and the same sampling rate is used

50
Tuning and Troubleshooting
>Trial-and-error testing/tuning will need to be done to determine the optimal sampling rate for a given dataset
>In most cases variations in the statistics will have no effect. In other cases small variations may change query plans
>There is no rule of thumb on what sampling rate to use
>In some cases the same sampling rate may be fine across all or most tables/columns
>In some cases sampling may not result in efficient plans
>Use showplan and traceon 302/310 outputs to track changes to the query plan as the sampling rate changes
>Using sample queries, get the above outputs with statistics gathered by a full scan. Then update statistics with the sampling rate, rerun the query and compare the outputs

51
Tuning and Troubleshooting cont.
>Use optdiag to monitor changes to the histogram
>Check the optdiag output of a full scan histogram for the upper and lower boundary values; these can be inserted into the histogram if needed
>Keep a copy of the optdiag output file as a backup of the statistics in case old values need to be reloaded

52
Future Enhancements
>This first implementation of sampling will require some enhancements
>Scale density values gathered by sampling so that they are more accurate
>Track the min/max values in the column in order to maintain the upper and lower boundary values of the histogram
>Sampling index pages
>Will help decrease the time of running update statistics even further
>Add a mechanism to record if sampling was used and what sampling rate was last used
>Add this information to optdiag and traceon 302 (and future optimizer diagnostics)

54
Where To Get More Information
>The latest Performance and Tuning Guide
>Don't be put off by the ASE 12.0 in the title; it covers the 11.9.2 features/functionality too
>http://sybooks.sybase.com/onlinebooks/group-as/asg1200e
>Any What's New docs for a new ASE release
>Tech Docs at Sybase Support
>http://techinfo.sybase.com/css/techinfo.nsf/Home
>Upgrade/Migration help page
>http://www.sybase.com/support/techdocs/migration