DATE Datatype Or NUMBER Datatype – Which Should be Used?

I have a scenario where I need to store data in the format YYYYMM (e.g. 201001, which means January 2010). I am trying to evaluate the most appropriate datatype for storing this kind of data, comparing two options: NUMBER and DATE. The data is essentially a component of the Oracle DATE datatype, and experts like Tom Kyte have demonstrated (with examples) that using the right datatype is better for the optimizer. So I was expecting that the DATE datatype would yield cardinality estimates at least as accurate as those produced with the NUMBER datatype. However, my tests show that when using DATE the cardinality estimates are way off from the actuals, whereas when using NUMBER the cardinality estimates are much closer to the actuals.
My questions are:
1) What is the most appropriate datatype for storing YYYYMM data?
2) Why does the DATE datatype yield estimates that are further from the actuals than those produced with the NUMBER datatype?

Test case (update Jan 7, 2010: there was a copy-paste error in the line that collects statistics on table B – the original version of the script posted here collected statistics on table A twice):
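The OP's data creation script is not reproduced here. Based on details that surface later in this article (289 distinct YYYYMM values from 200101 through 202501, roughly 8,000 rows per value, stored both as a NUMBER and as a DATE, with statistics gathered using estimate_percent=>NULL per a later comment), a minimal sketch of a comparable table B might look like the following – the table and column names match the queries below, while the row-generation technique is my own assumption:

-- Sketch only, not the OP's actual script: 289 months starting at 200101,
-- 8,000 rows per month, in both NUMBER (ID) and DATE (DT) form.
CREATE TABLE b AS
SELECT
  TO_NUMBER(TO_CHAR(ADD_MONTHS(TO_DATE('200101','YYYYMM'), m.rn),'YYYYMM')) id,
  ADD_MONTHS(TO_DATE('200101','YYYYMM'), m.rn) dt
FROM
  (SELECT LEVEL-1 rn FROM dual CONNECT BY LEVEL<=289) m,
  (SELECT LEVEL rn FROM dual CONNECT BY LEVEL<=8000) r;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',ESTIMATE_PERCENT=>NULL)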

This is an interesting problem: why would using the NUMBER datatype yield better cardinality estimates than the example with the DATE datatype? When the NUMBER datatype was used, the optimizer predicted that the full table scan operation would return 46,604 rows, while the optimizer predicted that the full table scan would return 5,919 rows when the DATE datatype was used – the actual number of rows returned is 49,152.

The person who posted the above test case later stated that he believes that the DATE datatype is the correct choice, but he would have a difficult time justifying that opinion when confronted by someone suggesting the use of the NUMBER data type.

Well, it seems that Oracle 11.1.0.7 predicted that when the NUMBER datatype was used, the full table scan would return roughly 108,000 rows. Oracle 11.1.0.7 predicted that when the DATE datatype was used, the full table scan would return 57,166 rows – significantly closer to the actual number of 49,152. If there were an index on that column, how would the different cardinality estimates affect the possibility that the optimizer might select to use that index rather than a full table scan? What if the data volume were increased by a factor of, say, 1,000 or 1,000,000?

I also captured a 10053 trace during the test run, and found this in the trace file:

Notice in the above that Oracle’s statistics gathering process did not create histograms when I collected statistics for the tables. The calculated cost is the same for either datatype, but what would happen if that table were then joined to another table? Is the optimizer seeing histograms in some of the original poster’s test cases?

The original poster is running Oracle Database 10.2.0.1, so I ran a test on Oracle 10.2.0.2 with OPTIMIZER_FEATURES_ENABLE set to 10.2.0.1, using the data created by the OP’s data creation script:

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'DateTest';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';
select count(*) from b where id between 200810 and 200903 ;
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;
ALTER SESSION SET EVENTS '10053 trace name context off';
set autotrace traceonly explain
select count(*) from b where id between 200810 and 200903 ;
select count(*) from b where dt between to_date(200810, 'YYYYMM') and to_date(200903, 'YYYYMM') ;
set autotrace off
select count(*) from b where id between 200810 and 200903;

The output from the above script, when run on Oracle 10.2.0.2, follows:

Notice that no histograms were collected based on the 10053 trace file.

Now for a second test; this time we will instruct Oracle to create histograms, and also force the optimizer to hard parse the SQL statements that reference table B when those SQL statements are re-executed:
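(The script itself is not reproduced here. A hedged sketch of the two critical steps follows – the 254 bucket request matches the histogram reported in the 10053 trace described below, and flushing the shared pool is simply one convenient way to force a hard parse on a test system; do not do this on a production system.)

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',METHOD_OPT=>'FOR ALL COLUMNS SIZE 254')

ALTER SYSTEM FLUSH SHARED_POOL;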

Interesting – both queries estimate that the full table scan operation will return 46,604 rows. That cardinality estimate exactly matches the cardinality estimate in the OP’s plan for the SQL statement that accessed the NUMBER datatype…

The 10053 trace file shows that in both cases a height balanced histogram with 254 buckets was created. But, how accurate would the estimate be if there were 1,000 or 1,000,000 times as many rows? What if the time interval were changed to something else? What if each of the 289 distinct values for the ID and DT columns did not have an equal distribution of values?

So, why select a DATE datatype rather than a NUMBER datatype? These are the reasons that I proposed in the discussion thread:

One of the problems with putting date values in number columns is this: if you select the range from 200810 to 200903, the optimizer will likely assume that 200810 is just as likely a value as 200808, 200812, 200814, 200816, 200820, 200890, 200900, etc. Some of those year/month combinations are simply not possible. In such a case, the optimizer should over-estimate the number of rows returned from that range when the column datatype is NUMBER, and should be reasonably close when the column datatype is DATE, since the optimizer knows that 200814 (month 14 of 2008), 200816 (month 16 of 2008), 200820 (month 20 of 2008), 200890 (month 90 of 2008), 200900 (month 00 of 2009), etc. could never be dates (and would be completely out of the serial sequence of dates). By putting the date type data into a DATE column, you have essentially added a constraint to the database to prevent invalid dates from being added. Additionally, date math, such as finding the number of days between 200802 and 200803 (compared to 200702 and 200703), is very simple – the answer is not 1 in both cases, but rather 29 and 28, respectively.
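As a quick illustration of that last point (TO_DATE with a 'YYYYMM' format mask returns the first day of the month, and subtracting one DATE from another returns the number of days between them):

SELECT
  TO_DATE('200803','YYYYMM') - TO_DATE('200802','YYYYMM') DAYS_2008,
  TO_DATE('200703','YYYYMM') - TO_DATE('200702','YYYYMM') DAYS_2007
FROM
  dual;

-- Returns 29 and 28, respectively (2008 was a leap year); the equivalent
-- subtraction of the NUMBER values returns 1 in both cases.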

Any other comments?

OK, enough guessing – let’s try a couple of tests. Here is the test table, with 10,000,000 rows and an uneven distribution of rows for each value:
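(The table creation script is not reproduced here. A rough sketch under my own assumptions – the ID and DT columns as before, a skewed row generator, indexes on both columns, and no histograms:)

-- Sketch only: 10,000,000 rows where later months receive progressively
-- more rows than earlier months – SQRT of a uniform random value skews
-- the month offsets toward the high end of the 0 to 288 range.
CREATE TABLE t1 NOLOGGING AS
SELECT
  TO_NUMBER(TO_CHAR(dt,'YYYYMM')) id,
  dt
FROM
  (SELECT
     ADD_MONTHS(TO_DATE('200101','YYYYMM'),
                TRUNC(289*SQRT(DBMS_RANDOM.VALUE))) dt
   FROM
     dual
   CONNECT BY LEVEL<=10000000);

CREATE INDEX t1_id ON t1(id);
CREATE INDEX t1_dt ON t1(dt);

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'T1',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1')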

In the above, the estimated number of rows with the comparison on the numeric column is about 250,000 rows above the actual, while the comparison on the date column is about 400,000 rows below the actual – could this be enough of a difference to change the execution plan if the clustering factor of the indexes were high? What if the tables had a larger average row length? What if this table were joined with another table? Note that the number of logical blocks accessed is less with the index on the numeric column.

In the above, the estimated number of rows with the comparison on the numeric column is about 10 times as high as the actual number of rows, while the estimated number of rows with the comparison on the date column is about 5 times as high. Could this be enough to trigger a different execution plan for the queries – where one uses an index access, while the other uses a full table scan?

This time, the optimizer’s estimate when the date datatype was used is very close, while the optimizer’s estimate when the numeric datatype was used is about 4 times greater than the actual number of rows.

This time with the numeric datatype, the optimizer is estimating 1,295,000 rows when in fact only 45,260 are returned during the index range scan. The estimate with the date datatype is also quite high, but it is 4 to 5 times lower (thus closer to the actual) than with the numeric datatype.

Here, with the wider date range, the optimizer is closer with the numeric data type, and has selected to perform a fast full scan of the index, which would use multi-block reads if disk accesses were required. Note that the optimizer selected to perform a full table scan when the date datatype was used, even though the estimated number of rows was less. Note too that this in-memory operation completed about twice as fast as the in-memory index fast full scan operation.

Now let’s take a look at what happens when a histogram is present on each of the columns. The script is identical to the previous script, except for the first two lines:
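(Those first two lines are not shown above; presumably they change the statistics collection to request histograms on both columns, something similar to this assumption:)

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'T1',CASCADE=>TRUE, -
  METHOD_OPT=>'FOR ALL COLUMNS SIZE 254')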

As the above indicates, with a 254 bucket histogram on both of the columns the optimizer calculates estimated row counts that are typically very close to the actual row counts for both datatypes – essentially the only difference is the number of logical reads. So, adding the histogram helps, but then what if the OP implements good coding standards and uses bind variables rather than constants (literals)?

Like this:
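(The bind variable version of the script is not reproduced here; a hedged sketch, with variable names of my own choosing:)

VARIABLE n1 NUMBER
VARIABLE n2 NUMBER
EXEC :n1 := 200810
EXEC :n2 := 200903

SELECT COUNT(*) FROM t1 WHERE id BETWEEN :n1 AND :n2;

VARIABLE d1 VARCHAR2(6)
VARIABLE d2 VARCHAR2(6)
EXEC :d1 := '200810'
EXEC :d2 := '200903'

SELECT COUNT(*) FROM t1 WHERE dt BETWEEN TO_DATE(:d1,'YYYYMM') AND TO_DATE(:d2,'YYYYMM');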


7 responses


Narendra (07:05:29):

Charles,

Thanks a lot for dedicating a blog post and, more importantly, providing exhaustive test cases to elaborate the issue.
If you have read the OP’s response, it appears that the OP is not in a position to influence the “implementing good coding standards” part.🙂
As for the usage of bind variables, what will happen when bind peeking and adaptive cursor sharing kick in?🙂
BTW, I vaguely remember reading a blog post by Jonathan Lewis (I guess), the moral of which was “usage of bind variables and the presence of histograms generally do not go together”. I agreed with it. So are we now saying that this kind of data pattern is an exception to that statement?

p.s. BTW, I am the “OP” of the forum post, and my personal opinion is still to use the DATE datatype instead of NUMBER.🙂

“when the column data type is DATE, since the optimizer knows that 200814 … etc. could never be dates ”
I disagree with the assumption that ‘possible’ values play any part in the optimization process. A decimal value, such as 200810.32, is just as likely to be in a number field as 200811 or 200815. Similarly, a DATE value can be ‘2008/10/15 12:31:33’ as easily as ‘2008/10/15’. I don’t believe there is any built-in assumption that an integer or ‘day’ value is more likely to be in a field than a decimal or datetime value. Instead, it is the number of distinct values that plays a part, not the possible values.

Say a column contains 150 rows, with 5 distinct values in a number column, with a low of 1 and a high of 150. The assumption Oracle makes is that any one value in that range occurs 30 times. If you select any specific value from that table, the estimated number of rows will always be 30, whether the value is 10 or 1.3443.
If you specify a range (e.g. 1 to 50), it determines what proportion of the entire range (1 to 150) is included in the requested range (e.g. 33%) and gives an estimate based on 33% of the table being returned. Except it doesn’t seem quite that simple, as it assumes some sort of bell curve rather than an even distribution.
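A minimal sketch that reproduces this 150 row example (the specific values are illustrative choices of mine, picked so that the low is 1, the high is 150, and each of the 5 distinct values appears 30 times):

CREATE TABLE test_150 AS
SELECT
  DECODE(MOD(LEVEL,5), 0,1, 1,38, 2,75, 3,112, 150) id
FROM
  dual
CONNECT BY LEVEL<=150;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'TEST_150',METHOD_OPT=>'FOR ALL COLUMNS SIZE 1')

-- Without a histogram, an equality predicate on any value in the 1 to 150
-- range should be estimated at 150/5 = 30 rows, and a range predicate at
-- roughly the fraction of the min-max span that the range covers.
SET AUTOTRACE TRACEONLY EXPLAIN
SELECT COUNT(*) FROM test_150 WHERE id = 10;
SELECT COUNT(*) FROM test_150 WHERE id BETWEEN 1 AND 50;
SET AUTOTRACE OFF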

In my XE tests, I found the cardinality estimates for numbers and dates to be pretty close. I wonder if the closeness between the number of distinct values (289) and the maximum number of histogram buckets (254) meant there were some weird edge cases where the estimates were wildly off for specific buckets.

I believe that my remark that you quoted is not as clear as it probably should be. In the OP’s system, 200814 would be the 14th month of the year 2008, which would be an impossible date, but not an impossible number between 200812 and 200901; without a histogram, the optimizer should assume that some of the 289 distinct values fall into that number range – that is the point I was trying to make. Again, without a histogram, would the optimizer assume that the same percentage of those 289 distinct values fall between 200811 and 200812 as it assumes fall between 200812 and 200901?

The optimizer’s estimated cardinality, without the histogram, would probably swing wildly if the date range was less than a couple years, for instance, sometimes including part of one year and part of the next.

For example (using the OP’s sample data), wrapping around the end of the year:
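(The example queries and their output are not reproduced here; reconstructed from the OP’s table, they presumably resembled these:)

-- A range that wraps from one year into the next: the NUMBER version must
-- reason across the impossible 200813 to 200899 portion of the range.
SELECT COUNT(*) FROM b WHERE id BETWEEN 200811 AND 200901;

SELECT COUNT(*) FROM b WHERE dt BETWEEN TO_DATE(200811,'YYYYMM') AND TO_DATE(200901,'YYYYMM');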

The OP has 24 years of data in the sample table. Will the cardinality difference really change the execution plan? Is the decrease in the number of consistent gets with the NUMBER datatype an advantage? How does the performance change when the average row length is much longer for the table? How does the performance change if there is an index? Does the order in which the rows are inserted matter? What about the CPU cycles burned converting the NUMBER column to a DATE for some comparison, or converting a string to a DATE value for inserting into the table or for use in the WHERE clause? What if bind variables are used, rather than constants?

I would like to encourage people to continue sharing their thoughts on this topic.

I see where you are coming from but still disagree. I reproduced the situation in XE. With the basic DBMS_STATS call, with just estimate_percent=>NULL, it creates 1 bucket for the date and 254 for the number. I think the DBMS_STATS gathering is recognizing the ‘lumpiness’ of the numbers. It gets 8000 for each of 200101 to 200112 and doesn’t get any for 200113 to 200199, so it sees that as skew and creates the histograms. It doesn’t see any skew for the dates, so it doesn’t create histograms.
If you force the stats package to create a certain number of histogram buckets for both columns, the cardinality estimates are pretty much the same.
My contention is that the datatype is affecting the SKEWONLY aspect of stats gathering and doesn’t directly affect the optimizer.

It is good to see that this article has sparked some degree of discussion. Gary, while your copy of XE and the OP’s copy of Oracle 10.2.0.1 created a histogram on the NUMBER column and not on the DATE column when this command was executed:
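(The command is not reproduced here; based on Gary’s description of his test, it was presumably a basic statistics collection call similar to this:)

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'B',ESTIMATE_PERCENT=>NULL)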

No histograms were created during my tests with Oracle 10.2.0.2, 10.2.0.4, or 11.1.0.7 when using the above command with the sample data from the OP’s test case (this was after correcting the copy-paste error noted above). If a histogram were automatically created on the NUMBER column, and the OP switched from using constants to bind variables with bind peeking left enabled, there could be a significant price to pay for having that histogram present when the bind variable values change. Unfortunately, I do not have the book “Troubleshooting Oracle Performance” with me at the moment, but I seem to remember that the default behavior for histogram collection (important when METHOD_OPT is not specified in the DBMS_STATS call) can be adjusted (edit: pages 147-150 of “Troubleshooting Oracle Performance” describe how to adjust the defaults on 10g and 11g).

I do, however, have a copy of “Cost-Based Oracle Fundamentals” sitting next to me. Page 130 of that book shows why the following example on your blog article predicts such a low cardinality value:

select count(*) from test_150 where id < -5;

-5 is below the minimum value for the ID column, while 200814 (as used in my example) falls between the minimum value and the maximum value.

This is a good discussion.

Here is the explain plan output from 10.2.0.2 using the script at the top of this article:

Good discussion. Okay, I accept that it isn’t just the way the stats are gathered, and that the optimizer is involved.
Now, with a low value of 200101 and a maximum of 202501, that is a (numeric) span of 2400. The range 200810 to 200903 is 94 (or about 4% of the total).
With a low of 2001-01-01 and a high of 2025-01-01, we have a span of 8766 days. The range 2008-10-01 to 2009-03-01 is 151 days (or about 1.7% of the total).
We can measure the size of the date ranges in weeks instead (1252 and 21.5, respectively), but the ratio of the ‘range of interest’ to the ‘range of existing values’ remains the same at around 1.7%.
With 2.3 million rows and 8000 rows per distinct value, 4% would be about 93,000 from the range; add in the 8000 for the top of the range and 8000 for the bottom, and it comes close to the 108K estimate.
Similarly, 1.7% would be around 41,000; adding the 8000 for the top and 8000 for the bottom comes close to the 57K estimate.
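Gary’s arithmetic is consistent with the standard range selectivity formula for a column without a histogram (as described in “Cost-Based Oracle Fundamentals”) – roughly sketched here with the test case numbers, as a sanity check rather than an exact reproduction of the optimizer’s calculations:

selectivity = (range_high - range_low) / (max - min) + 1/NDV + 1/NDV

NUMBER: (200903 - 200810) / (202501 - 200101) + 2/289 = 0.0388 + 0.0069 = 0.0457
DATE:   151 days / 8766 days + 2/289              = 0.0172 + 0.0069 = 0.0241

cardinality = selectivity * 2,312,000 rows

That works out to roughly 105,700 (NUMBER) and 55,700 (DATE) – in the same ballpark as the 108K and 57K estimates.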

If you use “id in (200110, 200111, 200112, 200201)” and “dt in (date ‘2001-10-01’, date ‘2001-11-01’, date ‘2001-12-01’, date ‘2002-01-01’)”, then they both come out with consistent estimates.

So with either dates or numbers, the range estimates aren’t great. With dates, it over-estimates by 8000 (the number of records per distinct value). With numbers, the estimate is off by a variable amount depending on how many times the range spans a chunk representing the missing ‘13’ to ‘99’ months. It would start off significantly under-estimating for small ranges that do not include such a chunk, over-estimating where the range does include a chunk, and getting less incorrect as the range grows larger. At various points the incorrect estimates would balance out – sort of like a stopped clock being right twice a day. I’d prefer not to rely on it.
