Find Missing Date Ranges in SQL

Often, the quickest way to determine whether you have missing data in a table such a ledger or journal is to see if there are gaps in the dates where one wouldn't expect them. But how do you do that in an emergency, particularly in a large table, when every minute counts?

The Task Defined.

Imagine this: The customer runs the application that you've designed, and cannot see any sign of business transactions over a time period where he'd expect to see them. Your manager asks you to investigate. He will almost certainly ask you to find all of the missing days or hours in a date range where no data has been recorded: The date range can be several years.

The situation may become even worse; missing data can have unpleasant consequences, and that fact will make both the customer and the boss very nervous.

As a result, you will be under stress: you have to reassure your boss and the customer, and you need to do that in a very limited amount of time.

In this article, we will discuss a few SQL, one-query techniques that can help you to find the missing dates.

Preparations

Firstly, you will need to create and load an auxiliary table that will store the sequence of whole numbers. The volume of numbers in that table will be equal to the number of days in the year 2007 (see Listing1):

Secondly, generate some test data and load that data into a sample table. Then produce the gaps, using the next algorithm: delete dates, where the number of days since Jan 01, 1900, divided by 4 or 5 or 6 has zero in remainder.

There is nothing wrong with the solutions shown in the Listing3. However, if you need to find the missing hours in the whole year or the missing days in a few years, these queries may produce a long list of dates, which would be soon become unwieldy and difficult to interpret.

Your manager would not appreciate being given this sort of report.

Therefore, you need to find another solution, which will represent the result in a way that is more convenient for reading and analyzing.

Technique #2 –finding the ranges

Think of the missing dates as being a number of gaps in the continuous sequence of the dates. Then, instead of generating each missing date, you can show just the beginning and the end of each gap. The code in the Listing4 demonstrates two variants of such a solution:

Both queries from the Listing 4 use self-join and produce the same result in approximately the same time. However, if you test the queries on a larger sample set; for ten years, for example, instead of one year, you will find that the second query runs much faster than the first one.

The reason for such a difference in performance is the way in which the queries deal with dates.

The first query uses an arithmetic operator (+) in order to add one day to the date. The second query for the same operation employs SQL Server built-in function.

The first query from the Listing 4 is typical in showing how the use of arithmetic operations in date/time calculations can decrease the queries’ performance.)

So far, we examined the technique #2 on relatively small data sets.

There are, however, the situations when the amount of data that needs to be processed will be significantly larger. For example, your system may collect the POS (point of sale) transactions from multi-chain department and grocery stores located in different countries and (or) time zones.

In that case, if you asked to find the missing minutes (seconds) in a few months or even years, you may find that the fastest (second) query from the Listing 4 is also inefficient.

Then, you will need to enhance your queries in order to get the acceptable execution time.

One possible enhancement would be to employ an auxiliary column, as shown in the Listing 5:

The queries from the Listing 5 are based on the idea of SQL Server MVP Erland Sommarskog (Stockholm).

All you need to do is to assign the order number to each row in the table and then to compare the values (they can be dates or numbers or even letters) in the adjacent rows, using the order numbers.

The first query in the Listing 5 uses the identity column in order to enumerate the rows in the table.

The second and third queries use ranking function ROW_NUMBER().

All three queries are much faster than the fastest (second) query from the Listing 4.

Note, that queries from the Listing 5 require column theDate (table sampleDates) to be sorted before assigning the order number. Since table sampleDates has a primary key constraint, the clustered index is also created on column theDate. That means you don’t need to sort the data in table sampleDates explicitly: clustered index will do that automatically.

One more thing that you should note is the incomplete results, returned by all the queries from the Listing 4 and Listing 5.

Indeed, Jan 01, Dec 30 and Dec 31 are missing dates in table sampleDates. However, they are not shown in the result as the gaps, because all the queries treat Jan 02, 2007 as the beginning and Dec 29, 2007 as the end of the sample data set.

In order to get the correct result, you need to insert the dates Dec 31, 2006 and Jan 01, 2008 into the table sampleDates.

Technique #3 –the Pivot table

One more technique presents missing dates in the form of the spreadsheet.

This type of output is one of the most attractive for the human eye and probably is the favorite document type of your boss.

The technique shown in the Listing 6, uses SQL Server 2005 PIVOT operator in order to produce tabular output:

The beauty of the query from the Listing 6 is not only in its cross-tab view. That query also shows days of the week, which is very convenient, when you need to analyze business (application) that runs not whole week, let us say 5-6 days per week. In that case, the missing Saturdays and Sundays will be considered as weekends, but not as the missing days.

How To Find The Missing Hours

This bonus example shows how to apply the above techniques to the case of missing hours

You can also use this paragraph as a reference, when you need to find missing minutes or seconds.

-- Just to make the last hour of the year missing DELETE sampleDates_2 WHERE theDate = '2007-12-31 23:00:00';

-- In order to get the very first gap in the beginning of the year and -- the very last gap at the end of the year, add the next dates INSERT INTO sampleDates_2 VALUES('Dec 31, 2006 23:00:00'); INSERT INTO sampleDates_2 VALUES('Jan 01, 2008 00:00:00');

The following script shows the solutions for missing hours (see Listing 8):

The first solution from the Listing 8 shows the missing hours as the beginning and the end of the gap.

The second solution shows the number of missing hours per each day

If this is not enough, the third solution will give you detailed information about missing hours.

The Task Completed: Here Are The Results

You found missing dates and sent the results to your boss. After short investigation, the problem was identified: some data has been deleted from database by mistake, during the application upgrade.

In that situation, everybody relied on you and once again, you rose to the challenge. You found the old backup, restored the database under another name and moved the missing data to the original database.

Here is the result of your efforts:

The customer is satisfied. The manager impressed by your skills. You can be proud of your achievements.

The source code to this article can be downloaded from the speech-bubble at the top of the article by clicking on 'TSQL Source Code'

In a previous job, I was performing production support and reporting for a large e-commerce database. What helped was a sequentially incremented integer column (specifically an identity column) on each transaction table, so identifying missing gaps was more simplified and deterministic. Of course a gap in an identity column doesn't always indicate "missing" transactional data and the values are not always in strict sequence. Gaps may be the result rolling back an aborted transaction, and it's a good idea to investigate why an application is frequently rolling back transactions.

I agree. The gaps in identity column’s values, especially in the production database, may indicate a few serious problems: in application design; in hardware; even a bug that appears under high pressure (high volume of transactions) or when certain logical pattern(s) are met.It is may be a good idea to have a scheduled task that checks the gaps in identity columns and sends the alerts to DBA, when the amount or size of the gaps exceeds certain threshold.

First off, I'm a newbie to SQL (maybe one week on time off) but need a good query to find "openings" in an existing MS Access appointment table that is part of my office management software. Basically, I need to find the "next available appointment" from a table of existing appointments. So I'm looking for "gaps" or "openings" between appointment time intervals.

The relevant fields of this table are [Occur Date], [Time] and [Length] which are for the date of the appointment, it's time of day and how long it is. So an appointments start time would be [Occur Date]+[Time] and end time would be [Occur Date]+[Time]+[Length].

A note on [Length]: If the appointment time is 09:00 and has length 00:30 then it ends at 09:30 not 09:29 which means overlap at an instant of time into appointments that start at 09:30.

I tried the following query to get a set of opening times but it gives me "errors" which would also seem to occur from the code you present in your article. I was wondering what the query would be to avoid these problems or if instead of finding a table of openings it would be better just to find one opening at a time.

Anyway here is my query. The problems are that it gives back gaps of zero length since "back-to-back appointments overlap one another at one instant in time. Changing <= to < in(C.[Occur Date]+C.[Time]+C.[Length]) <= (B.[Occur Date]+B.[Time]) leads to problems of of these appointments not being taken into account.

The second problem is appointment intervals can overlap and one gets back gaps that don't exist.

Third, is that redundant gaps are found and this probably relates to the second problem of overlapping appointments.

First off, I'm a newbie to SQL (maybe one week on time off) but need a good query to find "openings" in an existing MS Access appointment table that is part of my office management software. Basically, I need to find the "next available appointment" from a table of existing appointments. So I'm looking for "gaps" or "openings" between appointment time intervals.

The relevant fields of this table are [Occur Date], [Time] and [Length] which are for the date of the appointment, it's time of day and how long it is. So an appointments start time would be [Occur Date]+[Time] and end time would be [Occur Date]+[Time]+[Length].

A note on [Length]: If the appointment time is 09:00 and has length 00:30 then it ends at 09:30 not 09:29 which means overlap at an instant of time into appointments that start at 09:30.

I tried the following query to get a set of opening times but it gives me "errors" which would also seem to occur from the code you present in your article. I was wondering what the query would be to avoid these problems or if instead of finding a table of openings it would be better just to find one opening at a time.

Anyway here is my query. The problems are that it gives back gaps of zero length since "back-to-back appointments overlap one another at one instant in time. Changing <= to < in(C.[Occur Date]+C.[Time]+C.[Length]) <= (B.[Occur Date]+B.[Time]) leads to problems of of these appointments not being taken into account.

The second problem is appointment intervals can overlap and one gets back gaps that don't exist.

Third, is that redundant gaps are found and this probably relates to the second problem of overlapping appointments.