TSQL – Solve it YOUR Way – Gaps and Islands… With a Twist!

Introduction:

As part of the blog series TSQL – Solve it YOUR Way, today’s topic is a variation of the popular T-SQL question titled “Gaps and Islands” followed by three different solutions presented by three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Alejandro Mesa, and Tom Cooper.

Gaps and Islands questions revolve around sequential data with missing values. The term “gaps” refers to the ranges of missing values, while the team “islands” refers to the ranges of existing/sequential values. This topic takes on many forms and shows up occasionally in SQL Server interviews, with variations such as the following:

Given a table of natural numbers, return a two column table showing the missing numbers (ex: missing from range 6-8, 11-21, etc)

Find the missing Order IDs given an Orders table with sequential Order IDs

Write a function that returns the hours worked in the past month by a specific employee

Topic: Given a series of integer ranges representing painted segments of a bridge, find the areas of the bridge where there is no paint, as well as the areas where the paint has been overlapped.

Recently, a question was asked in the TSQL MSDN forums that was a variation of the gaps and islands question, with a small twist. The question revolves around a bridge of length 100 that has been painted in various segments as shown with the following data:

For this exercise, we need to determine how to return the gaps (the areas where the bridge is NOT painted) as well as where the paint has been overlapped.

Solution #1: Provided by Steven Wang

Code Snippet

/****************************************************************

Solution Preparation:

1. Create a temp table #BridgeP

2. Create a temp table #BridgeMaxLength to hold the max length of all bridges

3. Insert some test data

4. Create nonclustered index to boost the performance for large volume of data.

As documented in the comments throughout the solution, Steven takes the following approach.

Since the bridge paint records have overlaps, we first need to identify all of those paint starting points for the entire bridge. These starting points must not be adjacent to any end painting points recorded and must be less than any other starting points within the same packed painting group.

Using similar logic, but in a reversed order, we will identify the ending points.

Since a bridge should always start from 0 and end with the maximum length, we add these 2 points together with all starting points and end points. The union all query results must come back with an even number (pairs of start and end point) for each bridge.

In order to put starting point and ending point pairs into the same group, I use the Row_Number function:(Row_Number() Over(Partition By BridgePID Order By GapStart) – 1) / 2If the row number is 1 or 2, then return 0.

Solution #2: Provided by Alejandro Mesa

Code Snippet

SETNOCOUNTON;

USEtempdb;

GO

Declare@BridgePTable(

BridgePIDint,

FromLenDecimal(14,4),

ToLenDecimal(14,4)

);

Declare@MaxBridgeLenghtDecimal(14,4)= 100.0000;

Insertinto@BridgeP

Values

(1,10.0002,20.0004)

,(2,18.0000,21.0004)

,(3,11.12345,20.0002)

,(4,40.0000,50.0003)

,(5,19.1100,35.0000)

;

WITHC0AS (

SELECT

FromLen,ToLen

FROM

@BridgeP

UNIONALL

SELECT

FromLen,ToLen

FROM

(VALUES (0, 0),(100, 100))AST(FromLen,ToLen)

)

,C1AS (

SELECT

FromLenASln,

+1 AS[type],

1 ASsub

FROM

C0

UNIONALL

SELECT

ToLenASln,

–1 AS[type],

0 ASsub

FROM

C0

)

,C2AS (

SELECT

ln,

SUM([type])OVER(

ORDERBYln,[type]DESC

ROWSBETWEENUNBOUNDEDPRECEDINGANDCURRENTROW

)–subAScnt

FROM

C1

)

,C3AS (

SELECT

ln,((ROW_NUMBER()OVER(ORDERBYln)– 1)/ 2)+ 1 ASgrp

FROM

C2

WHERE

cnt= 0

)

,C4AS (

SELECT

MIN(ln)ASFromLen,

MAX(ln)ASToLen

FROM

C3

GROUPBY

grp

)

,C5AS (

SELECT

*,

LEAD(FromLen, 1,ToLen)OVER(ORDERBYFromLen)ASnxt_FromLen

FROM

C4

)

SELECT

ToLenASGapStart,

nxt_FromLenASGapEnd

FROM

C5

WHERE

nxt_FromLen–ToLen> 0.0001;

GO

/*

GapStart GapEnd

0.0000 10.0002

35.0000 40.0000

50.0003 100.0000

*/

The beauty of this approach is that it could easily be adapted to report on multiple bridges or roads, whatever it is that you are painting, as in the following modified example.

I am going to use his approach to tackle your question since I didn’t see any solution using the new support to the OVER clause (window functions).

Finding gaps using LEAD function is simple and I think that reading the code is enough. For the solution about packing overlapping ranges, the key is in splitting the range (each row) in two sets, one for the initial of the range (FromLen) and another for the end (ToLen), and adding a type (+1 or -1) depending if it is the start (FromLen) or the end of the range (ToLen). We are going to calculate the running total of [type] minus the [sub], and we will be interested only on those where the formula ([cnt]) is zero. This will give us pairs of start and end, and we can identify the pairs by enumerating those rows and using ((ROW_NUMBER() OVER(…) – 1 / 2) + 1), the rest is grouping and calculating the MIN and MAX per group.

Note: I didn’t include an ORDER BY clause for presentation purpose, neither setup proper indexes to support the OVER clause.

As noted in the original forum post, this is essentially an “Islands and Gaps” problem. The desired result is all the islands of unpainted points. But the input data is somewhat atypical. In a typical islands and gaps problem, the data is a series of points (for example, you have a table where each row had two columns – employeeid and sickday (datatype date)). And you want a report show the ranges of days an employee was out sick. So that you could produce a report like Employee 5 was out Jan 5 thru Jan 7. There is a good deal of information on how to solve this type of problem available both online and in books. One good discussion is by Itzik Ben-Gan at http://www.manning.com/nielsen/SampleChapter5.pdf.

So if the input data was a set of rows showing all the unpainted points, using that technique would solve our problem. But instead, the data is a series of ranges of painted points (possibly overlapping). So we want to first find all the points which are not in any of those ranges. To do this we need to first get all numbers between the minimum and maximum points (1 and 100 in this case). We do that either by using a Numbers table that was previously created or by building a Numbers cte as part of the query. I choose to build one with a cte since that way I did not have to assume the OP had a permanent numbers table. The method I used is a common one, the earliest reference I know to it is also by Itzik Ben-Gan at http://www.sqlmag.com/article/sql-server/virtual-auxiliary-table-of-numbers. Once I had this table of numbers I then kept only the numbers which were in the range 1 to100 and were not in any of the painted ranges in the original data. What’s left is a standard islands and gaps problem.

Conclusion:

As you can see, all three of the above solutions provide the result we were looking for, but do so in a very different style. The original thread provides variations of the solutions presented here as well as further discussion. Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.

Special thanks to Steven, Alejandro, and Tom for their valuable forums contribution and for contributing to this series!

Hope that helps,Sam Lester (MSFT)

Contributor Bios:

Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.