Merging Temporal Intervals with Gaps

In Joining Temporal Intervals I explained how to join multiple temporal tables. The provided solution merges also temporal intervals but – as pointed out in that post – may produce wrong results if the underlying driving table is not gaplessly historized. In this post I’ll explain how to merge temporal intervals with various data constellations correctly.

You find the scripts to create and populate the tables for scenario A and B here.

Scenario A – No overlapping intervals

This scenario handles consistent data. This means no overlapping intervals, no duplicate intervals, no including intervals, no negative intervals (valid_to < valid_from). Here is the content of the example table t1:

Test Data Set t1

1

2

3

4

5

6

7

8

9

10

11

12

13

SQL> SELECT * FROM t1;

VID OID VALID_FROM VALID_TO C1 C2

---------- ---------- ----------- ----------- ---------- ----------

1 1 01-JAN-2010 31-DEC-2010 A B1

2 1 01-JAN-2011 31-MAR-2011 A B2

3 1 01-JUN-2011 31-JAN-2012 A B2

4 1 01-APR-2012 31-DEC-9999 A B4

5 2 01-JAN-2010 31-JUL-2012 B B1

6 2 01-AUG-2012 31-DEC-9999 B B2

18 4 01-JAN-2010 30-SEP-2011 D D1

19 4 01-OCT-2011 30-SEP-2012 D2

20 4 01-OCT-2012 31-DEC-9999 D D3

I’d like to write a query which produces all intervals for the columns OID and C1 honoring gaps in the historization. For OID 1 I expect that record 1 and 2 are merged, but records 3 and 4 are not merged because the intervals are not “connected”. For OID 2 I expected to get a single merged interval. For OID 4 I expect to get 3 records, since the records with C1=’D’ are not connected.

So, the following query result is expected:

Expected Result

1

2

3

4

5

6

7

8

9

OID VALID_FROM VALID_TO C1

---------- ----------- ----------- ----------

1 01-JAN-2010 31-MAR-2011 A

1 01-JUN-2011 31-JAN-2012 A

1 01-APR-2012 31-DEC-9999 A

2 01-JAN-2010 31-DEC-9999 B

4 01-JAN-2010 30-SEP-2011 D

4 01-OCT-2011 30-SEP-2012

4 01-OCT-2012 31-DEC-9999 D

The next query produces exactly this result.

Query to Merge Non-Overlapping Intervals

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

WITH

calc_variousAS(

-- produces column has_gap with the following meaning:

-- 1: offset > 0 between current and previous record (gap)

-- 0: offset = 0 between current and previous record (no gap)

-- produces column new_group with the following meaning:

-- 1: group-by-columns differ in current and previous record

-- 0: same group-by-columns in current and previous record

SELECToid,

valid_from,

valid_to,

c1,

c2,

CASE

WHENLAG(valid_to,1,valid_from-1)OVER(

PARTITIONBYoidORDERBYvalid_from

)=valid_from-1THEN

0

ELSE

1

ENDAShas_gap,

CASE

WHENLAG(c1,1,c1)OVER(

PARTITIONBYoidORDERBYvalid_from

)=c1THEN

0

ELSE

1

ENDASnew_group

FROMt1

),

calc_groupAS(

-- produces column group_no, records with the same group_no

-- are mergeable, group_no is calculated per oid

SELECToid,

valid_from,

valid_to,

c1,

c2,

SUM(has_gap+new_group)OVER(

PARTITIONBYoidORDERBYoid,valid_from

)ASgroup_no

FROMcalc_various

),

mergedAS(

-- produces the final merged result

-- grouping by group_no ensures that gaps are honored

SELECToid,

MIN(valid_from)ASvalid_from,

MAX(valid_to)ASvalid_to,

c1

FROMcalc_group

GROUP BYoid,c1,group_no

ORDERBYoid,valid_from

)

-- main

SELECT*FROMmerged;

The usability of the WITH clause aka Subquery Factoring Clause improved significantly with 11gR2. Since then it’s not necessary to reference all named queries anymore. The named queries become real transient views and this simplifies debugging a lot. – If you replace the content of line 57 with “SELECT * FROM calc_various ORDER BY oid, valid_from;” the query produces the following result:

Result of calc_various

1

2

3

4

5

6

7

8

9

10

11

OID VALID_FROM VALID_TO C1 C2 HAS_GAP NEW_GROUP

--- ----------- ----------- -- -- ------- ---------

1 01-JAN-2010 31-DEC-2010 A B1 0 0

1 01-JAN-2011 31-MAR-2011 A B2 0 0

1 01-JUN-2011 31-JAN-2012 A B2 1 0

1 01-APR-2012 31-DEC-9999 A B4 1 0

2 01-JAN-2010 31-JUL-2012 B B1 0 0

2 01-AUG-2012 31-DEC-9999 B B2 0 0

4 01-JAN-2010 30-SEP-2011 D D1 0 0

4 01-OCT-2011 30-SEP-2012 D2 0 1

4 01-OCT-2012 31-DEC-9999 D D3 0 1

You see that the value 1 for HAS_GAP indicates that the record is not “connected” with the previous record. Additionally the value 1 for the column NEW_GROUP indicates that the records must not be merged even if they are connected.

To simplify the calculation of NEW_GROUP for multiple group by columns (used in the named query “merged”) build a concatenated string of all relevant columns to deal with a single column similar to the column C1 in this example.

HAS_GAP and NEW_GROUP are used in the subsequent named query calc_group which produces the following result:

Result of calc_group

1

2

3

4

5

6

7

8

9

10

11

OID VALID_FROM VALID_TO C1 C2 GROUP_NO

--- ----------- ----------- -- -- --------

1 01-JAN-2010 31-DEC-2010 A B1 0

1 01-JAN-2011 31-MAR-2011 A B2 0

1 01-JUN-2011 31-JAN-2012 A B2 1

1 01-APR-2012 31-DEC-9999 A B4 2

2 01-JAN-2010 31-JUL-2012 B B1 0

2 01-AUG-2012 31-DEC-9999 B B2 0

4 01-JAN-2010 30-SEP-2011 D D1 0

4 01-OCT-2011 30-SEP-2012 D2 1

4 01-OCT-2012 31-DEC-9999 D D3 2

The GROUP_NO is calculated per OID. It’s technically a running total of HAS_GAP + NEW_GROUP. All intervals with the same GROUP_NO are mergeable. E.g. record 2, 3 and 4 have a different GROUP_NO which ensures that every single gap is honored for OID 1.

Scenario B – With Overlapping intervals

Reality is that we have sometimes to deal with inconsistent data. E.g. duplicate data intervals, overlapping data intervals or even negative data intervals (valid_to < valid_from). I’ve created an example table t2 which is in fact a copy of t1 but includes an additional messy OID 3 with the following data:

Test Data Set t2

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

SQL> SELECT * FROM t2 WHERE oid >= 3 ORDER BY vid;

VID OID VALID_FROM VALID_TO C1 C2

---------- ---------- ----------- ----------- ---------- ----------

7 3 01-JAN-2010 31-DEC-2010 C B1

8 3 01-JAN-2010 31-MAR-2010 C B2

9 3 01-JUN-2010 31-AUG-2010 C B3

10 3 01-OCT-2010 31-DEC-2010 C B4

11 3 01-FEB-2011 30-JUN-2011 C B5

12 3 01-FEB-2011 30-JUN-2011 C B6

13 3 01-JUN-2011 31-AUG-2011 C B7

14 3 31-AUG-2011 30-SEP-2011 C B8

15 3 01-DEC-2011 31-MAY-2012 C B9

16 3 01-DEC-2011 31-MAY-2012 C B9

17 3 01-JUN-2012 31-DEC-2012 C B10

The following figure visualizes the intervals of OID 3. The green patterns are expected to be used and the red pattern are expected to be ignored. The rationale is that the VID (version ID) is typically based on an Oracle Sequence and therefore it’s assumed that higher VIDs are newer and therefore more adequate. In real cases you may have additional information such a created_at or modified_on timestamps which helps you identifying the record to be used in conflicting situations. Since data is considered inconsistent in this scenario the cleansing strategy might be very different in real cases. However, cleansing is always the first step and in this example the highest VID has the highest priority in case of conflicts.

In cleansing step 1 we gather all potential starting intervals. We do not need all these intervals, but since we will merge intervals in the last processing step, we do not have to care about some unnecessary intervals right now.

Cleansing Step 1

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

SQL> WITH

2 o AS (

3 -- object identifier and their valid_from values

4 SELECT oid, valid_from FROM t2

5 UNION

6 SELECT oid, valid_to + 1 FROM t2

7 WHERE valid_to != DATE '9999-12-31'

8 )

9 -- main

10 SELECT * FROM o WHERE oid = 3;

OID VALID_FROM

---------- -----------

3 01-JAN-2010

3 01-APR-2010

3 01-JUN-2010

3 01-SEP-2010

3 01-OCT-2010

3 01-JAN-2011

3 01-FEB-2011

3 01-JUN-2011

3 01-JUL-2011

3 31-AUG-2011

3 01-SEP-2011

3 01-OCT-2011

3 01-DEC-2011

3 01-JUN-2012

3 01-JAN-2013

15 rows selected.

In cleansing step 2 we calculate the relevant VID for every previous result. Additionally we may inexpensively calculate if an interval is a gap.

Cleansing Step 1 and 2

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

SQL> WITH

2 o AS (

3 -- object identifier and their valid_from values

4 SELECT oid, valid_from FROM t2

5 UNION

6 SELECT oid, valid_to + 1 FROM t2

7 WHERE valid_to != DATE '9999-12-31'

8 ),

9 v AS (

10 -- relevant version identifier per valid_from

11 -- produces column is_gap

12 SELECT o.oid,

13 MAX(vid) AS vid,

14 o.valid_from,

15 NVL2(MAX(vid), 0, 1) AS is_gap

16 FROM o

17 LEFT JOIN t2

18 ON t2.oid = o.oid

19 AND o.valid_from BETWEEN t2.valid_from AND t2.valid_to

20 GROUP BY o.oid, o.valid_from

21 )

22 -- main

23 SELECT * FROM v WHERE oid = 3 ORDER BY valid_from;

OID VID VALID_FROM IS_GAP

---------- ---------- ----------- ----------

3 8 01-JAN-2010 0

3 7 01-APR-2010 0

3 9 01-JUN-2010 0

3 7 01-SEP-2010 0

3 10 01-OCT-2010 0

3 01-JAN-2011 1

3 12 01-FEB-2011 0

3 13 01-JUN-2011 0

3 13 01-JUL-2011 0

3 14 31-AUG-2011 0

3 14 01-SEP-2011 0

3 01-OCT-2011 1

3 16 01-DEC-2011 0

3 17 01-JUN-2012 0

3 01-JAN-2013 1

15 rows selected.

In cleansing step 3 we extend the previous result by the missing columns from table t2 and calculate the NEW_GROUP column with the same logic as in scenario A.

Cleansing Step 1, 2 and 3

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

SQL> WITH

2 o AS (

3 -- object identifier and their valid_from values

4 SELECT oid, valid_from FROM t2

5 UNION

6 SELECT oid, valid_to + 1 FROM t2

7 WHERE valid_to != DATE '9999-12-31'

8 ),

9 v AS (

10 -- relevant version identifier per valid_from

11 -- produces column is_gap

12 SELECT o.oid,

13 MAX(vid) AS vid,

14 o.valid_from,

15 NVL2(MAX(vid), 0, 1) AS is_gap

16 FROM o

17 LEFT JOIN t2

18 ON t2.oid = o.oid

19 AND o.valid_from BETWEEN t2.valid_from AND t2.valid_to

20 GROUP BY o.oid, o.valid_from

21 ),

22 combined AS (

23 -- combines previous intermediate result v with t2

24 -- produces the valid_to and new_group columns

25 SELECT t2.vid,

26 v.oid,

27 v.valid_from,

28 LEAD(v.valid_from - 1, 1, DATE '9999-12-31') OVER (

29 PARTITION BY v.oid ORDER BY v.valid_from

30 ) AS valid_to,

31 t2.c1,

32 t2.c2,

33 v.is_gap,

34 CASE

35 WHEN LAG(t2.c1, 1, t2.c1) OVER(

36 PARTITION BY t2.oid ORDER BY t2.valid_from

37 ) = c1 THEN

38 0

39 ELSE

40 1

41 END AS new_group

42 FROM v

43 LEFT JOIN t2

44 ON t2.oid = v.oid

45 AND t2.vid = v.vid

46 AND v.valid_from BETWEEN t2.valid_from AND t2.valid_to

47 )

48 -- main

49 SELECT * FROM combined WHERE oid = 3 ORDER BY valid_from;

VID OID VALID_FROM VALID_TO C1 C2 IS_GAP NEW_GROUP

--- --- ----------- ----------- -- --- ------ ---------

8 3 01-JAN-2010 31-MAR-2010 C B2 0 0

7 3 01-APR-2010 31-MAY-2010 C B1 0 0

9 3 01-JUN-2010 31-AUG-2010 C B3 0 0

7 3 01-SEP-2010 30-SEP-2010 C B1 0 0

10 3 01-OCT-2010 31-DEC-2010 C B4 0 0

3 01-JAN-2011 31-JAN-2011 1 1

12 3 01-FEB-2011 31-MAY-2011 C B6 0 0

13 3 01-JUN-2011 30-JUN-2011 C B7 0 0

13 3 01-JUL-2011 30-AUG-2011 C B7 0 0

14 3 31-AUG-2011 31-AUG-2011 C B8 0 0

14 3 01-SEP-2011 30-SEP-2011 C B8 0 0

3 01-OCT-2011 30-NOV-2011 1 1

16 3 01-DEC-2011 31-MAY-2012 C B9 0 0

17 3 01-JUN-2012 31-DEC-2012 C B10 0 0

3 01-JAN-2013 31-DEC-9999 1 1

15 rows selected.

Now we have cleansed the data and are ready for the final steps “calc_group” and “merge” which are very similar to scenario A. The relevant difference is the highlighted line 70 which filters non-gap records. Here is the complete statement and the query result:

Query to Merge Overlapping Intervals

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

WITH

oAS(

-- object identifier and their valid_from values

SELECToid,valid_fromFROMt2

UNION

SELECToid,valid_to+1FROMt2

WHEREvalid_to!=DATE'9999-12-31'

),

vAS(

-- relevant version identifier per valid_from

-- produces column is_gap

SELECTo.oid,

MAX(vid)ASvid,

o.valid_from,

NVL2(MAX(vid),0,1)ASis_gap

FROMo

LEFT JOINt2

ONt2.oid=o.oid

ANDo.valid_fromBETWEENt2.valid_fromANDt2.valid_to

GROUP BYo.oid,o.valid_from

),

combinedAS(

-- combines previous intermediate result v with t2

-- produces the valid_to and new_group columns

SELECTt2.vid,

v.oid,

v.valid_from,

LEAD(v.valid_from-1,1,DATE'9999-12-31')OVER(

PARTITIONBYv.oidORDERBYv.valid_from

)ASvalid_to,

t2.c1,

t2.c2,

v.is_gap,

CASE

WHENLAG(t2.c1,1,t2.c1)OVER(

PARTITIONBYt2.oidORDERBYt2.valid_from

)=c1THEN

0

ELSE

1

ENDASnew_group

FROMv

LEFT JOINt2

ONt2.oid=v.oid

ANDt2.vid=v.vid

ANDv.valid_fromBETWEENt2.valid_fromANDt2.valid_to

),

calc_groupAS(

-- produces column group_no, records with the same group_no

-- are mergeable, group_no is calculated per oid

SELECToid,

valid_from,

valid_to,

c1,

c2,

is_gap,

SUM(is_gap+new_group)OVER(

PARTITIONBYoidORDERBYoid,valid_from

)ASgroup_no

FROMcombined

),

mergedAS(

-- produces the final merged result

-- grouping by group_no ensures that gaps are honored

SELECToid,

MIN(valid_from)ASvalid_from,

MAX(valid_to)ASvalid_to,

c1

FROMcalc_group

WHEREis_gap=0

GROUP BYOID,c1,group_no

ORDERBYOID,valid_from

)

-- main

SELECT*FROMmerged;

Final Query Result

1

2

3

4

5

6

7

8

9

10

11

12

OID VALID_FROM VALID_TO C1

---------- ----------- ----------- ----------

1 01-JAN-2010 31-MAR-2011 A

1 01-JUN-2011 31-JAN-2012 A

1 01-APR-2012 31-DEC-9999 A

2 01-JAN-2010 31-DEC-9999 B

3 01-JAN-2010 31-DEC-2010 C

3 01-FEB-2011 30-SEP-2011 C

3 01-DEC-2011 31-DEC-2012 C

4 01-JAN-2010 30-SEP-2011 D

4 01-OCT-2011 30-SEP-2012

4 01-OCT-2012 31-DEC-9999 D

If you change line 70 to “WHERE is_gap = 1” you’ll get all gap records, just a way to query non-existing intervals.

All Gap Records

1

2

3

4

5

6

7

OID VALID_FROM VALID_TO C1

---------- ----------- ----------- ----------

1 01-APR-2011 31-MAY-2011

1 01-FEB-2012 31-MAR-2012

3 01-JAN-2011 31-JAN-2011

3 01-OCT-2011 30-NOV-2011

3 01-JAN-2013 31-DEC-9999

Conclusion

Merging temporal intervals is challenging, especially if the history has gaps and the data is inconsistent as in scenario B. However, the SQL engine is a powerful tool to clean up data and merge the temporal intervals efficiently in a single SQL statement.