SQL – Combining Overlapping Date Rows

8 12 2009

December 7, 2009

Some time ago the following question appeared on the comp.databases.oracle.server Usenet group:
http://groups.google.com/group/comp.databases.oracle.server/browse_thread/thread/aff40e7ccf09d9cc

On Oracle 10G, lets say I have a table that looks like this:

create table t1 (key varchar2(10), start_dtm date, end_dtm date);

and I have data some of which over laps:

insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'), to_date('30/03/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'), to_date('30/03/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'), to_date('15/04/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'), to_date('30/12/2001', 'DD/MM/YYYY'));
Ie:

1st Jan - 30th March
    2nd Feb - 30th March
       15th Feb - 15th April

1st July - 30th Dec

I want to return only two rows from the 4 that look like

1st jan - 15th April
1st July - 30th Dec

ie - if the dates are continuous, squash all the continuous rows into 1 with the min start and max end date but if there is a gap it has to become a second row.  There can be any number of rows and over laps etc. I reckon this can be done with some analytics trickery, but I just cannot seem to figure out how.

Any ideas?

I provided the following solution:

First, we experiment with LAG so that we are able to examine the previous row when sorted by START_DTM and END_DTM:

SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM,
  LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
  LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
  T1
ORDER BY
  T1.START_DTM,
  T1.END_DTM;

KEY        START_DTM END_DTM   M_START_D M_END_DTM
---------- --------- --------- --------- ---------
1234       01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01
1234       01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01
1234       15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01
1234       01-JUL-01 30-DEC-01

If we now slide the above into an inline view and see if the START_DTM is greater than M_END_DTM (the previous row's END_DTM), we may have found a new series of dates (note, I just noticed that this may have problems if there are two rows with discrete date ranges that fall entirely within a third, larger date range), we will output 1 if true, or 0 if false:

SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM);

KEY        START_DTM END_DTM   M_START_D M_END_DTM          C
---------- --------- --------- --------- --------- ----------
1234       01-JAN-01 30-MAR-01                              0
1234       01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01          0
1234       15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01          0
1234       01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01          1

In the above, a 1 is output whenever there is a jump in the date range - if we use the COUNT analytical function to create a running count of the 1s and slide the above into an inline view, we actually create a column that may be used for grouping:

SELECT
  KEY,
  START_DTM START_DTM,
  END_DTM END_DTM,
  SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM));

KEY        START_DTM END_DTM            G
---------- --------- --------- ----------
1234       01-JAN-01 30-MAR-01          0
1234       01-FEB-01 30-MAR-01          0
1234       15-FEB-01 15-APR-01          0
1234       01-JUL-01 30-DEC-01          1

We are now able to group on the column G by again sliding the above into an inline view:

SELECT
  KEY,
  MIN(START_DTM) START_DTM,
  MAX(END_DTM) END_DTM
FROM
(SELECT
  KEY,
  START_DTM START_DTM,
  END_DTM END_DTM,
  SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM)))
GROUP BY
  KEY,
  G;

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 15-APR-01
1234       01-JUL-01 30-DEC-01

Let's add a little more data to see what happens:

insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'), to_date('29/12/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'), to_date('15/01/2002', 'DD/MM/YYYY'));

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 15-APR-01
1234       01-JUL-01 30-DEC-01
1234       31-DEC-01 15-JAN-02

Note the addition of the last row in the output - you may want to determine if that should be part of the previous group of date.  You should be able to fix this by adding or subtracting a date in the SIGN( ) function.

----

I originally thought that you were trying to eliminate this row from the output:

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-FEB-01 30-MAR-01

For the above, there are several methods:

SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM
FROM
  T1
WHERE
  (T1.KEY,
  T1.START_DTM,
  T1.END_DTM) NOT IN (
    SELECT
      T1.KEY,
      T1.START_DTM,
      T1.END_DTM
    FROM
      T1,
      T1 T2
    WHERE
      T1.KEY=T2.KEY
      AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
      AND T1.END_DTM BETWEEN  T2.START_DTM AND T2.END_DTM
      AND (
        T1.START_DTM<>T2.START_DTM
        OR T1.END_DTM<>T2.END_DTM));

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01

 

SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM
FROM
  T1,
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM
  FROM
    T1,
    T1 T2
  WHERE
    T1.KEY=T2.KEY
    AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
    AND T1.END_DTM BETWEEN  T2.START_DTM AND T2.END_DTM
    AND (
      T1.START_DTM<>T2.START_DTM
      OR T1.END_DTM<>T2.END_DTM)) T3
WHERE
  T1.KEY=T3.KEY(+)
  AND T1.START_DTM=T3.START_DTM(+)
  AND T1.END_DTM=T3.END_DTM(+)
  AND T3.KEY IS NULL;

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01

 

SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM
FROM
  T1
MINUS
SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM
FROM
  T1,
  T1 T2
WHERE
  T1.KEY=T2.KEY
  AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM
  AND T1.END_DTM BETWEEN  T2.START_DTM AND T2.END_DTM
  AND (
    T1.START_DTM<>T2.START_DTM
    OR T1.END_DTM<>T2.END_DTM);

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01

 

SELECT
  KEY,
  START_DTM,
  END_DTM
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM) M_START_DTM,
    MAX(T1.END_DTM) OVER  (PARTITION BY T1.KEY, T1.START_DTM) M_END_DTM
  FROM
    T1)
WHERE
  START_DTM=M_START_DTM
  AND END_DTM=M_END_DTM;

KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-01 30-MAR-01
1234       15-FEB-01 15-APR-01
1234       01-JUL-01 30-DEC-01

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Maxim Demenko provided the following solution:

Maybe, something like this

SQL> select key, min(start_dtm) start_dtm, max(end_dtm) end_dtm
   2    from (select key,
   3                 start_dtm,
   4                 end_dtm,
   5                 sum(interval_change) over(partition by key order by end_dtm, start_dtm) interval_no
   6            from (select key,
   7                         start_dtm,
   8                         end_dtm,
   9                         case
  10                           when (lag(end_dtm)
  11                                 over(partition by key order by end_dtm,
  12                                      start_dtm) - start_dtm) > 0 then
  13                            0
  14                           else
  15                            1
  16                         end interval_change
  17                    from t1))
  18   group by key, interval_no
  19   order by key, interval_no
  20  ;

KEY        START_DTM            END_DTM
---------- -------------------- --------------------
1234       01-Jan-2001          15-Apr-2001
1234       01-Jul-2001          30-Dec-2001

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The original poster followed up with the following solution that he developed:

Using the same table as above and this data:

create table t1 (key varchar2(10), start_dtm date, end_dtm date);

insert into t1 values( 9999, to_date('01/01/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY'));
insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY'));
insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/05/2008', 'DD/MM/YYYY'));
insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/07/2008', 'DD/MM/YYYY'));

This gives a set of rows like:

|------------------------------------------|
   |---------------------------------------|
   |---------------------|
      |------------------------------------------|

And we expect this to turn into a single row like

01/01/08 |----------------------------------------------| 01/07/2008

 

insert into t1 values( 9999, to_date('01/08/2008', 'DD/MM/YYYY'), null);
insert into t1 values( 9999, to_date('01/09/2008', 'DD/MM/YYYY'), to_date('01/10/2008', 'DD/MM/YYYY'));

This gives a set of rows like:

|--------------------------------------->
   |------------|

And we want an row that looks like

01/08/2008 |----------------------------> (null end date)

I think this query does it:

select key, min(start_dtm), max(end_dtm)
from
(
  select key, start_dtm, end_dtm, max (grp) over (partition by key order by start_dtm asc) grp2
  from
  (
    SELECT key,
           start_dtm,
           NVL (end_dtm, '31-dec-4712') end_dtm,
           case
             when
               (start_dtm not between lag(start_dtm) over (partition by key order by start_dtm asc)
                   and nvl( lag(end_dtm) over (partition by key order by start_dtm asc), '31-dec-4712') )
                or lag(start_dtm) over (partition by key order by start_dtm asc) is null
                then
                  row_number() over (partition by key order by start_dtm asc)
             end grp
    FROM t1
    order by key, start_dtm asc
  ) a
) b
group by key, grp2

GRP                              MIN(START MAX(END_D
--------------------------------- --------- ---------
09999                             01-JAN-08 01-JUL-08
09999                             01-AUG-08 31-DEC-12

Basically I said, if you order the rows by increasing start date, then if rows overlap, the current start_dtm must be between the previous start_dtm and end_dtm - if there is no previous start_dtm or they don't overlap, then its a new group.

The innermost select outputs something like:

KEY                               START_DTM END_DTM          GRP
--------------------------------- --------- --------- ----------
09999                             01-JAN-08 01-JUN-08          1
09999                             01-FEB-08 01-JUL-08
09999                             01-FEB-08 01-JUN-08
09999                             01-FEB-08 01-MAY-08
09999                             01-AUG-08 31-DEC-12          5
09999                             01-SEP-08 01-OCT-08

Then we use analytics again to fill in the blanks in the enclosing query (b):

KEY                               START_DTM END_DTM         GRP2
--------------------------------- --------- --------- ----------
09999                             01-JAN-08 01-JUN-08          1
09999                             01-FEB-08 01-JUL-08          1
09999                             01-FEB-08 01-JUN-08          1
09999                             01-FEB-08 01-MAY-08          1
09999                             01-AUG-08 31-DEC-12          5
09999                             01-SEP-08 01-OCT-08          5

and then simply group by KEY, GRP in the outer query to get the result.

I am glad I went through the pain of figuring this out (took me a good 90 minutes) as I really feel like I get analytics now.  It also allowed me to replace a temporary table and several 100 lines of PLSQL that was performing rather poorly and written by someone who doesn't get analytics!


Actions

Information

2 responses

16 07 2014
celauritz

Would you need to change anything if the dates didn’t overlap, but were sequential?
Like Start Date 1/1/14, end date 1/31/14, Start Date 2/1/14 End Date 2/11/14.
I would like this bounced to one record: Start Date 1/1/14 End Date 2/11/14.

16 07 2014
Charles Hooper

celauritz,
It has been several years since I created this example. My first thought is that you need to create the overlap – you could do that by adding one day to the ending date in the formula, so the 1/31/14 ending date would appear as 2/1/14 in the formula, thus causing an overlap situation.

Let’s set up a test table using the SQL statements at the start of this article (note that the dates in those SQL statements are in the format of dd/mm/yyyy, rather than mm/dd/yyyy):

create table t1 (key varchar2(10), start_dtm date, end_dtm date);
 
insert into t1 values ('1234', to_date('01/01/2014', 'DD/MM/YYYY'), to_date('31/01/2014', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('01/02/2014', 'DD/MM/YYYY'), to_date('11/02/2014', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('15/02/2014', 'DD/MM/YYYY'), to_date('15/04/2014', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('01/07/2014', 'DD/MM/YYYY'), to_date('30/12/2014', 'DD/MM/YYYY'));
 
COMMIT;

Starting SQL statement, let's see what dates are in the table:

SELECT
  T1.KEY,
  T1.START_DTM,
  T1.END_DTM,
  LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
  LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
FROM
  T1
ORDER BY
  T1.START_DTM,
  T1.END_DTM;
 
KEY        START_DTM END_DTM   M_START_D M_END_DTM
---------- --------- --------- --------- ---------
1234       01-JAN-14 31-JAN-14 01-FEB-14 11-FEB-14
1234       01-FEB-14 11-FEB-14 15-FEB-14 15-APR-14
1234       15-FEB-14 15-APR-14 01-JUL-14 30-DEC-14
1234       01-JUL-14 30-DEC-14

You want the first two rows to be combined. Let's see what happens when the next SQL statement in the blog article is executed:

SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM);
 
KEY        START_DTM END_DTM   M_START_D M_END_DTM          C
---------- --------- --------- --------- --------- ----------
1234       01-JAN-14 31-JAN-14                              0
1234       01-FEB-14 11-FEB-14 01-JAN-14 31-JAN-14          1
1234       15-FEB-14 15-APR-14 01-FEB-14 11-FEB-14          1
1234       01-JUL-14 30-DEC-14 15-FEB-14 15-APR-14          1

The notes in the article state: "In the above, a 1 is output whenever there is a jump in the date range – if we use the COUNT analytical function to create a running count of the 1s and slide the above into an inline view, we actually create a column that may be used for grouping." Based on that note, we need to fix the SQL statement above so that a 0 is output on the second row of the output. 1 must be added to the previous end date (M_END_DTM) to allow it to overlap the start date:

SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM + 1,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM);
 
KEY        START_DTM END_DTM   M_START_D M_END_DTM          C
---------- --------- --------- --------- --------- ----------
1234       01-JAN-14 31-JAN-14                              0
1234       01-FEB-14 11-FEB-14 01-JAN-14 31-JAN-14          0
1234       15-FEB-14 15-APR-14 01-FEB-14 11-FEB-14          1
1234       01-JUL-14 30-DEC-14 15-FEB-14 15-APR-14          1

Next stage, carrying that one change forward:

SELECT
  KEY,
  START_DTM START_DTM,
  END_DTM END_DTM,
  SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM + 1,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM));
 
KEY        START_DTM END_DTM            G
---------- --------- --------- ----------
1234       01-JAN-14 31-JAN-14          0
1234       01-FEB-14 11-FEB-14          0
1234       15-FEB-14 15-APR-14          1
1234       01-JUL-14 30-DEC-14          2

Next stage, again with the one change carried forward:

SELECT
  KEY,
  MIN(START_DTM) START_DTM,
  MAX(END_DTM) END_DTM
FROM
(SELECT
  KEY,
  START_DTM START_DTM,
  END_DTM END_DTM,
  SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G
FROM
(SELECT
  KEY,
  START_DTM,
  END_DTM,
  M_START_DTM,
  M_END_DTM,
  DECODE(SIGN(START_DTM-NVL(M_END_DTM + 1,START_DTM)),1,1,0) C
FROM
  (SELECT
    T1.KEY,
    T1.START_DTM,
    T1.END_DTM,
    LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM,
    LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM
  FROM
    T1
  ORDER BY
    T1.START_DTM,
    T1.END_DTM)))
GROUP BY
  KEY,
  G;
 
KEY        START_DTM END_DTM
---------- --------- ---------
1234       01-JAN-14 11-FEB-14
1234       15-FEB-14 15-APR-14
1234       01-JUL-14 30-DEC-14

I believe the above output is what you requested.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

Join 139 other followers

%d bloggers like this: