December 7, 2009
Some time ago the following question appeared on the comp.databases.oracle.server Usenet group:
http://groups.google.com/group/comp.databases.oracle.server/browse_thread/thread/aff40e7ccf09d9cc
On Oracle 10G, lets say I have a table that looks like this:
create table t1 (key varchar2(10), start_dtm date, end_dtm date);and I have data some of which over laps:
insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'), to_date('30/03/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'), to_date('30/03/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'), to_date('15/04/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'), to_date('30/12/2001', 'DD/MM/YYYY')); Ie: 1st Jan - 30th March 2nd Feb - 30th March 15th Feb - 15th April 1st July - 30th DecI want to return only two rows from the 4 that look like
1st jan - 15th April 1st July - 30th Decie – if the dates are continuous, squash all the continuous rows into 1 with the min start and max end date but if there is a gap it has to become a second row. There can be any number of rows and over laps etc. I reckon this can be done with some analytics trickery, but I just cannot seem to figure out how.
Any ideas?
I provided the following solution:
First, we experiment with LAG so that we are able to examine the previous row when sorted by START_DTM and END_DTM:
SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM; KEY START_DTM END_DTM M_START_D M_END_DTM ---------- --------- --------- --------- --------- 1234 01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01 1234 01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01 1234 15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01 1234 01-JUL-01 30-DEC-01
If we now slide the above into an inline view and see if the START_DTM is greater than M_END_DTM (the previous row’s END_DTM), we may have found a new series of dates (note, I just noticed that this may have problems if there are two rows with discrete date ranges that fall entirely within a third, larger date range), we will output 1 if true, or 0 if false:
SELECT KEY, START_DTM, END_DTM, M_START_DTM, M_END_DTM, DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM); KEY START_DTM END_DTM M_START_D M_END_DTM C ---------- --------- --------- --------- --------- ---------- 1234 01-JAN-01 30-MAR-01 0 1234 01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01 0 1234 15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01 0 1234 01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01 1
In the above, a 1 is output whenever there is a jump in the date range – if we use the COUNT analytical function to create a running count of the 1s and slide the above into an inline view, we actually create a column that may be used for grouping:
SELECT KEY, START_DTM START_DTM, END_DTM END_DTM, SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G FROM (SELECT KEY, START_DTM, END_DTM, M_START_DTM, M_END_DTM, DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM)); KEY START_DTM END_DTM G ---------- --------- --------- ---------- 1234 01-JAN-01 30-MAR-01 0 1234 01-FEB-01 30-MAR-01 0 1234 15-FEB-01 15-APR-01 0 1234 01-JUL-01 30-DEC-01 1
We are now able to group on the column G by again sliding the above into an inline view:
SELECT KEY, MIN(START_DTM) START_DTM, MAX(END_DTM) END_DTM FROM (SELECT KEY, START_DTM START_DTM, END_DTM END_DTM, SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G FROM (SELECT KEY, START_DTM, END_DTM, M_START_DTM, M_END_DTM, DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM))) GROUP BY KEY, G; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 15-APR-01 1234 01-JUL-01 30-DEC-01
Let’s add a little more data to see what happens:
insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'), to_date('29/12/2001', 'DD/MM/YYYY'));
insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'), to_date('15/01/2002', 'DD/MM/YYYY'));
KEY START_DTM END_DTM
---------- --------- ---------
1234 01-JAN-01 15-APR-01
1234 01-JUL-01 30-DEC-01
1234 31-DEC-01 15-JAN-02
Note the addition of the last row in the output – you may want to determine if that should be part of the previous group of date. You should be able to fix this by adding or subtracting a date in the SIGN( ) function.
—-
I originally thought that you were trying to eliminate this row from the output:
KEY START_DTM END_DTM ---------- --------- --------- 1234 01-FEB-01 30-MAR-01
For the above, there are several methods:
SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1 WHERE (T1.KEY, T1.START_DTM, T1.END_DTM) NOT IN ( SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM)); KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, (SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM)) T3 WHERE T1.KEY=T3.KEY(+) AND T1.START_DTM=T3.START_DTM(+) AND T1.END_DTM=T3.END_DTM(+) AND T3.KEY IS NULL; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1 MINUS SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM); KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT KEY, START_DTM, END_DTM FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM) M_START_DTM, MAX(T1.END_DTM) OVER (PARTITION BY T1.KEY, T1.START_DTM) M_END_DTM FROM T1) WHERE START_DTM=M_START_DTM AND END_DTM=M_END_DTM; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maxim Demenko provided the following solution:
Maybe, something like this
SQL> select key, min(start_dtm) start_dtm, max(end_dtm) end_dtm 2 from (select key, 3 start_dtm, 4 end_dtm, 5 sum(interval_change) over(partition by key order by end_dtm, start_dtm) interval_no 6 from (select key, 7 start_dtm, 8 end_dtm, 9 case 10 when (lag(end_dtm) 11 over(partition by key order by end_dtm, 12 start_dtm) - start_dtm) > 0 then 13 0 14 else 15 1 16 end interval_change 17 from t1)) 18 group by key, interval_no 19 order by key, interval_no 20 ; KEY START_DTM END_DTM ---------- -------------------- -------------------- 1234 01-Jan-2001 15-Apr-2001 1234 01-Jul-2001 30-Dec-2001
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The original poster followed up with the following solution that he developed:
Using the same table as above and this data:
create table t1 (key varchar2(10), start_dtm date, end_dtm date);
insert into t1 values( 9999, to_date('01/01/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY'));
insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY'));
insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/05/2008', 'DD/MM/YYYY'));
insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/07/2008', 'DD/MM/YYYY'));
This gives a set of rows like:
|------------------------------------------| |---------------------------------------| |---------------------| |------------------------------------------|
And we expect this to turn into a single row like
01/01/08 |----------------------------------------------| 01/07/2008
insert into t1 values( 9999, to_date('01/08/2008', 'DD/MM/YYYY'), null);
insert into t1 values( 9999, to_date('01/09/2008', 'DD/MM/YYYY'), to_date('01/10/2008', 'DD/MM/YYYY'));
This gives a set of rows like:
|---------------------------------------> |------------|
And we want an row that looks like
01/08/2008 |----------------------------> (null end date)
I think this query does it:
select key, min(start_dtm), max(end_dtm) from ( select key, start_dtm, end_dtm, max (grp) over (partition by key order by start_dtm asc) grp2 from ( SELECT key, start_dtm, NVL (end_dtm, '31-dec-4712') end_dtm, case when (start_dtm not between lag(start_dtm) over (partition by key order by start_dtm asc) and nvl( lag(end_dtm) over (partition by key order by start_dtm asc), '31-dec-4712') ) or lag(start_dtm) over (partition by key order by start_dtm asc) is null then row_number() over (partition by key order by start_dtm asc) end grp FROM t1 order by key, start_dtm asc ) a ) b group by key, grp2 GRP MIN(START MAX(END_D --------------------------------- --------- --------- 09999 01-JAN-08 01-JUL-08 09999 01-AUG-08 31-DEC-12
Basically I said, if you order the rows by increasing start date, then if rows overlap, the current start_dtm must be between the previous start_dtm and end_dtm – if there is no previous start_dtm or they don’t overlap, then its a new group.
The innermost select outputs something like:
KEY START_DTM END_DTM GRP --------------------------------- --------- --------- ---------- 09999 01-JAN-08 01-JUN-08 1 09999 01-FEB-08 01-JUL-08 09999 01-FEB-08 01-JUN-08 09999 01-FEB-08 01-MAY-08 09999 01-AUG-08 31-DEC-12 5 09999 01-SEP-08 01-OCT-08
Then we use analytics again to fill in the blanks in the enclosing query (b):
KEY START_DTM END_DTM GRP2 --------------------------------- --------- --------- ---------- 09999 01-JAN-08 01-JUN-08 1 09999 01-FEB-08 01-JUL-08 1 09999 01-FEB-08 01-JUN-08 1 09999 01-FEB-08 01-MAY-08 1 09999 01-AUG-08 31-DEC-12 5 09999 01-SEP-08 01-OCT-08 5
and then simply group by KEY, GRP in the outer query to get the result.
I am glad I went through the pain of figuring this out (took me a good 90 minutes) as I really feel like I get analytics now. It also allowed me to replace a temporary table and several 100 lines of PLSQL that was performing rather poorly and written by someone who doesn’t get analytics!

Recent Comments