December 7, 2009
Some time ago the following question appeared on the comp.databases.oracle.server Usenet group:
http://groups.google.com/group/comp.databases.oracle.server/browse_thread/thread/aff40e7ccf09d9cc
On Oracle 10G, lets say I have a table that looks like this:
create table t1 (key varchar2(10), start_dtm date, end_dtm date);and I have data some of which over laps:
insert into t1 values ('1234', to_date('01/01/2001', 'DD/MM/YYYY'), to_date('30/03/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('01/02/2001', 'DD/MM/YYYY'), to_date('30/03/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('15/02/2001', 'DD/MM/YYYY'), to_date('15/04/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('01/07/2001', 'DD/MM/YYYY'), to_date('30/12/2001', 'DD/MM/YYYY')); Ie: 1st Jan - 30th March 2nd Feb - 30th March 15th Feb - 15th April 1st July - 30th DecI want to return only two rows from the 4 that look like
1st jan - 15th April 1st July - 30th Decie – if the dates are continuous, squash all the continuous rows into 1 with the min start and max end date but if there is a gap it has to become a second row. There can be any number of rows and over laps etc. I reckon this can be done with some analytics trickery, but I just cannot seem to figure out how.
Any ideas?
I provided the following solution:
First, we experiment with LAG so that we are able to examine the previous row when sorted by START_DTM and END_DTM:
SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LEAD(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LEAD(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM; KEY START_DTM END_DTM M_START_D M_END_DTM ---------- --------- --------- --------- --------- 1234 01-JAN-01 30-MAR-01 01-FEB-01 30-MAR-01 1234 01-FEB-01 30-MAR-01 15-FEB-01 15-APR-01 1234 15-FEB-01 15-APR-01 01-JUL-01 30-DEC-01 1234 01-JUL-01 30-DEC-01
If we now slide the above into an inline view and see if the START_DTM is greater than M_END_DTM (the previous row’s END_DTM), we may have found a new series of dates (note, I just noticed that this may have problems if there are two rows with discrete date ranges that fall entirely within a third, larger date range), we will output 1 if true, or 0 if false:
SELECT KEY, START_DTM, END_DTM, M_START_DTM, M_END_DTM, DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM); KEY START_DTM END_DTM M_START_D M_END_DTM C ---------- --------- --------- --------- --------- ---------- 1234 01-JAN-01 30-MAR-01 0 1234 01-FEB-01 30-MAR-01 01-JAN-01 30-MAR-01 0 1234 15-FEB-01 15-APR-01 01-FEB-01 30-MAR-01 0 1234 01-JUL-01 30-DEC-01 15-FEB-01 15-APR-01 1
In the above, a 1 is output whenever there is a jump in the date range – if we use the COUNT analytical function to create a running count of the 1s and slide the above into an inline view, we actually create a column that may be used for grouping:
SELECT KEY, START_DTM START_DTM, END_DTM END_DTM, SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G FROM (SELECT KEY, START_DTM, END_DTM, M_START_DTM, M_END_DTM, DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM)); KEY START_DTM END_DTM G ---------- --------- --------- ---------- 1234 01-JAN-01 30-MAR-01 0 1234 01-FEB-01 30-MAR-01 0 1234 15-FEB-01 15-APR-01 0 1234 01-JUL-01 30-DEC-01 1
We are now able to group on the column G by again sliding the above into an inline view:
SELECT KEY, MIN(START_DTM) START_DTM, MAX(END_DTM) END_DTM FROM (SELECT KEY, START_DTM START_DTM, END_DTM END_DTM, SUM(C) OVER (PARTITION BY KEY ORDER BY START_DTM,END_DTM) G FROM (SELECT KEY, START_DTM, END_DTM, M_START_DTM, M_END_DTM, DECODE(SIGN(START_DTM-NVL(M_END_DTM,START_DTM)),1,1,0) C FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, LAG(T1.START_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_START_DTM, LAG(T1.END_DTM) OVER (PARTITION BY T1.KEY ORDER BY T1.START_DTM,T1.END_DTM) M_END_DTM FROM T1 ORDER BY T1.START_DTM, T1.END_DTM))) GROUP BY KEY, G; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 15-APR-01 1234 01-JUL-01 30-DEC-01
Let’s add a little more data to see what happens:
insert into t1 values ('1234', to_date('10/10/2001', 'DD/MM/YYYY'), to_date('29/12/2001', 'DD/MM/YYYY')); insert into t1 values ('1234', to_date('31/12/2001', 'DD/MM/YYYY'), to_date('15/01/2002', 'DD/MM/YYYY')); KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 15-APR-01 1234 01-JUL-01 30-DEC-01 1234 31-DEC-01 15-JAN-02
Note the addition of the last row in the output – you may want to determine if that should be part of the previous group of date. You should be able to fix this by adding or subtracting a date in the SIGN( ) function.
—-
I originally thought that you were trying to eliminate this row from the output:
KEY START_DTM END_DTM ---------- --------- --------- 1234 01-FEB-01 30-MAR-01
For the above, there are several methods:
SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1 WHERE (T1.KEY, T1.START_DTM, T1.END_DTM) NOT IN ( SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM)); KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, (SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM)) T3 WHERE T1.KEY=T3.KEY(+) AND T1.START_DTM=T3.START_DTM(+) AND T1.END_DTM=T3.END_DTM(+) AND T3.KEY IS NULL; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1 MINUS SELECT T1.KEY, T1.START_DTM, T1.END_DTM FROM T1, T1 T2 WHERE T1.KEY=T2.KEY AND T1.START_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND T1.END_DTM BETWEEN T2.START_DTM AND T2.END_DTM AND ( T1.START_DTM<>T2.START_DTM OR T1.END_DTM<>T2.END_DTM); KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
SELECT KEY, START_DTM, END_DTM FROM (SELECT T1.KEY, T1.START_DTM, T1.END_DTM, MIN(T1.START_DTM) OVER (PARTITION BY T1.KEY, T1.END_DTM) M_START_DTM, MAX(T1.END_DTM) OVER (PARTITION BY T1.KEY, T1.START_DTM) M_END_DTM FROM T1) WHERE START_DTM=M_START_DTM AND END_DTM=M_END_DTM; KEY START_DTM END_DTM ---------- --------- --------- 1234 01-JAN-01 30-MAR-01 1234 15-FEB-01 15-APR-01 1234 01-JUL-01 30-DEC-01
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maxim Demenko provided the following solution:
Maybe, something like this
SQL> select key, min(start_dtm) start_dtm, max(end_dtm) end_dtm 2 from (select key, 3 start_dtm, 4 end_dtm, 5 sum(interval_change) over(partition by key order by end_dtm, start_dtm) interval_no 6 from (select key, 7 start_dtm, 8 end_dtm, 9 case 10 when (lag(end_dtm) 11 over(partition by key order by end_dtm, 12 start_dtm) - start_dtm) > 0 then 13 0 14 else 15 1 16 end interval_change 17 from t1)) 18 group by key, interval_no 19 order by key, interval_no 20 ; KEY START_DTM END_DTM ---------- -------------------- -------------------- 1234 01-Jan-2001 15-Apr-2001 1234 01-Jul-2001 30-Dec-2001
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The original poster followed up with the following solution that he developed:
Using the same table as above and this data:
create table t1 (key varchar2(10), start_dtm date, end_dtm date); insert into t1 values( 9999, to_date('01/01/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY')); insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/06/2008', 'DD/MM/YYYY')); insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/05/2008', 'DD/MM/YYYY')); insert into t1 values( 9999, to_date('01/02/2008', 'DD/MM/YYYY'), to_date('01/07/2008', 'DD/MM/YYYY'));
This gives a set of rows like:
|------------------------------------------| |---------------------------------------| |---------------------| |------------------------------------------|
And we expect this to turn into a single row like
01/01/08 |----------------------------------------------| 01/07/2008
insert into t1 values( 9999, to_date('01/08/2008', 'DD/MM/YYYY'), null); insert into t1 values( 9999, to_date('01/09/2008', 'DD/MM/YYYY'), to_date('01/10/2008', 'DD/MM/YYYY'));
This gives a set of rows like:
|---------------------------------------> |------------|
And we want an row that looks like
01/08/2008 |----------------------------> (null end date)
I think this query does it:
select key, min(start_dtm), max(end_dtm) from ( select key, start_dtm, end_dtm, max (grp) over (partition by key order by start_dtm asc) grp2 from ( SELECT key, start_dtm, NVL (end_dtm, '31-dec-4712') end_dtm, case when (start_dtm not between lag(start_dtm) over (partition by key order by start_dtm asc) and nvl( lag(end_dtm) over (partition by key order by start_dtm asc), '31-dec-4712') ) or lag(start_dtm) over (partition by key order by start_dtm asc) is null then row_number() over (partition by key order by start_dtm asc) end grp FROM t1 order by key, start_dtm asc ) a ) b group by key, grp2 GRP MIN(START MAX(END_D --------------------------------- --------- --------- 09999 01-JAN-08 01-JUL-08 09999 01-AUG-08 31-DEC-12
Basically I said, if you order the rows by increasing start date, then if rows overlap, the current start_dtm must be between the previous start_dtm and end_dtm – if there is no previous start_dtm or they don’t overlap, then its a new group.
The innermost select outputs something like:
KEY START_DTM END_DTM GRP --------------------------------- --------- --------- ---------- 09999 01-JAN-08 01-JUN-08 1 09999 01-FEB-08 01-JUL-08 09999 01-FEB-08 01-JUN-08 09999 01-FEB-08 01-MAY-08 09999 01-AUG-08 31-DEC-12 5 09999 01-SEP-08 01-OCT-08
Then we use analytics again to fill in the blanks in the enclosing query (b):
KEY START_DTM END_DTM GRP2 --------------------------------- --------- --------- ---------- 09999 01-JAN-08 01-JUN-08 1 09999 01-FEB-08 01-JUL-08 1 09999 01-FEB-08 01-JUN-08 1 09999 01-FEB-08 01-MAY-08 1 09999 01-AUG-08 31-DEC-12 5 09999 01-SEP-08 01-OCT-08 5
and then simply group by KEY, GRP in the outer query to get the result.
I am glad I went through the pain of figuring this out (took me a good 90 minutes) as I really feel like I get analytics now. It also allowed me to replace a temporary table and several 100 lines of PLSQL that was performing rather poorly and written by someone who doesn’t get analytics!
Would you need to change anything if the dates didn’t overlap, but were sequential?
Like Start Date 1/1/14, end date 1/31/14, Start Date 2/1/14 End Date 2/11/14.
I would like this bounced to one record: Start Date 1/1/14 End Date 2/11/14.
celauritz,
It has been several years since I created this example. My first thought is that you need to create the overlap – you could do that by adding one day to the ending date in the formula, so the 1/31/14 ending date would appear as 2/1/14 in the formula, thus causing an overlap situation.
Let’s set up a test table using the SQL statements at the start of this article (note that the dates in those SQL statements are in the format of dd/mm/yyyy, rather than mm/dd/yyyy):
Starting SQL statement, let’s see what dates are in the table:
You want the first two rows to be combined. Let’s see what happens when the next SQL statement in the blog article is executed:
The notes in the article state: “In the above, a 1 is output whenever there is a jump in the date range – if we use the COUNT analytical function to create a running count of the 1s and slide the above into an inline view, we actually create a column that may be used for grouping.” Based on that note, we need to fix the SQL statement above so that a 0 is output on the second row of the output. 1 must be added to the previous end date (M_END_DTM) to allow it to overlap the start date:
Next stage, carrying that one change forward:
Next stage, again with the one change carried forward:
I believe the above output is what you requested.
Hello Everyone,
Could you please help me in implementing below thing :
Input :
Key Date
48 8/25/2013
48 9/30/2013
48 11/1/2013
48 12/4/2013
48 1/4/2014
48 3/4/2014
Output :
Key Month Range
48 Aug 13-Sep 13
48 Nov 13-Jan 14
48 Mar 14
Need the month range for which the client has record in the table.
1000270748 12/4/2013
Let’s set up a test table:
The first thought that I have is to used the TRUNC function to find the first of the month for the D column (column TD below). We will then add one month to that date (column LTD) which may be compared to the first of the month in the next of the sorted rows. The LEAD function is used to retrieve the date from the next row (column ND), and the TRUNC function is applied to the date of the next row to obtain the first date of that month (column TND):
Now, we are able to slide the above into an inline view and determine if the first of the month of the date on the next row is after the value in the LTD column (this would mean that there is a gap in the months):
The above worked, but we have a problem if we want to use the approach that is demonstrated in this blog article. We need a 1 to appear in column C at the start of every sequence of dates, not at the end of the sequence of dates. We have hit a dead end with this approach.
Trying again, this time using LAG to look at the date from the previous row, rather than LEAD to look at the date from the next row in the ordered set of rows:
We should now be able to generate a column C that has the value 1 at the start of a new sequence of TD dates. Sliding the above into an inline view:
Next, we are able to use the SUM analytic function with an ORDER BY clause to assign the rows that are in the same month date sequence, the same grouping number (column SC):
Sliding the above into an inline view, we are able to group on columns KEY and SC to find the start and end month for each month date sequence:
The above example still needs a bit of cleanup to remove unnecessary columns from the inline views and a bit of formatting to clean up the final output.
Perfec.. Thank you so much for the solution.
Great implementation. You mention above an interesting case: “note, I just noticed that this may have problems if there are two rows with discrete date ranges that fall entirely within a third, larger date range”. Any ideas on how to handle that case?
Came up with an implementation using MAX(start_dtm) OVER (PARTITION BY key ORDER BY start_dtm ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) to see if the row would intersect any of the preceding range rather than LAG.
James,
I have been a bit preoccupied with a couple of unrelated items, so I have not had a chance to put together a solution for the potential issue that you highlighted. Did the MAX(START_DTM) OVER (PARTITION BY … method that you mentioned solve the potential issue that I mentioned?
Charles,
No problem! I know this is an older post that has gotten some recent interest, so I definitely understand.
I do think the solution I put together with the PARTITION … ROWS BETWEEN UNBOUNDED … works. I reversed the logic from your original, in which you check to see if the current row’s start is not between the previous row’s start and end to create a new group, and instead, I check to see if the current row’s start date is greater than the MAX(start date) OVER (…) the preceding rows in the partition. If so, I don’t create a new group and have an else condition that creates the group.
I haven’t profiled it, so I don’t know if the plan would be different if I had to do that for the start and end, but I at least saved myself some typing. When I get a minute, I’ll try to put an implementation together that uses the schema in the article and put it up here and see if you agree that it solves the problem.
Thanks again.
Worked this up quickly:
If you create a period that is completely intersected by one period and another that intersects the first but not the second, this potentially causes an issue in using LAG() and evaluating only the previous period. For example (forgive the slight syntax differences, as I am developing on SQL Server, and am representing date in MM/DD/YYYY format):
insert into t1 values (‘1234′, ’01/01/2001′, ’03/30/2001’);
insert into t1 values (‘1234′, ’01/02/2001′, ’03/30/2001’);
insert into t1 values (‘1234′, ’01/03/2001′, ’01/31/2001’);
insert into t1 values (‘1234′, ’02/15/2001′, ’04/15/2001’);
insert into t1 values (‘1234′, ’07/01/2001′, ’12/30/2001′);
The period Jan-3 to Jan 31 is entirely contained within the prior period but does not intersect the Feb-15 to Apr-15 period, which does intersect the Jan-1 to Mar-30 and the Jan-2 to Mar-30 period.
Using the implementation using LAG yields the following periods:
1234 2001-01-01 2001-03-30
1234 2001-02-15 2001-04-15
1234 2001-07-01 2001-12-30
Clearly, the Feb-15 period should be included in the prior period. So, changing this implementation to this:
select id, min(start_dtm), max(end_dtm)
from
(
select id, start_dtm, end_dtm, max (grp) over (partition by id order by start_dtm asc) grp2
from
(
SELECT id,
start_dtm,
isnull (end_dtm, ’31-dec-4712′) end_dtm,
case
when
(start_dtm not between min(start_dtm) over (partition by id order by start_dtm asc rows between unbounded preceding and 1 preceding)
and isnull( max(end_dtm) over (partition by id order by start_dtm asc rows between unbounded preceding and 1 preceding), ’31-dec-4712’) )
or lag(start_dtm) over (partition by id order by start_dtm asc) is null
then
row_number() over (partition by id order by start_dtm asc)
end grp
FROM t1
) a
) b
group by id, grp2
Yields the following expected result:
1234 2001-01-01 2001-04-15
1234 2001-07-01 2001-12-30
hi, i got one new requirement on forms for overlapping dates,
here is my task ,say i want to check based on id ( not for entire table ) for date ranges,
i will enter in first line start_date=01/FEB/2017 and end_date=10/FEB/2017 and second line
start_date=15/FEB/2017 and end_date=25/FEB/2017 and now in third line if i enter
start_date=30/Jan/2017 and end_date=12/FEB/2017 it should not accept because it is with in range for first line,
whenever it is within range it should fire trigger ,
example below ,
please provide me solution for this query .
Thanks in advance.
Regards,
Vishnu Vardhan.
The SQL for merging overlapping ranges is almost correct. However, it assumes that a row and the one prior to it MUST overlap in order to merge. Unfortunately, you could end up with a situation where Row1 overlaps with Row2, Row2 does NOT overlap with Row3, but Row1 DOES overlap with Row3. In that case, the three rows must be merged into one row. To handle this, we must check to see if there is overlap, not just with the previous row, but with any of the previous rows. We do this using the analytic min() and max() functions. The SQL becomes:
select key,
min(start_dtm) start_dtm,
max(end_dtm) end_dtm
from (select key,
start_dtm,
end_dtm,
max(grp) over(partition by key order by start_dtm asc) grp2
from (select key,
nvl(end_dtm, ’31-dec-4712′) end_dtm,
case
when (start_dtm not between min(prev_start_dtm)
over(partition by key order by start_dtm) and
nvl(max(prev_end_dtm)
over(partition by key order by start_dtm),
’31-dec-4712′))
or min(prev_start_dtm)
over(partition by key order by start_dtm) is null then
row_number() over(partition by key order by start_dtm)
end grp
from (select key,
lag(start_dtm) over(partition by key order by start_dtm) prev_start_dtm,
lag(end_dtm) over(partition by key order by start_dtm) prev_end_dtm
from t1)
order by key,
start_dtm) a) b
group by key,
grp2;
Note that I introduced an additional nested SQL query. This is necessary in order to capture the previous row start/end values. Running the queries individually starting from the innermost sub-query will demonstrate the reason for this additional level. Basically, we cannot take the min/max up to the current row. We have to do it over the previous row.
Can someone please help me to solve below issue in SQL Oracle?
I have data like below:
PRDCT Start_Date End_Date
——————————————————
D 05/01/2017 05/31/2017
V 05/01/2017 12/31/2020
In above example since Product ‘V’ duration (till 05/31/2017) also the part of Product ‘D’,
so the Expected result should be as below (This can be achieve using Pivot/Unpivot Oracle function):
Expected result:
PRDCT1 PRDCT2 Start_Date End_Date
————————————————————
D V 05/01/2017 05/31/2017
NULL D 06/01/2017 12/31/2020