How Many Ways to Solve this SQL Problem?

6 07 2011

July 6, 2011

Since there were so many unique solutions to the last blog article that posed a SQL challenge, I thought that I would try another blog article that asks a similar type of question.  Assume that someone showed you the following output:

 C2   D
--- ---
100   0
150  50
200  50
201   1
300  99
350  50
400  50
500 100 

You have the following table definition, and rows in the table:

CREATE TABLE T2 (
  C1 NUMBER,
  C2 NUMBER);

INSERT INTO T2 VALUES (1,100);
INSERT INTO T2 VALUES (4,150);
INSERT INTO T2 VALUES (7,200);
INSERT INTO T2 VALUES (8,201);
INSERT INTO T2 VALUES (10,300);
INSERT INTO T2 VALUES (14,350);
INSERT INTO T2 VALUES (18,400);
INSERT INTO T2 VALUES (24,500);

COMMIT;

Assume that you know nothing other than the fact that the C2 values are listed in ascending order when sorted by column C1.  How many different ways can this particular problem be solved.  Yes, there is an easy way, but assume that you were trying to “help educate” the person who provided the requested output.

My least-shortest-path solution follows:

SELECT
  C2,
  0 D
FROM
  T2
WHERE
  C1=(SELECT
        MIN(C1)
      FROM
        T2)
UNION ALL
SELECT
  V2.C2,
  V2.C2-MAX(T2.C2) D
FROM
  T2,
  (SELECT
    C1,
    C2
  FROM
    T2) V2
WHERE
  T2.C1<V2.C1
GROUP BY
  V2.C2
ORDER BY
  C2;

  C2    D
---- ----
 100    0
 150   50
 200   50
 201    1
 300   99
 350   50
 400   50
 500  100

8 rows selected. 

In the above, the row with the 0 in the D column was the hardest part of the solution.  Why would I use UNION ALL and not UNION – what was not in the specification?

This blog article was inspired by an old question found in a Usenet group from 1998 – if you were answering the question in 1998, would your answer be any different?  Be creative with your solution.  While you are thinking about a solution, take a look at this old Usenet thread and consider how difficult it was to find the “50 highest paid workers” in the last century.


Actions

Information

17 responses

6 07 2011
Stew Ashton
select C2, C2-LAG(C2,1,C2) over(order by C1) D from T2
order by c1;

In Oracle, obviously a job for analytics.

6 07 2011
Charles Hooper

Stew,

Nice example, and very close to what I had in mind when I wrote “Yes, there is an easy way”:

SELECT
  C2,
  C2-NVL(LAG(C2,1) OVER (ORDER BY C2), C2) D
FROM
  T2
ORDER BY
  C2;

As can be seen by the above, I forgot that it was possible to supply a default value for the result of the LAG function on the first row.

Any other solutions? Possibly something that either uses different analytic functions, or no analytic functions at all (for those pre-Oracle 8i Enterprise Edition databases / pre-Oracle 9i Standard Edition databases?

7 07 2011
Charles Hooper

I just realized that in the above example using the LAG analytic funtion, I *should* have ordered the rows by column C1 rather than C2 – the way that Stew ordered the rows was the *intended* ordering. The above SQL statement was an example of a quick note that I made while writing the article. In this case, Stew’s example has a couple of advantages over the LAG example that I had in mind.

6 07 2011
Maxim

Probably, everything what can be done with analytics, can as well be done with joins.
Assumed, lag would be not available,
you get that results with

select t1.c2, t1.c2 - nvl(max(t2.c2),t1.c2) d
from t2 t1, t2 t2
where t1.c2>t2.c2(+)
group by t1.c2
order by 1;

Best regards

Maxim

6 07 2011
Charles Hooper

Maxim,

That is a bit more efficient than what I posted in the blog article – nice example.

So far, we have:
* An example that demonstrates a query similar to what someone might write, if that person knows about UNION ALL, subqueries in the WHERE clause, and inline views, but not the syntax for an Oracle specific outer join (possibly the SQL in 10 Minutes book reader… that was me at one point, but I do not think that the book covers inline views, so that part of the SQL statement would need to be collapsed into the main query)
* A potentially efficient SQL statement using a self-join that works with both older and newer Oracle Database release versions
* A potentially (more) efficient SQL statement using the LAG analytic function to read the previous row’s value in a column that works with more recent Oracle Database release versions

Any other solutions? (yes!)

7 07 2011
gary

Scalar subquery rather than a regular join or inline view

select c2, nvl((select t2.c2-max(b.c2) from t2 b where b.c2 < t2.c2),0)
from t2;
7 07 2011
Charles Hooper

Gary,

Nice example with a scalar subquery.

Are there any other techniques for solving this problem? Ask yourself, “What if Oracle does not have this capability, what would I do different?” If we set efficiency aside for a moment, and assume that the following features were not supported (or a developer did not know that these were possibilities), how else could this problem be solved:
* Scalar subqueries (Gary’s example)
* LAG/LEAD analytic functions to peek at previous and next row values for a column (Stew’s example)
* Self join (Maxim’s example)

For instance, could the following be used as a starting point for a solution?:

SELECT
  LTRIM(SYS_CONNECT_BY_PATH(C2, ','),',') C2_LISTING
FROM
  (SELECT
    C2,
    ROW_NUMBER() OVER (ORDER BY C1) DR
  FROM
    T2)
WHERE
  CONNECT_BY_ISLEAF = 1
START WITH
  DR = 1
CONNECT BY DR = PRIOR DR+1;
 
C2_LISTING
-------------------------------
100,150,200,201,300,350,400,500

Note in the above that I used ROW_NUMBER rather than RANK or DENSE_RANK – the specification provided at the start of this article does not rule out the possibility of rows with duplicate values.

8 07 2011
Chris Saxon

From your starting point Charles, I came up with the following:

SELECT c2, c2 - nvl(prior c2, c2)
FROM
  (SELECT
    C2,
    ROW_NUMBER() OVER (ORDER BY C1) DR
  FROM
    T2)
START WITH
  DR = 1
CONNECT BY DR = PRIOR DR+1;

It’s also possible to generate the result by unioning the table with itself, but generating row numbers for the two sets that are out of sequence with each-other by one. Then pivoting the results to get the values from each set as separate columns and then subtracting the column values:

select y, y-nvl(z, y) from (
  select r, max(case when c = 1 then c2 end) y, max(case when c = 2 then c2 end) z
  from (
    select * from (select rownum r, t.* from (select c1 , c2 , 1 c from t2) t order by c1 desc)
    union all
    select * from (select rownum+1 r, t.* from (select c1 , c2 , 2 c from t2) t order by c1 desc)
  )
  group by r
)
where y is not null
order by r;

ROW_NUMBER could have been used instead of the rownum … order by, but I wanted to construct an example without analytic functions

8 07 2011
Charles Hooper

Chris,

Nice job finishing up the starting point that I left as a challenge. The fact that PRIOR can be used in the SELECT clause when using a CONNECT BY clause in the query probably is not well known (I occasionally forget the syntax).

Complicating your second example is the fact that the C1 values in the table have gaps in the sequence – this was intentional so that people could not create a simple self join (outer-join) that specified T1.C1 = T2.C1+1. What you posted is a very good work-around for the lack of gapless sequential values, and is a solution that I had not previously considered as a solution for this problem.

Have we listed all of the possible SQL only solutions at this point?

9 07 2011
Stew Ashton

Well, since you were so nice to my first solution, here’s another:

select C2, D
FROM T2
model
  dimension by ( ROW_NUMBER() over (order by C1) RN )
  measures ( C2, 0 D )
  rules ( D[rn>1] order by rn = c2[CV()] - c2[cv()-1] );

As in two other solutions, you need gapless sequential numbers to navigate to the previous row.

By the way, I didn’t know about that use of PRIOR, thanks Chris!

9 07 2011
Charles Hooper

Stew,

That is an interesting solution.

I have heard of the model clause, but I have never used it. I need to spend some time finding uses for the model clause to fully understand how it works and when to use it.

10 07 2011
Stew Ashton

Here is a variation of Chris’ second solution. I use a cartesian join instead of union all (which should save a full scan) and I don’t really pivot since I can just make C2 negative when needed and use SUM.

select max(c2) c2, sum(decode(pass,1,decode(rn,1,0,C2),-C2)) D from
(select level pass from DUAL connect by level <= 2) a,
(select rownum RN, T2.* from T2)
group by RN+pass
having SUM(PASS) <> 2
order by max(C1);

On “when to use the Model clause”, here is a link to a problem where the MODEL clause is the solution I prefer.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:13946369553642#3478381500346951056

10 07 2011
Stew Ashton

Not sure what happened there, should be “having sum(pass) not equal to 2”

10 07 2011
Stew Ashton

…and now miraculously the (not equals) sign has reappeared !?

10 07 2011
Charles Hooper

I saw your comment and quietly fixed the SQL statement to match your description. The less than and greater than characters tend to get lost because those characters have special meaning in HTML. != is a good substitute for <> or you can use (without spaces):
& lt; (for <)
& gt; (for >)

I am wondering if this part of the SQL statement could cause problems if the data is not added to the table blocks in a completely sequential order possibly due to the amount of free space remaining in the table’s blocks when the new rows are added:

(select rownum RN, T2.* from T2)

The current output of your SQL statement shows the expected result.

        C2          D
---------- ----------
       100          0
       150         50
       200         50
       201          1
       300         99
       350         50
       400         50
       500        100

Now let’s change the order of the rows in the table’s blocks:

TRUNCATE TABLE T2;
 
INSERT INTO T2 VALUES (1,100);
INSERT INTO T2 VALUES (10,300);
INSERT INTO T2 VALUES (14,350);
INSERT INTO T2 VALUES (4,150);
INSERT INTO T2 VALUES (18,400);
INSERT INTO T2 VALUES (24,500);
INSERT INTO T2 VALUES (7,200);
INSERT INTO T2 VALUES (8,201);

The output of your SQL statement:

        C2          D
---------- ----------
       100          0
       201          1
       300        200
       350         50
       350       -200
       400        250
       500        100
       500       -300

8 rows selected.

What you posted is an interesting approach to the problem – I believe that your query can be fixed by removing the ROWNUM from “(select rownum RN, T2.* from T2)”, sliding that into an inline view with an ORDER BY clause, and adding ROWNUM outside the inline view. I have NOT attempted this change in your SQL statement, but I suspect that it will work.

Thank you for supplying the link to the AskTom thread. I need to find some time to take a look at that thread.

11 07 2011
Stew Ashton

Thank you, Charles. This is poetic justice: I have recently pointed out the lack of ORDER BY in other queries, and in my turn I am guilty of disorderly conduct 🙂 Here is a corrected (and perhaps correct) version:
select max(C2) C2, SUM(DECODE(PASS,1,DECODE(RN,1,0,C2),-C2)) D from
( select level PASS from DUAL connect by level <= 2 ),
( select rownum RN, C1, C2 from ( select * from T2 order by C1 ) )
group by RN+PASS
having SUM(PASS) != 2
order by max(C1);

11 07 2011
Charles Hooper

Stew,

Nice solution that produces the expected output (with the out-of-order inserted rows):

SQL> select max(C2) C2, SUM(DECODE(PASS,1,DECODE(RN,1,0,C2),-C2)) D from
  2  ( select level PASS from DUAL connect by level <= 2 ),
  3  ( select rownum RN, C1, C2 from ( select * from T2 order by C1 ) )
  4  group by RN+PASS
  5  having SUM(PASS) != 2
  6  order by max(C1);

        C2          D
---------- ----------
       100          0
       150         50
       200         50
       201          1
       300         99
       350         50
       400         50
       500        100

It appears that a small adjustment to the above SQL creates another possible solution (this is just a small change, and should not be considered a unique solution to the problem):

select max(C2) C2, (MAX(C2)-MIN(C2)) D from
( select level PASS from DUAL connect by level <= 2 ),
( select rownum RN, C1, C2 from ( select * from T2 order by C1 ) )
group by RN+PASS
having SUM(PASS) != 2
order by max(C1);
 
        C2          D
---------- ----------
       100          0
       150         50
       200         50
       201          1
       300         99
       350         50
       400         50
       500        100

Leave a reply to Charles Hooper Cancel reply