Finding Rows with Common Attributes – Roman to Find a Solution in Something New

1 07 2011

July 1, 1011

A recent thread in the comp.databases.oracle.server Usenet group (actually two threads) asked an interesting question.  Assume that you had a detail table that contained several attributes for each of the unique key values.  How would one go about finding all of the unique key values that share the same set of attributes?  The sample set provided by the OP looks like this:

COL1   COL2
----  -----
I         a
I         b
I         c
II        a
II        b
III       a
III       b
III       c 

For the above, assume that the OP was interested in the attributes of “I”: a,b,c.  “II” lacks a “c” attribute, while “III” has the required “a”, “b”, and “c” attributes.  So, the OP would like to return C1 value “III” but not “II”.  I wonder if there is a simple solution for the OP?

First, let’s create our test data.  COL1 appears to contain Roman numbers – if we go beyond the number 3, those could be tricky to generate (unless of course you find the RN format parameter for the TO_CHAR function).  Let’s first create a temporary work table that contains the Roman numbers from 1 to 100 and a random number between 1 and 10:

CREATE TABLE T1_TEMP AS
SELECT
  TRIM(TO_CHAR(ROWNUM,'RN')) C1,
  TRUNC(DBMS_RANDOM.VALUE(1,10)+1) C2,
  ROWNUM C3
FROM
  DUAL
CONNECT BY
  LEVEL<=100; 

Let’s see what is in the T1_TEMP table:

COLUMN C1 FORMAT A10

SELECT
  *
FROM
  T1_TEMP
ORDER BY
  C3;

C1                 C2         C3
---------- ---------- ----------
I                  10          1
II                  4          2
III                 7          3
IV                  9          4
V                   8          5
VI                 10          6
VII                 9          7
VIII                4          8
IX                  4          9
X                  10         10
...
XCV                 5         95
XCVI                4         96
XCVII               8         97
XCVIII              7         98
XCIX               10         99
C                   4        100

100 rows selected. 

The row with the value “I” in column C1 has the number 10 in column C2, but that number might be a bit different in your temporary work table.  Column C2 will determine the number of attributes that are added for each of the values found in column C1 when we create the table T1 (note that we could have defined column C2 with the function CHR(96 + COUNTER) to place lowercase letters in that column, rather than numbers, to help reproduce the OP’s dataset):

CREATE TABLE T1 AS
SELECT
  T1_TEMP.C1,
  V1.COUNTER C2
FROM
  T1_TEMP,
  (SELECT
     ROWNUM COUNTER
   FROM
     DUAL
   CONNECT BY
     LEVEL<=10) V1
WHERE
  T1_TEMP.C2>=V1.COUNTER; 

Let’s see what is in table T1:

SELECT
  *
FROM
  T1
ORDER BY
  C1,
  C2;

C1                 C2
---------- ----------
C                   1
C                   2
C                   3
C                   4
I                   1
I                   2
I                   3
I                   4
I                   5
I                   6
I                   7
I                   8
I                   9
I                  10
...
XXXVII              1
XXXVII              2
XXXVII              3
XXXVIII             1
XXXVIII             2
XXXVIII             3
XXXVIII             4
XXXVIII             5
XXXVIII             6

634 rows selected.  

From the above output, you can see that we now have the number of rows in table T1 for each distinct value of C1 as was specified in table T1_TEMP.  An interesting side-note, the Roman number 100 (C) is less than the Roman number 1 (I) – I guess that explains why computers do not natively use Roman numbers for calculations.  🙂

For the next step, we need to collapse the different C2 values for each of the unique C1 values into a single row.  Oracle Database 11.2.0.1 introduced the LISTAGG function that makes easy work of this task, as shown in this earlier blog article.

COLUMN C2_LISTING FORMAT A22

SELECT
  C1,
  LISTAGG(TO_CHAR(C2), ',') WITHIN GROUP (ORDER BY C2) C2_LISTING
FROM
  T1
GROUP BY
  C1
ORDER BY
  C1;

C1         C2_LISTING
---------- --------------------
C          1,2,3,4
I          1,2,3,4,5,6,7,8,9,10
II         1,2,3,4
III        1,2,3,4,5,6,7
IV         1,2,3,4,5,6,7,8,9
IX         1,2,3,4
...
XXXV       1,2,3,4,5,6,7,8,9
XXXVI      1,2,3,4,5
XXXVII     1,2,3
XXXVIII    1,2,3,4,5,6

100 rows selected. 

The question remains, how can I find all of the unique C1 values that have all of the same attributes as the C1 value “I” – in this case 1,2,3,4,5,6,7,8,9,10?  One method slides the above query into a WITH block and then the WITH block is referenced twice in the main query:

WITH MY_VIEW AS
(SELECT
  C1,
  LISTAGG(TO_CHAR(C2), ',') WITHIN GROUP (ORDER BY C2) C2_LISTING
FROM
  T1
GROUP BY
  C1)
SELECT
  V2.C1,
  V2.C2_LISTING
FROM
  MY_VIEW V1,
  MY_VIEW V2
WHERE
  V1.C1='I'
  AND V1.C1<>V2.C1
  AND V1.C2_LISTING=V2.C2_LISTING
ORDER BY
  V2.C1;

C1         C2_LISTING
---------- --------------------
LVII       1,2,3,4,5,6,7,8,9,10
LXXI       1,2,3,4,5,6,7,8,9,10
LXXIII     1,2,3,4,5,6,7,8,9,10
VI         1,2,3,4,5,6,7,8,9,10
X          1,2,3,4,5,6,7,8,9,10
XCIX       1,2,3,4,5,6,7,8,9,10
XV         1,2,3,4,5,6,7,8,9,10
XXIX       1,2,3,4,5,6,7,8,9,10
XXXI       1,2,3,4,5,6,7,8,9,10

9 rows selected. 

How else might you solve the problem posted by the OP in the Usenet thread?


Actions

Information

21 responses

1 07 2011
Toon Koppelaars

Another way to solve this is to spot that the query is of the following form:

select t.C1
from (select distinct C1 from T) t
where set-of-C2-values-with-t.C1 = set-of-C2-values-with-a-given-C1-value

So it involves set equality, which is not supported in SQL.
But we can apply rewrite-rules to transform above query-text into one that only involves operators that *are* suppored in SQL. Cary has a blog describing how to do that for set equality:
http://carymillsap.blogspot.com/2009/03/last-call-for-c-j-date-course.html

1 07 2011
Charles Hooper

Toon,

Thank you for the comment – it gave me another idea. Why not perform a full outer join between the set with the C1 value equal to ‘I’ and the set that excludes the C1 values that are equal to ‘I’? Once that is done, you could count the NULL values, and throw out any C1 grouping value with at least 1 NULL value. Something like this:

SELECT
  C1,
  COUNT(V1_C2),
  COUNT(V2_C2)
FROM
  (SELECT
    NVL(V2.C1,V1.C1) C1,
    V1.C2 V1_C2,
    V2.C2 V2_C2
  FROM
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1='I') V1
    FULL OUTER JOIN
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1<>'I') V2
    ON
      V1.C2=V2.C2)
GROUP BY
  C1
ORDER BY
  C1;

The unfortunate thing is that Oracle treats the outer join as an inner join, and produces the following output (COUNT(V1_C2) was expected to be 10 in all rows using my sample data):

C1         COUNT(V1_C2) COUNT(V2_C2)
---------- ------------ ------------
C                     4            4
II                    4            4
III                   7            7
IV                    9            9
IX                    4            4
L                     3            3
LI                    5            5
LII                   7            7
LIII                  3            3
LIV                   8            8
LIX                   5            5
LV                    5            5
LVI                   5            5
LVII                 10           10
...
XXXV                  9            9
XXXVI                 5            5
XXXVII                3            3
XXXVIII               6            6
 
99 rows selected.

A round-about way of getting around the problem is to first create a Cartesian join between all of the unique C1 values that are not ‘I’ and all of the unique attributes of the rows with a C1 value of ‘I’, as shown below:

SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='I') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1<>'I') V2
 
C1                 C2
-----------------  --
...
LXXXIII             9
LXXXVI              9
IV                 10
XIII               10
XIV                10
XV                 10
XVIII              10
XXI                10
XLI                10
XLVIII             10
LV                 10
LXI                10
LXV                10
LXVII              10
LXVIII             10
LXXIV              10
XC                 10
 
990 rows selected.

Now, sliding the above into an inline view, we are able to full outer join the results of the above with all rows from table T1 where C1 is not ‘I’. Sliding that SQL statement into another inline view allows us to group the results to find the count of the NULL values, and use a HAVING clause to make certain that there are no NULL values on either side of the full outer join:

SELECT
  NVL(V11.C1,V12.C1) C1
FROM
  (SELECT
    V2.C1,
    V1.C2
  FROM
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1='I') V1,
    (SELECT DISTINCT
      C1
    FROM
      T1
    WHERE
      C1<>'I') V2) V11
  FULL OUTER JOIN
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1<>'I') V12
ON
  V11.C1=V12.C1
  AND V11.C2=V12.C2
GROUP BY
  NVL(V11.C1,V12.C1)
HAVING
  COUNT(NVL2(V12.C2,NULL,1))=0
  AND COUNT(NVL2(V11.C2,NULL,1))=0
ORDER BY
  C1;
 
C1
----------
LVII
LXXI
LXXIII
VI
X
XCIX
XV
XXIX
XXXI

9 rows selected.

The above results match the results found in this article. Let’s try again with the value “II” in column C1. First we will return the list using the original SQL statement:

WITH MY_VIEW AS
(SELECT
  C1,
  LISTAGG(TO_CHAR(C2), ',') WITHIN GROUP (ORDER BY C2) C2_LISTING
FROM
  T1
GROUP BY
  C1)
SELECT
  V2.C1,
  V2.C2_LISTING
FROM
  MY_VIEW V1,
  MY_VIEW V2
WHERE
  V1.C1='II'
  AND V1.C1<>V2.C1
  AND V1.C2_LISTING=V2.C2_LISTING
ORDER BY
  V2.C1;
 
C1         C2_LISTING
---------- ----------
C          1,2,3,4
IX         1,2,3,4
LXX        1,2,3,4
LXXV       1,2,3,4
LXXVII     1,2,3,4
LXXXIII    1,2,3,4
LXXXV      1,2,3,4
VIII       1,2,3,4
XC         1,2,3,4
XCIII      1,2,3,4
XCVI       1,2,3,4
XIII       1,2,3,4
XXIII      1,2,3,4
 
13 rows selected.

Now trying again with our fancy SQL statement using the full outer joins:

SELECT
  NVL(V11.C1,V12.C1) C1
FROM
  (SELECT
    V2.C1,
    V1.C2
  FROM
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1='II') V1,
    (SELECT DISTINCT
      C1
    FROM
      T1
    WHERE
      C1<>'II') V2) V11
  FULL OUTER JOIN
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1<>'II') V12
ON
  V11.C1=V12.C1
  AND V11.C2=V12.C2
GROUP BY
  NVL(V11.C1,V12.C1)
HAVING
  COUNT(NVL2(V12.C2,NULL,1))=0
  AND COUNT(NVL2(V11.C2,NULL,1))=0
ORDER BY
  C1;
 
C1
----------
C
IX
LXX
LXXV
LXXVII
LXXXIII
LXXXV
VIII
XC
XCIII
XCVI
XIII
XXIII
 
13 rows selected.

That matches, let’s try again with “III”:

WITH MY_VIEW AS
(SELECT
  C1,
  LISTAGG(TO_CHAR(C2), ',') WITHIN GROUP (ORDER BY C2) C2_LISTING
FROM
  T1
GROUP BY
  C1)
SELECT
  V2.C1,
  V2.C2_LISTING
FROM
  MY_VIEW V1,
  MY_VIEW V2
WHERE
  V1.C1='III'
  AND V1.C1<>V2.C1
  AND V1.C2_LISTING=V2.C2_LISTING
ORDER BY
  V2.C1;
 
C1         C2_LISTING
---------- -------------
LII        1,2,3,4,5,6,7
LXVI       1,2,3,4,5,6,7
LXXVI      1,2,3,4,5,6,7
LXXXVII    1,2,3,4,5,6,7
XCVIII     1,2,3,4,5,6,7
XLIII      1,2,3,4,5,6,7
XLVI       1,2,3,4,5,6,7
XVII       1,2,3,4,5,6,7
XXVIII     1,2,3,4,5,6,7
XXX        1,2,3,4,5,6,7
 
10 rows selected.
SELECT
  NVL(V11.C1,V12.C1) C1
FROM
  (SELECT
    V2.C1,
    V1.C2
  FROM
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1='III') V1,
    (SELECT DISTINCT
      C1
    FROM
      T1
    WHERE
      C1<>'III') V2) V11
  FULL OUTER JOIN
    (SELECT
      C1,
      C2
    FROM
      T1
    WHERE
      C1<>'III') V12
ON
  V11.C1=V12.C1
  AND V11.C2=V12.C2
GROUP BY
  NVL(V11.C1,V12.C1)
HAVING
  COUNT(NVL2(V12.C2,NULL,1))=0
  AND COUNT(NVL2(V11.C2,NULL,1))=0
ORDER BY
  C1;
 
C1
----------
LII
LXVI
LXXVI
LXXXVII
XCVIII
XLIII
XLVI
XVII
XXVIII
XXX
 
10 rows selected.

Any other solutions to the problem? True set operations?

2 07 2011
Charles Hooper

Another thought using the Oracle supplied set operations. First, a Cartesian join like what was used in the previous example:

SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='I') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'I') V2;
 
...
LIII        10
LXXI        10
LXXIII      10
LXXVIII     10
XCI         10
XCIX        10
 
990 rows selected.

Next, we will UNION ALL the above Cartesian join with all values from table T1 that do not have “I” in column C1:

SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='I') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'I') V2
UNION ALL
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C1'I';
 
...
XXIX        10
XXXI        10
LVII        10
LXXI        10
LXXIII      10
LXXXII      10
XCIX        10
 
1601 rows selected.

Now if we can group by columns C1 and C2 in the above output, all rows with a COUNT of 2 are of interest, but there may be some cases where there a COUNT of 1 will exist for the same C1 value. Therefore, we need to find those rows with a COUNT of 1 and work backwards:

SELECT
  C1,
  C2,
  COUNT(*) MATCHES
FROM
(SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='I') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'I') V2
UNION ALL
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C1'I')
GROUP BY
  C1,
  C2
HAVING
  COUNT(*)<2
ORDER BY
  C1,
  C2;
 
C1          C2 MATCHES
---------- --- -------
C            5       1
C            6       1
C            7       1
C            8       1
C            9       1
C           10       1
II           5       1
II           6       1
II           7       1
II           8       1
II           9       1
II          10       1
III          8       1
...
XXXVII      10       1
XXXVIII      7       1
XXXVIII      8       1
XXXVIII      9       1
XXXVIII     10       1
 
379 rows selected.

Now to clean up the above, just listing those C1 values that do not contain the same elements as the C1 value “I”:

 
SELECT DISTINCT
  C1
FROM
(SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='I') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'I') V2
UNION ALL
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C1'I')
GROUP BY
  C1,
  C2
HAVING
  COUNT(*)<2
ORDER BY
  C1;
 
C1
-------
C
II
III
IV
IX
L
LI
LII
LIII
LIV
LIX
LV
...
XXXV
XXXVI
XXXVII
XXXVIII
 
90 rows selected.

Now the final step, using the MINUS set operator to remove those C1 values:

SELECT
  C1
FROM
  T1
WHERE
  C1'I'
MINUS
SELECT
  *
FROM
(SELECT DISTINCT
  C1
FROM
(SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='I') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'I') V2
UNION ALL
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C1'I')
GROUP BY
  C1,
  C2
HAVING
  COUNT(*)<2)
ORDER BY
  C1;
 
C1
----------
LVII
LXXI
LXXIII
VI
X
XCIX
XV
XXIX
XXXI

9 rows selected.

Note that my dataset is not exactly the same as in the article and the previous comment – it is very close, but not identical.

Let’s try the new approach with the “II” values:

SELECT
  C1
FROM
  T1
WHERE
  C1'II'
MINUS
SELECT
  *
FROM
(SELECT DISTINCT
  C1
FROM
(SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='II') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'II') V2
UNION ALL
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C1'II')
GROUP BY
  C1,
  C2
HAVING
  COUNT(*)<2)
ORDER BY
  C1;
 
C1
----------
C
IX
LXX
LXXV
LXXVII
LXXVIII
LXXXV
VIII
XC
XCIII
XCVI
XIII
XXIII
 
13 rows selected.

The above is slightly different from what we saw earlier, so we will pass the original SQL statement over the dataset for confimation that the new SQL statement works:

C1         C2_LISTING
---------- ----------
C          1,2,3,4
IX         1,2,3,4
LXX        1,2,3,4
LXXV       1,2,3,4
LXXVII     1,2,3,4
LXXVIII    1,2,3,4
LXXXV      1,2,3,4
VIII       1,2,3,4
XC         1,2,3,4
XCIII      1,2,3,4
XCVI       1,2,3,4
XIII       1,2,3,4
XXIII      1,2,3,4
 
13 rows selected.

Let’s try again with “III”:

SELECT
  C1
FROM
  T1
WHERE
  C1'III'
MINUS
SELECT
  *
FROM
(SELECT DISTINCT
  C1
FROM
(SELECT
  V2.C1,
  V1.C2
FROM
  (SELECT
    C1,
    C2
  FROM
    T1
  WHERE
    C1='III') V1,
  (SELECT DISTINCT
    C1
  FROM
    T1
  WHERE
    C1'III') V2
UNION ALL
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C1'III')
GROUP BY
  C1,
  C2
HAVING
  COUNT(*)<2)
ORDER BY
  C1;
  
C1
----------
LII
LXVI
LXXVI
LXXXVII
XCVIII
XLIII
XLVI
XVII
XXVIII
XXX
 
10 rows selected.

The results of the original SQL statement for confirmation:

C1         C2_LISTING
---------- -------------
LII        1,2,3,4,5,6,7
LXVI       1,2,3,4,5,6,7
LXXVI      1,2,3,4,5,6,7
LXXXVII    1,2,3,4,5,6,7
XCVIII     1,2,3,4,5,6,7
XLIII      1,2,3,4,5,6,7
XLVI       1,2,3,4,5,6,7
XVII       1,2,3,4,5,6,7
XXVIII     1,2,3,4,5,6,7
XXX        1,2,3,4,5,6,7
 
10 rows selected.

I see that someone in the Usenet thread proposed using PL/SQL for the solution. Any other pure SQL solutions?

2 07 2011
Toon Koppelaars

Here’s another pure SQL solution:

select t.C1
from (select distinct C1 from T where C1 != 'the given C1') t
where not exists
        (select t2.C2
         from T t2
         where t2.C1 = t.C1
           and not exists
                 (select 'x'
                  from T t3
                  where t3.C2 = t2.C2 and t3.C1 = 'the given C1')
        )
  and not exists
        (select t2.C2
         from T t2
         where t2.C1 = 'the given C1'
           and not exists
                 (select 'x'
                  from T t3
                  where t3.C2 = t2.C2 and t3.C1 = t.C1)
        );
2 07 2011
Charles Hooper

Toon,

Thank you for taking the time to put together a solution for the problem. While I have not tested the performance of the various solutions, my first guess is that the solution that you provided should be more efficient than the two methods I posted that used Cartesian joins, if larger datasets are used.

I executed your version of the SQL statement solution with my current sample data:

select t.C1
from (select distinct C1 from T1 where C1 != 'I') t
where not exists
        (select t2.C2
        from T1 t2
        where t2.C1 = t.C1
          and not exists
                (select 'x'
                  from T1 t3
                  where t3.C2 = t2.C2 and t3.C1 = 'I')
        )
  and not exists
        (select t2.C2
        from T1 t2
        where t2.C1 = 'I'
          and not exists
                (select 'x'
                  from T1 t3
                  where t3.C2 = t2.C2 and t3.C1 = t.C1)
        );
  
C1
----------
XXXI
XCIX
LXXIII
VI
X
XV
LXXI
XXIX
LVII
 
9 rows selected.

II:

 
C1
----------
VIII
LXXVII
LXXVIII
LXXV
XC
XCIII
LXX
XIII
IX
XCVI
LXXXV
C
XXIII
 
13 rows selected.

III:

C1
----------
LXXXVII
XLVI
LII
XCVIII
XLIII
XXX
LXXVI
XVII
XXVIII
LXVI
 
10 rows selected.

Ignoring the sort order, the output is the same. I am not sure that I would have developed the solution that you provided without a couple of more hints.

2 07 2011
Log Buffer #227, A Carnival of the Vanities for DBAs | The Pythian Blog

[…] How would one go about finding all of the unique key values that share the same set of attributes? Charles Hooper […]

4 07 2011
Leonid

You are doing elementwise compare. But we can just count the number of matches, if element values ​​are insignificant.

Like this:

with v1 as
(select * from t1 where c1 != 'I'),
v2 as
(select * from t1 where c1 = 'I'),
v3 as
(select v1.c1, v1.c2, v2.c2 as sign from v1, v2
where v1.c2 = v2.c2 (+)),
v4 as
(select c1, count(c2) as count_all, count(distinct sign) as count_sign
from v3
group by c1)
select c1 from v4 where count_sign = count_all
and count_all = (select count(*) from v2)
order by c1 asc
4 07 2011
Charles Hooper

Leonid,

That is a creative solution to the problem – the SQL statement produces the same output as the above solutions. I even tried removing the row with C1 = ‘I’ and C2 = 1 – the query returned 0 rows as it should. I will have to spend some time determining how the solution works.

4 07 2011
Toon Koppelaars

And just for fun, here’s the solution I hinted with my first post:

select t.C1 as "has same C2-set as the-given-C1"
from (select distinct C1 from T where C1 != 'the given C1') t
where 0 = 
      (select count(*)
       from (select t1.C2 from T t1 where t1.C1 = t.C1
             minus
             select t2.C1 from T t2 where t2.C1 = 'the given C1'))
  and 0 = 
      (select count(*)
       from (select t1.C2 from T t1 where t1.C1 = 'the given C1'
             minus
             select t2.C1 from T t2 where t2.C1 = t.C1))
4 07 2011
Charles Hooper

Toon,

Thank you for sharing.

I executed this statement to create table T with the same data as is found in table T1:

CREATE TABLE T AS SELECT * FROM T1;

When I execute the SQL statement on 11.2.0.2 I see the following error message:

SQL> select t.C1 as "has same C2-set as I"
  2  from (select distinct C1 from T where C1 != 'I') t
  3  where 0 =
  4        (select count(*)
  5         from (select t1.C2 from T t1 where t1.C1 = t.C1
  6               minus
  7               select t2.C1 from T t2 where t2.C1 = 'I'))
  8    and 0 =
  9        (select count(*)
 10         from (select t1.C2 from T t1 where t1.C1 = 'I'
 11               minus
 12               select t2.C1 from T t2 where t2.C1 = t.C1));
       from (select t1.C2 from T t1 where t1.C1 = t.C1
                                                  *
ERROR at line 5:
ORA-00904: "T"."C1": invalid identifier

I wonder if this is a 11.2.0.2 specific issue?

4 07 2011
Toon Koppelaars

Ahh, that’s what you get when you don’t test (mea culpa).
This is the general restriction in Oracle’s SQL implementation: one cannot reference a table alias that is more than one level “up”.
I forgot all about that. So it’s a theoretical solution, not a practical one 😉

4 07 2011
Charles Hooper

Theory is helpful – thank you for explaining the cause of the error message. I have tried (unsuccessfully) a couple of times in the past to do the same thing with table aliases.

(10.2.0.5 returns the same message)

4 07 2011
Rob van Wijk

Charles,

My solution would be to use listagg, combined with analytic functions to use only a single table scan, like this:

SQL> exec dbms_stats.gather_table_stats(user,'t1')

PL/SQL procedure successfully completed.

SQL> var your_value varchar2(10)
SQL> exec :your_value := 'I'

PL/SQL procedure successfully completed.

SQL> select *
  2    from t1
  3   where c1 = :your_value
  4  /

C1                      C2
--------------- ----------
I                        1
I                        2
I                        3
I                        4
I                        5
I                        6
I                        7
I                        8

8 rows selected.

SQL> set serveroutput off
SQL> alter session set statistics_level = all
  2  /

Session altered.

SQL> select c1
  2       , c2_listing
  3    from ( select c1
  4                , listagg(to_char(c2), ',') within group (order by c2) c2_listing
  5                , first_value(listagg(to_char(c2), ',') within group (order by c2)) over (order by decode(c1,:your_value,1,2)) x
  6             from t1
  7            group by c1
  8         )
  9   where c2_listing = x
 10  /

C1              C2_LISTING
--------------- --------------------------------------------------
I               1,2,3,4,5,6,7,8
LXXX            1,2,3,4,5,6,7,8
LXXIX           1,2,3,4,5,6,7,8
IV              1,2,3,4,5,6,7,8
XXXVIII         1,2,3,4,5,6,7,8
XXXV            1,2,3,4,5,6,7,8
XLVIII          1,2,3,4,5,6,7,8
XVIII           1,2,3,4,5,6,7,8
XI              1,2,3,4,5,6,7,8

9 rows selected.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'))
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  8hhkguv0pcux7, child number 0
-------------------------------------
select c1      , c2_listing   from ( select c1               ,
listagg(to_char(c2), ',') within group (order by c2) c2_listing
      , first_value(listagg(to_char(c2), ',') within group (order by
c2)) over (order by decode(c1,:your_value,1,2)) x            from t1
       group by c1        )  where c2_listing = x

Plan hash value: 913157036

------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |      9 |00:00:00.01 |       3 |       |       |          |
|*  1 |  VIEW                |      |      1 |    100 |      9 |00:00:00.01 |       3 |       |       |          |
|   2 |   WINDOW SORT        |      |      1 |    100 |    100 |00:00:00.01 |       3 | 13312 | 13312 |12288  (0)|
|   3 |    SORT GROUP BY     |      |      1 |    100 |    100 |00:00:00.01 |       3 | 46080 | 46080 |40960  (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |    572 |    572 |00:00:00.01 |       3 |       |       |          |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C2_LISTING"="X")


25 rows selected.

Regards,
Rob.

4 07 2011
Charles Hooper

Rob,

Nice potential performance improvement. The FIRST_VALUE function did not occur to me as a possibility.

A lot of creativity in the solutions found in this article. Any other solutions?

Two more possible solutions:
#Start of solution 1: Retrieve the C2 value, its relative position, and the total number of elements:

SELECT
  C1,
  C2,
  ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN,
  COUNT(*) OVER  (PARTITION BY C1) CNT
FROM
  T1;

Copy the above twice into inline views. The first inline view will retrieve the values of interest, the second inline view will retrieve all of the other values. The WHERE clause makes certain that we have matching values and positions, and the HAVING clause makes certain that all elements are accounted for:

SELECT
  V2.C1
FROM
  (SELECT
    C1,
    C2,
    ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN,
    COUNT(*) OVER  (PARTITION BY C1) CNT
  FROM
    T1
  WHERE
    C1 = 'I') V1,
  (SELECT
    C1,
    C2,
    ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN,
    COUNT(*) OVER  (PARTITION BY C1) CNT
  FROM
    T1
  WHERE
    C1 != 'I') V2
WHERE
  V1.C2=V2.C2
  AND V1.RN=V2.RN
  AND V1.CNT=V2.CNT
GROUP BY
  V2.C1
HAVING
  COUNT(*)=MAX(V1.CNT);
 
C1
---------------
XXIX
XCIX
XXXI
LXXIII
VI
XV
X
LXXI
LVII

9 rows selected.

II:

C1
*---------------
XCIII
XXIII
XIII
XCVI
XC
LXXXV
C
LXXVII
VIII
IX
LXXV
LXXVIII
LXX

13 rows selected.

III:

C1
---------------
XLVI
XXVIII
LXVI
XCVIII
XVII
LXXVI
LXXXVII
XLIII
LII
XXX

10 rows selected.

#Start of solution 2, use the *old* method (before LISTAGG was an option) to generate a comma separated list:

SELECT
  C1,
  SUBSTR(SYS_CONNECT_BY_PATH(TO_CHAR(C2),','),2) C2_LISTING
FROM
  (SELECT
    C1,
    C2,
    ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN
  FROM
    T1)
CONNECT BY
  PRIOR RN = RN-1
  AND PRIOR C1 = C1
START WITH
  RN = 1;
 
C1              C2_LISTING
--------------- --------------------
C               1
C               1,2
C               1,2,3
C               1,2,3,4
I               1
I               1,2
I               1,2,3
I               1,2,3,4
I               1,2,3,4,5
I               1,2,3,4,5,6
I               1,2,3,4,5,6,7
I               1,2,3,4,5,6,7,8
I               1,2,3,4,5,6,7,8,9
I               1,2,3,4,5,6,7,8,9,10
...

Use a MAX function to retrieve just the complete list for each of the C1 values:

SELECT
  C1,
  MAX(C2_LISTING) C2_LISTING
FROM
  (SELECT
    C1,
    SUBSTR(SYS_CONNECT_BY_PATH(TO_CHAR(C2),','),2) C2_LISTING
  FROM
    (SELECT
      C1,
      C2,
      ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN
    FROM
      T1)
  CONNECT BY
    PRIOR RN = RN-1
    AND PRIOR C1 = C1
  START WITH
    RN = 1)
GROUP BY
  C1
ORDER BY
  C1;
 
C1              C2_LISTING
--------------- --------------------
C               1,2,3,4
I               1,2,3,4,5,6,7,8,9,10
II              1,2,3,4
III             1,2,3,4,5,6,7
IV              1,2,3,4,5,6,7,8,9
IX              1,2,3,4
L               1,2,3
... 
XXXV            1,2,3,4,5,6,7,8,9
XXXVI           1,2,3,4,5
XXXVII          1,2,3
XXXVIII         1,2,3,4,5,6

100 rows selected.

Create two inline views using the above logic. The first inline view will retrieve the values of interest, the second inline view will retrieve all of the other values. The WHERE clause simply locates those C2_LISTING values that match the values of interest.

SELECT
  V2.C1,
  V2.C2_LISTING
FROM
(SELECT
  C1,
  MAX(C2_LISTING) C2_LISTING
FROM
  (SELECT
    C1,
    SUBSTR(SYS_CONNECT_BY_PATH(TO_CHAR(C2),','),2) C2_LISTING
  FROM
    (SELECT
      C1,
      C2,
      ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN
    FROM
      T1
    WHERE
      C1='I')
  CONNECT BY
    PRIOR RN = RN-1
    AND PRIOR C1 = C1
  START WITH
    RN = 1)
GROUP BY
  C1) V1,
(SELECT
  C1,
  MAX(C2_LISTING) C2_LISTING
FROM
  (SELECT
    C1,
    SUBSTR(SYS_CONNECT_BY_PATH(TO_CHAR(C2),','),2) C2_LISTING
  FROM
    (SELECT
      C1,
      C2,
      ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2) RN
    FROM
      T1
    WHERE
      C1 != 'I')
  CONNECT BY
    PRIOR RN = RN-1
    AND PRIOR C1 = C1
  START WITH
    RN = 1)
GROUP BY
  C1) V2
WHERE
  V1.C2_LISTING = V2.C2_LISTING
ORDER BY
  V2.C1;

C1              C2_LISTING
--------------- --------------------
LVII            1,2,3,4,5,6,7,8,9,10
LXXI            1,2,3,4,5,6,7,8,9,10
LXXIII          1,2,3,4,5,6,7,8,9,10
VI              1,2,3,4,5,6,7,8,9,10
X               1,2,3,4,5,6,7,8,9,10
XCIX            1,2,3,4,5,6,7,8,9,10
XV              1,2,3,4,5,6,7,8,9,10
XXIX            1,2,3,4,5,6,7,8,9,10
XXXI            1,2,3,4,5,6,7,8,9,10

9 rows selected.

II:

C1              C2_LISTING
--------------- ----------
C               1,2,3,4
IX              1,2,3,4
LXX             1,2,3,4
LXXV            1,2,3,4
LXXVII          1,2,3,4
LXXVIII         1,2,3,4
LXXXV           1,2,3,4
VIII            1,2,3,4
XC              1,2,3,4
XCIII           1,2,3,4
XCVI            1,2,3,4
XIII            1,2,3,4
XXIII           1,2,3,4

III:

C1              C2_LISTING
--------------- -------------
LII             1,2,3,4,5,6,7
LXVI            1,2,3,4,5,6,7
LXXVI           1,2,3,4,5,6,7
LXXXVII         1,2,3,4,5,6,7
XCVIII          1,2,3,4,5,6,7
XLIII           1,2,3,4,5,6,7
XLVI            1,2,3,4,5,6,7
XVII            1,2,3,4,5,6,7
XXVIII           1,2,3,4,5,6,7
XXX             1,2,3,4,5,6,7

10 rows selected.

There must be at least one more solution to this problem without resorting to PL/SQL. Wasn’t there another way to create a comma separated list prior to the introduction of LISTAGG?

5 07 2011
Rob van Wijk

> Wasn’t there another way to create a comma separated list prior to the introduction of LISTAGG?

Creating a comma separated list is essentially string aggregation. Prior to LISTAGG, I’ve seen 6 different ways to do string aggregation. I mentioned them all in this blogpost: http://rwijk.blogspot.com/2008/05/string-aggregation-with-model-clause.html.
Your “old method” is one of them.

Regards,
Rob.

5 07 2011
Charles Hooper

Rob,

Thank you for the link… I was having trouble with the function names until I saw the OTN thread linked in your blog article. I believe that I was thinking of the COLLECT function, but the OTN thread also shows a couple of examples with the MODEL clause. Since I was looking for a SQL only solution, the COLLECT function is probably not enough without the help of PL/SQL:

SELECT
  C1,
  COLLECT(TO_CHAR(C2)) C2_LISTING
FROM
  T1
GROUP BY
  C1;

C1         C2_LISTING
---------- -------------------------------------------------------
C          SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '4', '3', '2')
I          SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '10', '9', '8', '7', '6', '5', '4', '3', '2')
II         SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '4', '3', '2')
III        SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '7', '6', '5', '4', '3', '2')
IV         SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '9', '8', '7', '6', '5', '4', '3', '2')
IX         SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '4', '3', '2')
L          SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '3', '2')
LI         SYSTPwdsOE9IeSjCswL2Sefy38w==('1', '5', '4', '3', '2')
...
SQL> SELECT
  2    C1,
  3    SUBSTR(C2_LISTING,INSTR(C2_LISTING,'==')+2)
  4  FROM
  5    (SELECT
  6      C1,
  7      COLLECT(TO_CHAR(C2)) C2_LISTING
  8    FROM
  9      T1
 10    GROUP BY
 11      C1);
  SUBSTR(C2_LISTING,INSTR(C2_LISTING,'==')+2)
                          *
ERROR at line 3:
ORA-00932: inconsistent datatypes: expected CHAR got -

A search for other information also found:
If we allow PL/SQL, http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2196162600402 shows the STRAGG and CONCAT_ALL custom PL/SQL routines.

This article also shows several methods: http://www.oracle-base.com/articles/misc/StringAggregationTechniques.php

There is a key difference in the OTN thread: you demonstrated timing the execution of each method to find the fastest solution, rather than just accepting the first solution that accomplishes the specified task. I suspect that it might also be a good idea to determine the impact on the CPU – is a solution that is twice as fast, but uses 8 times as much CPU time a good solution – it might in some environments, but not in others (for instance in a multi-user system where several sessions may try to concurrently execute the same SQL statement).

5 07 2011
Rob van Wijk

PS: I forgot to mention: the 6 ways are not in that blogpost itself, but in the OTN-thread that I refer to in the link “fastest solution” …

6 07 2011
Jan
SELECT distinct c1  
  FROM(
  SELECT * FROM(
     SELECT t1.*,a1.all_c1_cnt,
            COUNT(DISTINCT a1.c2) OVER (PARTITION BY a1.c1) AS matched_c1_cnt
          FROM(
               SELECT c1,c2,
                 COUNT(DISTINCT c2) OVER (PARTITION BY c1) AS all_c1_cnt
                    FROM t1 WHERE c1='I') a1, 
               t1
         WHERE a1.c1 != t1.c1           
           AND a1.c2=t1.c2)
    WHERE matched_c1_cnt= all_c1_cnt)
      ORDER BY 1;
6 07 2011
Charles Hooper

Jan,

I modified your SQL statement to show a not equal to as !=

If appears that the following line:

COUNT(DISTINCT a1.c2) OVER (PARTITION BY a1.c1) AS matched_c1_cnt

May need to be changed to replace all a1. entries with t1., as shown below (I receive 99 rows for the value of ‘I’ without this change, and 9 rows with the change):

COUNT(DISTINCT t1.c2) OVER (PARTITION BY t1.c1) AS matched_c1_cnt

That is an interesting approach to solving the problem:

SQL> SELECT distinct c1
  2    FROM(
  3    SELECT * FROM(
  4       SELECT t1.*,a1.all_c1_cnt,
  5              COUNT(DISTINCT t1.c2) OVER (PARTITION BY t1.c1) AS matched_c1_cnt
  6            FROM(
  7                 SELECT c1,c2,
  8                   COUNT(DISTINCT c2) OVER (PARTITION BY c1) AS all_c1_cnt
  9                      FROM t1 WHERE c1='I') a1,
 10                 t1
 11           WHERE a1.c1 != t1.c1
 12             AND a1.c2=t1.c2)
 13      WHERE matched_c1_cnt= all_c1_cnt)
 14        ORDER BY 1;

C1
----------
LVII
LXXI
LXXIII
VI
X
XCIX
XV
XXIX
XXXI

9 rows selected.

AS can be seen by the above output, that worked.

I think that there is a risk if all of the elements in the inner-most inline view are a subset of the elements in the outer inline-view (the inline view with != ). This issue is because the equijoin in the WHERE clause will be applied before the COUNT(DISTINCT t1.c2) is calculated:

SQL> SELECT distinct c1
  2    FROM(
  3    SELECT * FROM(
  4       SELECT t1.*,a1.all_c1_cnt,
  5              COUNT(DISTINCT t1.c2) OVER (PARTITION BY t1.c1) AS matched_c1_cnt
  6            FROM(
  7                 SELECT c1,c2,
  8                   COUNT(DISTINCT c2) OVER (PARTITION BY c1) AS all_c1_cnt
  9                      FROM t1 WHERE c1='II') a1,
 10                 t1
 11           WHERE a1.c1 != t1.c1
 12             AND a1.c2=t1.c2)
 13      WHERE matched_c1_cnt= all_c1_cnt)
 14        ORDER BY 1;

C1
----------
C
I
III
IV
IX
LI
LII
LIV
LIX
LV
LVI
LVII
...
XXXI
XXXIV
XXXIX
XXXV
XXXVI
XXXVIII

84 rows selected.

Any suggestions for improvements to make certain that the one set of elements is not merely a subset of the other set of elements?

6 07 2011
Jan

Thanks for your comments, I have fixed that issue with OUTER JOIN:

SELECT DISTINCT c1
      FROM(
      SELECT * FROM(
         SELECT t1.*,a1.all_c1_cnt,
                COUNT(DISTINCT t1.c2) OVER (PARTITION BY t1.c1) AS matched_c1_cnt
              FROM(
                   SELECT c1,c2,
                     COUNT(DISTINCT c2) OVER (PARTITION BY c1) AS all_c1_cnt
                        FROM t1 WHERE c1='II') a1
                  RIGHT JOIN t1
                    ON(a1.c2=t1.c2
                       AND a1.c1 != t1.c1))
       WHERE matched_c1_cnt= all_c1_cnt)
         ORDER BY 1;
6 07 2011
Charles Hooper

Jan,

That small change has fixed the SQL statement so that it produces the same output as the other methods.

Wow – a lot of different ways to solve this type of problem. Fortunately, SQL is not an abbreviation of “Standardized Query Language” – it would have been much too boring to see 10 solutions that were all identical. 🙂

Leave a reply to Toon Koppelaars Cancel reply