July 1, 1011
A recent thread in the comp.databases.oracle.server Usenet group (actually two threads) asked an interesting question. Assume that you had a detail table that contained several attributes for each of the unique key values. How would one go about finding all of the unique key values that share the same set of attributes? The sample set provided by the OP looks like this:
COL1 COL2 ---- ----- I a I b I c II a II b III a III b III c
For the above, assume that the OP was interested in the attributes of “I”: a,b,c. “II” lacks a “c” attribute, while “III” has the required “a”, “b”, and “c” attributes. So, the OP would like to return C1 value “III” but not “II”. I wonder if there is a simple solution for the OP?
First, let’s create our test data. COL1 appears to contain Roman numbers – if we go beyond the number 3, those could be tricky to generate (unless of course you find the RN format parameter for the TO_CHAR function). Let’s first create a temporary work table that contains the Roman numbers from 1 to 100 and a random number between 1 and 10:
CREATE TABLE T1_TEMP AS SELECT TRIM(TO_CHAR(ROWNUM,'RN')) C1, TRUNC(DBMS_RANDOM.VALUE(1,10)+1) C2, ROWNUM C3 FROM DUAL CONNECT BY LEVEL<=100;
Let’s see what is in the T1_TEMP table:
COLUMN C1 FORMAT A10 SELECT * FROM T1_TEMP ORDER BY C3; C1 C2 C3 ---------- ---------- ---------- I 10 1 II 4 2 III 7 3 IV 9 4 V 8 5 VI 10 6 VII 9 7 VIII 4 8 IX 4 9 X 10 10 ... XCV 5 95 XCVI 4 96 XCVII 8 97 XCVIII 7 98 XCIX 10 99 C 4 100 100 rows selected.
The row with the value “I” in column C1 has the number 10 in column C2, but that number might be a bit different in your temporary work table. Column C2 will determine the number of attributes that are added for each of the values found in column C1 when we create the table T1 (note that we could have defined column C2 with the function CHR(96 + COUNTER) to place lowercase letters in that column, rather than numbers, to help reproduce the OP’s dataset):
CREATE TABLE T1 AS SELECT T1_TEMP.C1, V1.COUNTER C2 FROM T1_TEMP, (SELECT ROWNUM COUNTER FROM DUAL CONNECT BY LEVEL<=10) V1 WHERE T1_TEMP.C2>=V1.COUNTER;
Let’s see what is in table T1:
SELECT * FROM T1 ORDER BY C1, C2; C1 C2 ---------- ---------- C 1 C 2 C 3 C 4 I 1 I 2 I 3 I 4 I 5 I 6 I 7 I 8 I 9 I 10 ... XXXVII 1 XXXVII 2 XXXVII 3 XXXVIII 1 XXXVIII 2 XXXVIII 3 XXXVIII 4 XXXVIII 5 XXXVIII 6 634 rows selected.
From the above output, you can see that we now have the number of rows in table T1 for each distinct value of C1 as was specified in table T1_TEMP. An interesting side-note, the Roman number 100 (C) is less than the Roman number 1 (I) – I guess that explains why computers do not natively use Roman numbers for calculations. 🙂
For the next step, we need to collapse the different C2 values for each of the unique C1 values into a single row. Oracle Database 11.2.0.1 introduced the LISTAGG function that makes easy work of this task, as shown in this earlier blog article.
COLUMN C2_LISTING FORMAT A22 SELECT C1, LISTAGG(TO_CHAR(C2), ',') WITHIN GROUP (ORDER BY C2) C2_LISTING FROM T1 GROUP BY C1 ORDER BY C1; C1 C2_LISTING ---------- -------------------- C 1,2,3,4 I 1,2,3,4,5,6,7,8,9,10 II 1,2,3,4 III 1,2,3,4,5,6,7 IV 1,2,3,4,5,6,7,8,9 IX 1,2,3,4 ... XXXV 1,2,3,4,5,6,7,8,9 XXXVI 1,2,3,4,5 XXXVII 1,2,3 XXXVIII 1,2,3,4,5,6 100 rows selected.
The question remains, how can I find all of the unique C1 values that have all of the same attributes as the C1 value “I” – in this case 1,2,3,4,5,6,7,8,9,10? One method slides the above query into a WITH block and then the WITH block is referenced twice in the main query:
WITH MY_VIEW AS (SELECT C1, LISTAGG(TO_CHAR(C2), ',') WITHIN GROUP (ORDER BY C2) C2_LISTING FROM T1 GROUP BY C1) SELECT V2.C1, V2.C2_LISTING FROM MY_VIEW V1, MY_VIEW V2 WHERE V1.C1='I' AND V1.C1<>V2.C1 AND V1.C2_LISTING=V2.C2_LISTING ORDER BY V2.C1; C1 C2_LISTING ---------- -------------------- LVII 1,2,3,4,5,6,7,8,9,10 LXXI 1,2,3,4,5,6,7,8,9,10 LXXIII 1,2,3,4,5,6,7,8,9,10 VI 1,2,3,4,5,6,7,8,9,10 X 1,2,3,4,5,6,7,8,9,10 XCIX 1,2,3,4,5,6,7,8,9,10 XV 1,2,3,4,5,6,7,8,9,10 XXIX 1,2,3,4,5,6,7,8,9,10 XXXI 1,2,3,4,5,6,7,8,9,10 9 rows selected.
—
How else might you solve the problem posted by the OP in the Usenet thread?
Another way to solve this is to spot that the query is of the following form:
select t.C1
from (select distinct C1 from T) t
where set-of-C2-values-with-t.C1 = set-of-C2-values-with-a-given-C1-value
So it involves set equality, which is not supported in SQL.
But we can apply rewrite-rules to transform above query-text into one that only involves operators that *are* suppored in SQL. Cary has a blog describing how to do that for set equality:
http://carymillsap.blogspot.com/2009/03/last-call-for-c-j-date-course.html
Toon,
Thank you for the comment – it gave me another idea. Why not perform a full outer join between the set with the C1 value equal to ‘I’ and the set that excludes the C1 values that are equal to ‘I’? Once that is done, you could count the NULL values, and throw out any C1 grouping value with at least 1 NULL value. Something like this:
The unfortunate thing is that Oracle treats the outer join as an inner join, and produces the following output (COUNT(V1_C2) was expected to be 10 in all rows using my sample data):
A round-about way of getting around the problem is to first create a Cartesian join between all of the unique C1 values that are not ‘I’ and all of the unique attributes of the rows with a C1 value of ‘I’, as shown below:
Now, sliding the above into an inline view, we are able to full outer join the results of the above with all rows from table T1 where C1 is not ‘I’. Sliding that SQL statement into another inline view allows us to group the results to find the count of the NULL values, and use a HAVING clause to make certain that there are no NULL values on either side of the full outer join:
The above results match the results found in this article. Let’s try again with the value “II” in column C1. First we will return the list using the original SQL statement:
Now trying again with our fancy SQL statement using the full outer joins:
That matches, let’s try again with “III”:
—
Any other solutions to the problem? True set operations?
Another thought using the Oracle supplied set operations. First, a Cartesian join like what was used in the previous example:
Next, we will UNION ALL the above Cartesian join with all values from table T1 that do not have “I” in column C1:
Now if we can group by columns C1 and C2 in the above output, all rows with a COUNT of 2 are of interest, but there may be some cases where there a COUNT of 1 will exist for the same C1 value. Therefore, we need to find those rows with a COUNT of 1 and work backwards:
Now to clean up the above, just listing those C1 values that do not contain the same elements as the C1 value “I”:
Now the final step, using the MINUS set operator to remove those C1 values:
Note that my dataset is not exactly the same as in the article and the previous comment – it is very close, but not identical.
Let’s try the new approach with the “II” values:
The above is slightly different from what we saw earlier, so we will pass the original SQL statement over the dataset for confimation that the new SQL statement works:
Let’s try again with “III”:
The results of the original SQL statement for confirmation:
I see that someone in the Usenet thread proposed using PL/SQL for the solution. Any other pure SQL solutions?
Here’s another pure SQL solution:
Toon,
Thank you for taking the time to put together a solution for the problem. While I have not tested the performance of the various solutions, my first guess is that the solution that you provided should be more efficient than the two methods I posted that used Cartesian joins, if larger datasets are used.
I executed your version of the SQL statement solution with my current sample data:
II:
III:
Ignoring the sort order, the output is the same. I am not sure that I would have developed the solution that you provided without a couple of more hints.
[…] How would one go about finding all of the unique key values that share the same set of attributes? Charles Hooper […]
You are doing elementwise compare. But we can just count the number of matches, if element values are insignificant.
Like this:
Leonid,
That is a creative solution to the problem – the SQL statement produces the same output as the above solutions. I even tried removing the row with C1 = ‘I’ and C2 = 1 – the query returned 0 rows as it should. I will have to spend some time determining how the solution works.
And just for fun, here’s the solution I hinted with my first post:
Toon,
Thank you for sharing.
I executed this statement to create table T with the same data as is found in table T1:
When I execute the SQL statement on 11.2.0.2 I see the following error message:
I wonder if this is a 11.2.0.2 specific issue?
Ahh, that’s what you get when you don’t test (mea culpa).
This is the general restriction in Oracle’s SQL implementation: one cannot reference a table alias that is more than one level “up”.
I forgot all about that. So it’s a theoretical solution, not a practical one 😉
Theory is helpful – thank you for explaining the cause of the error message. I have tried (unsuccessfully) a couple of times in the past to do the same thing with table aliases.
(10.2.0.5 returns the same message)
Charles,
My solution would be to use listagg, combined with analytic functions to use only a single table scan, like this:
Regards,
Rob.
Rob,
Nice potential performance improvement. The FIRST_VALUE function did not occur to me as a possibility.
—
A lot of creativity in the solutions found in this article. Any other solutions?
Two more possible solutions:
#Start of solution 1: Retrieve the C2 value, its relative position, and the total number of elements:
Copy the above twice into inline views. The first inline view will retrieve the values of interest, the second inline view will retrieve all of the other values. The WHERE clause makes certain that we have matching values and positions, and the HAVING clause makes certain that all elements are accounted for:
II:
III:
#Start of solution 2, use the *old* method (before LISTAGG was an option) to generate a comma separated list:
Use a MAX function to retrieve just the complete list for each of the C1 values:
Create two inline views using the above logic. The first inline view will retrieve the values of interest, the second inline view will retrieve all of the other values. The WHERE clause simply locates those C2_LISTING values that match the values of interest.
II:
III:
There must be at least one more solution to this problem without resorting to PL/SQL. Wasn’t there another way to create a comma separated list prior to the introduction of LISTAGG?
> Wasn’t there another way to create a comma separated list prior to the introduction of LISTAGG?
Creating a comma separated list is essentially string aggregation. Prior to LISTAGG, I’ve seen 6 different ways to do string aggregation. I mentioned them all in this blogpost: http://rwijk.blogspot.com/2008/05/string-aggregation-with-model-clause.html.
Your “old method” is one of them.
Regards,
Rob.
Rob,
Thank you for the link… I was having trouble with the function names until I saw the OTN thread linked in your blog article. I believe that I was thinking of the COLLECT function, but the OTN thread also shows a couple of examples with the MODEL clause. Since I was looking for a SQL only solution, the COLLECT function is probably not enough without the help of PL/SQL:
A search for other information also found:
If we allow PL/SQL, http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2196162600402 shows the STRAGG and CONCAT_ALL custom PL/SQL routines.
This article also shows several methods: http://www.oracle-base.com/articles/misc/StringAggregationTechniques.php
There is a key difference in the OTN thread: you demonstrated timing the execution of each method to find the fastest solution, rather than just accepting the first solution that accomplishes the specified task. I suspect that it might also be a good idea to determine the impact on the CPU – is a solution that is twice as fast, but uses 8 times as much CPU time a good solution – it might in some environments, but not in others (for instance in a multi-user system where several sessions may try to concurrently execute the same SQL statement).
PS: I forgot to mention: the 6 ways are not in that blogpost itself, but in the OTN-thread that I refer to in the link “fastest solution” …
Jan,
I modified your SQL statement to show a not equal to as !=
If appears that the following line:
May need to be changed to replace all a1. entries with t1., as shown below (I receive 99 rows for the value of ‘I’ without this change, and 9 rows with the change):
That is an interesting approach to solving the problem:
AS can be seen by the above output, that worked.
I think that there is a risk if all of the elements in the inner-most inline view are a subset of the elements in the outer inline-view (the inline view with != ). This issue is because the equijoin in the WHERE clause will be applied before the COUNT(DISTINCT t1.c2) is calculated:
Any suggestions for improvements to make certain that the one set of elements is not merely a subset of the other set of elements?
Thanks for your comments, I have fixed that issue with OUTER JOIN:
Jan,
That small change has fixed the SQL statement so that it produces the same output as the other methods.
—
Wow – a lot of different ways to solve this type of problem. Fortunately, SQL is not an abbreviation of “Standardized Query Language” – it would have been much too boring to see 10 solutions that were all identical. 🙂