SQL – COUNT Analytical Function, GROUP BY, HAVING

26 12 2009

December 26, 2009

A couple years ago the following question appeared on the comp.databases.oracle.misc Usenet group: http://groups.google.com/group/comp.databases.oracle.misc/browse_thread/thread/93bf8d1e75033d4c

I have a book table and in that table it has the book tile, publisher, and type of book it is. example mystery, scifi, etc…

I am trying to write a query that brings back a list of every pair of books that have the same publisher and same book type.  I have been able to get the following code to work:

select publisher_code, type
from book
group by publisher_code, type
having count(*) > 1;

which returns the following results:

PU TYP
-- ---
JP MYS
LB FIC
PE FIC
PL FIC
ST SFI
VB FIC

I can not figure out how to get the book title and book code for the books that this result list represents, everything I have tried throws out an error.

My initial response follows:

I see two possible methods:

  1.  Slide the SQL statement that you have written into an inline view, join the inline view to your book table, and then use the publisher_code, type columns to drive back into your book table.  The join syntax may look like one of the following: (publisher_code, type) IN (SELECT…)   or   b.publisher_code=ib.publisher_code and b.type=ib.type 
  2.  Use analytical functions (COUNT() OVER…) to determine the number of matches for the same publisher_code, type columns.  Then slide this SQL statement into an inline view to retrieve only those records with the aliased COUNT() OVER greater than 1.  This has the benefit of retrieving the matching rows in a single pass.

The original poster then attempted to create a query to meet the requirements, but the query generated an error:

SQL> select title
  2  from book
  3  where publisher_code, type in
  4  (select publisher_code, type
  5  from book
  6  group by publisher_code, type
  7  having count(*) > 1);
where publisher_code, type in
                    *
ERROR at line 3:
ORA-00920: invalid relational operator

My reponse continues:

Very close to what you need.  However, Oracle expects the column names to be wrapped in () … like this:  where (publisher_code, type) in

The above uses a subquery, which may perform slow on some Oracle releases compared to the use of an inline view.  Assume that I have a table named PART, which has columns ID, DESCRIPITION, PRODUCT_CODE, and COMMODITY_CODE, with ID as the primary key.  I want to find ID, DESCRIPTION, and COMMODITY_CODE for all parts with the same DESCRIPTION and PRODUCT_CODE, where there are at least 3 matching parts in the group:

The starting point, which looks similar to your initial query:

SELECT
  DESCRIPTION,
  PRODUCT_CODE,
  COUNT(*) NUM_MATCHES
FROM
  PART
GROUP BY
  DESCRIPTION,
  PRODUCT_CODE
HAVING
  COUNT(*)>=3;

When the original query is slid into an inline view and joined to the original table, it looks like this:

SELECT
  P.ID,
  P.DESCRIPTION,
  P.COMMODITY_CODE
FROM
  (SELECT
    DESCRIPTION,
    PRODUCT_CODE,
    COUNT(*) NUM_MATCHES
  FROM
    PART
  GROUP BY
    DESCRIPTION,
    PRODUCT_CODE
  HAVING
    COUNT(*)>=3) IP,
  PART P
WHERE
  IP.DESCRIPTION=P.DESCRIPTION
  AND IP.PRODUCT_CODE=P.PRODUCT_CODE;

Here is the DBMS_XPLAN:

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------
|*  1 |  HASH JOIN            |      |      1 |   1768 |  11525 |00:00:00.21 |    2748 |  1048K|  1048K| 1293K (0)|
|   2 |   VIEW                |      |      1 |   1768 |   1156 |00:00:00.11 |    1319 |       |       |          |
|*  3 |    FILTER             |      |      1 |        |   1156 |00:00:00.11 |    1319 |       |       |          |
|   4 |     HASH GROUP BY     |      |      1 |   1768 |  23276 |00:00:00.08 |    1319 |       |       |          |
|   5 |      TABLE ACCESS FULL| PART |      1 |  35344 |  35344 |00:00:00.04 |    1319 |       |       |          |
|   6 |   TABLE ACCESS FULL   | PART |      1 |  35344 |  35344 |00:00:00.04 |    1429 |       |       |          |
-------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("IP"."DESCRIPTION"="P"."DESCRIPTION" AND "IP"."PRODUCT_CODE"="P"."PRODUCT_CODE")
   3 - filter(COUNT(*)>=3)

The query format using the subquery looks like this:

SELECT
  P.ID,
  P.DESCRIPTION,
  P.COMMODITY_CODE
FROM
  PART P
WHERE
  (DESCRIPTION,PRODUCT_CODE) IN
  (SELECT
    DESCRIPTION,
    PRODUCT_CODE
  FROM
    PART
  GROUP BY
    DESCRIPTION,
    PRODUCT_CODE
  HAVING
    COUNT(*)>=3);

The DBMS_XPLAN, note that Oracle 10.2.0.2 transformed the query above:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------
|*  1 |  HASH JOIN RIGHT SEMI |          |      1 |      1 |  11525 |00:00:00.21 |    2748 |  1048K|  1048K| 1214K (0)|
|   2 |   VIEW                | VW_NSO_1 |      1 |   1768 |   1156 |00:00:00.12 |    1319 |       |       |          |
|*  3 |    FILTER             |          |      1 |        |   1156 |00:00:00.12 |    1319 |       |       |          |
|   4 |     HASH GROUP BY     |          |      1 |   1768 |  23276 |00:00:00.09 |    1319 |       |       |          |
|   5 |      TABLE ACCESS FULL| PART     |      1 |  35344 |  35344 |00:00:00.04 |    1319 |       |       |          |
|   6 |   TABLE ACCESS FULL   | PART     |      1 |  35344 |  35344 |00:00:00.01 |    1429 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("DESCRIPTION"="$nso_col_1" AND "PRODUCT_CODE"="$nso_col_2")
   3 - filter(COUNT(*)>=3)

Without allowing the automatic transformations in Oracle 10.2.0.2, the query takes _much_ longer than 0.21 seconds to complete.

The method using analytical functions starts like this:

SELECT
  P.ID,
  P.DESCRIPTION,
  P.COMMODITY_CODE,
  COUNT(*) OVER (PARTITION BY DESCRIPTION, PRODUCT_CODE) NUM_MATCHES
FROM
  PART P;

Then, sliding the above into an inline view:

SELECT
  ID,
  DESCRIPTION,
  COMMODITY_CODE
FROM
  (SELECT
    P.ID,
    P.DESCRIPTION,
    P.COMMODITY_CODE,
    COUNT(*) OVER (PARTITION BY DESCRIPTION, PRODUCT_CODE) NUM_MATCHES
  FROM
    PART P)
WHERE
  NUM_MATCHES>=3;

The DBMS_XPLAN for the above looks like this:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|*  1 |  VIEW               |      |      1 |  35344 |  11525 |00:00:00.31 |    1319 |       |       |          |
|   2 |   WINDOW SORT       |      |      1 |  35344 |  35344 |00:00:00.27 |    1319 |  2533K|   726K| 2251K (0)|
|   3 |    TABLE ACCESS FULL| PART |      1 |  35344 |  35344 |00:00:00.04 |    1319 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NUM_MATCHES">=3)

Note that there is only one TABLE ACCESS FULL of the PART table in the above.  The execution time required 0.31 seconds to complete, which is greater than the first two approaches, but that is because the database server is concurrently still trying to resolve the query method using the subquery with no permitted transformations (5+ minutes later).

Subquery method with no transformations permitted:

---------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------
|*  1 |  FILTER              |      |      1 |        |  11525 |00:46:21.46 |      38M|
|   2 |   TABLE ACCESS FULL  | PART |      1 |  35344 |  35344 |00:00:00.25 |    1429 |
|*  3 |   FILTER             |      |  29474 |        |   6143 |00:46:06.52 |      38M|
|   4 |    HASH GROUP BY     |      |  29474 |      1 |    613M|00:33:24.30 |      38M|
|   5 |     TABLE ACCESS FULL| PART |  29474 |  35344 |   1041M|00:00:02.54 |      38M|
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( IS NOT NULL)
   3 - filter(("DESCRIPTION"=:B1 AND "PRODUCT_CODE"=:B2 AND COUNT(*)>=3))

Maxim Demenko provided another possible solution for the problem experienced by the original poster.


Actions

Information

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

Join 141 other followers

%d bloggers like this: