Full Table Scans and the Buffer Cache in 11.2 – What is Wrong with this Quote?

26 02 2012

February 26, 2012 (Modified February 27, 2012)

I found another interesting quote in the “Oracle Database 11gR2 Performance Tuning Cookbook“, this time related to tables and full table scans.  This quote is found on page 170 of the book:

“If we do an FTS [full table scan], database buffers are used to read all the table data, and this situation may lead to flushing the buffer cache data to make room for the FTS data. To avoid this situation and to limit the consequences on the database buffer cache, the database blocks from FTS operations are put on the top of the LRU (Least Recently Used) list.”

What, if anything, is wrong (and/or right) with the above quote from the book?

Added February 27, 2012:

Part 2:

An additional interesting quote is found on page 176 related to full table scans when indexes are present.  The test case that follows is slightly different than what is presented in the book, however the outcome is the same.  Consider the following table and indexes (note that histograms will be created on columns C1 and C2, and that column C2 will have a single row with a 1 value and 999,999 rows with a 0 value):

CREATE TABLE T1 AS
SELECT
  ROWNUM C1,
  DECODE(ROWNUM,1,1,0) C2,
  LPAD('A',255,'A') C3
FROM
  DUAL
CONNECT BY
  LEVEL<=1000000;

CREATE UNIQUE INDEX IND_T1_C1 ON T1(C1);
CREATE INDEX IND_T1_C2 ON T1(C2);

ALTER TABLE T1 MODIFY (C1 NOT NULL, C2 NOT NULL);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 254') 

The test case script:

SET LINESIZE 120
SET PAGESIZE 1000

SELECT
  C1,
  C2
FROM
  T1
WHERE
  C2=1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));
 
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C2<>0;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));

The (slightly reformatted) output from my execution of the above script on Oracle Database 11.2.0.2:

SQL> SELECT
  2    C1,
  3    C2
  4  FROM
  5    T1
  6  WHERE
  7    C2=1;

        C1         C2
---------- ----------
         1          1

SQL_ID  8fv30tbr8jdds, child number 0
-------------------------------------
SELECT   C1,   C2 FROM   T1 WHERE   C2=1

Plan hash value: 236868917

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |     1 |     8 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("C2"=1) 

---

SQL> SELECT
  2    C1,
  3    C2
  4  FROM
  5    T1
  6  WHERE
  7    C2<>0;

        C1         C2
---------- ----------
         1          1

SQL_ID  bu17044puyhkx, child number 0
-------------------------------------
SELECT   C1,   C2 FROM   T1 WHERE   C2<>0

Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |  3049 (100)|          |
|*  1 |  TABLE ACCESS FULL| T1   |     1 |     8 |  3049   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("C2"<>0)

Notice in the above execution plans that the Oracle query optimizer correctly determined that only 1 row would be returned in both cases, yet in the first case an index was used, and in the second case a full table scan.  The book states the following:

“Why did the database optimizer switch back to a long-running FTS operation, instead of the previous Index Range Scan? The answer is simple – indexes cannot be used when we compare values with a not equal operator.”

I have seen the above answer, with slight variations, provided in at least two other books.  What, if anything, is wrong (and/or right) with the above quote from the book?

The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.


Actions

Information

38 responses

26 02 2012
Mich talebzadeh

Hi Charles,

Glad you brought up this

Starting from 9.2 specific for table scans, the small window at the end of the LRU ceased to exist. For full table scans (FTS) blocks now go to Midpoint insertion. Oracle deploys a parameter called small table threshold that allows for special treatment of tables that are smaller than two percent of the size of the buffer cache. Although through my own testings I much doubt that small table threshold holds true.

I believe dealing with FTS depends on two factors:

1. The default behaviour of Oracle 11 optimizer that favours direct path reads in dealing with FTS. Direct read scans are full scans that bypass the buffer cache. Data is read directly from the disk into PGA.
2. Turning off direct path reads. This will force the optimiser to use conventional path reads like db file scattered reads. In that case the way oracle decides to promote table blocks to hot area differs.

These are the summary of my findings:

• With direct path read full table scans for tables below 10% of buffer cache will have their blocks promoted to hot area and tables will be fully cached. This is around 5 times the small table threshold limit
• With conventional path reads and assuming the availability of free buffers, a table of around 95% of buffer size can be cached

Regards,

Mich

27 02 2012
Mich talebzadeh

Just to clarify my sttaement above, that with conventional settings and default optimizer behaviour, with FTS, blocks from any table < 10% of buffer cache will move to hot area. Anythiing larger will be aged out (so called fetch and discard).

Mich

27 02 2012
Jonathan Lewis

Charles,
For the follow-up point:

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------
SQL_ID  d8031pdcqzhbh, child number 0
-------------------------------------
SELECT /*+ index(t1 (c2)) */   C1,   C2 FROM   T1 WHERE   C2 != 0

Plan hash value: 4220775576

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |       |       |  1825 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |     1 |     8 |  1825   (1)| 00:00:22 |
|*  2 |   INDEX FULL SCAN           | IND_T1_C2 |     1 |       |  1824   (1)| 00:00:22 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("C2" != 0)

With a hint in place I got a plan that used the index (with the necessary full scan, of course).
Interestingly the cost of this plan was lower than the cost of the default plan which, in my case, was an index hash join between the two available indexes. So – the book was right that there was an “obvious” index path that was not taken when it should have been, but it was wrong in that (a) an index path was still taken, and (b) the “obvious” index path was hintable – Oracle can use indexes to satisfy “not equals”.

(The example above actually had ‘less than / greater than’ for rather than ‘!=’, but I changed them while copying the text into this comment in case WordPress took the angle brackets as a format command.)

27 02 2012
Charles Hooper

(Just for clarification, the book selected all columns from the table, and only a single index was present on the equivalent of column C2. However, the statement in the book is written as an absolute.)

Jonathan,

It is interesting that the default plan in your case was an index hash join. If I try to force an index hash join, the calculated cost is a bit higher than that of the full table scan:

SELECT /*+ INDEX_JOIN(T1 (C1) (C2)) */
  C1,
  C2
FROM
  T1
WHERE
  C2!=0;
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));
 
SQL_ID  0ys2b1bnd775k, child number 0
-------------------------------------
SELECT /*+ INDEX_JOIN(T1 (C1) (C2)) */   C1,   C2 FROM   T1 WHERE
C2!=0
 
Plan hash value: 1523145522
 
-------------------------------------------------------------------------------------------
| Id  | Operation              | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                  |       |       |  4556 (100)|          |
|   1 |  VIEW                  | index$_join$_001 |     1 |     8 |  4556   (1)| 00:00:01 |
|*  2 |   HASH JOIN            |                  |       |       |            |          |
|*  3 |    INDEX FAST FULL SCAN| IND_T1_C2        |     1 |     8 |  2280   (1)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| IND_T1_C1        |     1 |     8 |  2624   (1)| 00:00:01 |
-------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(ROWID=ROWID)
   3 - filter("C2"!=0)

My test was performed with the following (mostly manually set) system statistics:

PNAME                PVAL1
--------------- ----------
STATUS
DSTART
DSTOP
FLAGS                    1
CPUSPEEDNW      3180.65693
IOSEEKTIM               10
IOTFRSPEED            4096
SREADTIM                 8
MREADTIM                10
CPUSPEED              2664
MBRC                    16
MAXTHR            19181568
SLAVETHR

Let’s alter that MBRC system statistic to 8, gather statistics again with NO_INVALIDATE set to FALSE and try the unhinted version of the query again:

EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC',8)
 
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 254',NO_INVALIDATE=>FALSE)
 
SELECT
  C1,
  C2
FROM
  T1
WHERE
  C2!=0;
 
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));
 
SQL_ID  bqcyx4u7pjzwa, child number 0
-------------------------------------
SELECT   C1,   C2 FROM   T1 WHERE   C2!=0
  
Plan hash value: 1523145522
 
-------------------------------------------------------------------------------------------
| Id  | Operation              | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                  |       |       |  4790 (100)|          |
|   1 |  VIEW                  | index$_join$_001 |     1 |     8 |  4790   (1)| 00:00:01 |
|*  2 |   HASH JOIN            |                  |       |       |            |          |
|*  3 |    INDEX FAST FULL SCAN| IND_T1_C2        |     1 |     8 |  2280   (1)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| IND_T1_C1        |     1 |     8 |  2624   (1)| 00:00:01 |
-------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(ROWID=ROWID)
   3 - filter("C2"!=0)

(<> manually changed to != for posting)

Interesting result… does that count as using the index when an inequality is specified on an indexed column? Richard Foote has an explanation for the FULL TABLE SCAN result.

13 03 2012
Mich talebzadeh

Hi Jonathan, Charles,

I did Charles original test as follows (note I called the table notequal)

CREATE TABLE notequal AS
SELECT
  ROWNUM c1,
  DECODE(ROWNUM,1,1,0) c2,
  LPAD('A',255,'A') c3
FROM
  DUAL
CONNECT BY
  LEVELUSER,TABNAME=>'NOTEQUAL',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 254')
exit

And ran the second query as follows:

alter system flush buffer_cache;
set timing on
set autotrace on
select c1, c2 from notequal where c2 !=0
/
exit

The result showed

System altered.


        C1         C2
---------- ----------
         1          1

Elapsed: 00:00:00.26

Execution Plan
----------------------------------------------------------
Plan hash value: 1217113573

-------------------------------------------------------------------------------------------
| Id  | Operation              | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                  |     1 |     8 |  6388   (1)| 00:01:17 |
|   1 |  VIEW                  | index$_join$_001 |     1 |     8 |  6388   (1)| 00:01:17 |
|*  2 |   HASH JOIN            |                  |       |       |            |          |
|*  3 |    INDEX FAST FULL SCAN| IND_NOTEQUAL_C2  |     1 |     8 |  2276   (1)| 00:00:28 |
|   4 |    INDEX FAST FULL SCAN| IND_NOTEQUAL_C1  |     1 |     8 |  2621   (1)| 00:00:32 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(ROWID=ROWID)
   3 - filter("C2" !=0)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       3930  consistent gets
       3911  physical reads
          0  redo size
        587  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

So sounds like it uses a hash join between the two indexes to cover the query without touching the base table. However, if we had added c3 then it had to go to table

select c1, c2, c3 from notequal where c2 !=0
System altered.


        C1         C2
---------- ----------
C3
------------------------------------------------------------------------------------------------------------------------------------
         1          1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


Elapsed: 00:00:01.47

Execution Plan
----------------------------------------------------------
Plan hash value: 94487321

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |     1 |   137 | 10504   (1)| 00:02:07 |
|*  1 |  TABLE ACCESS FULL| NOTEQUAL |     1 |   137 | 10504   (1)| 00:02:07 |
------------------------------------------------------------------------------

So it begs the question if my test results are valid?

13 03 2012
Charles Hooper

Mich,

My MBRC system statistic was set to 16 when I put together this blog article, where “TABLE ACCESS FULL” appeared in the execution plan for the != condition. Jonathan typically uses noworkload system statistics (so that his test case results are easier to reproduce) – the MBRC statistic has an effective default value of 8 when using noworkload system statistics. So, I changed the value of my MBRC to see what would happen:

EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC',8)

The above change was sufficient to result in the index hash join that Jonathan mentioned.

If you flush the shared pool (or collect object statistics again with NO_INVALIDATE set to FALSE) and then execute the following (my MBRC was set to 16, but I am suggesting that you try 32 instead – I am not sure that this will work):

EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC',32)

Do you find that the optimizer switches from an index hash join to a full table scan?

13 03 2012
Mich talebzadeh

Charles

My current MBRC is set to 128

show parameter db_file_multiblock;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_file_multiblock_read_count        integer     128

OK flush the shared pool and set DBMS_STATS.SET_SYSTEM_STATS(‘MBRC’,32)

alter system flush buffer_cache;
alter system flush shared_pool;
EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC',32)
set timing on
set autotrace on
select c1, c2 from notequal where c2 !=0
/
exit

I get the same!

System altered.


System altered.


PL/SQL procedure successfully completed.


        C1         C2
---------- ----------
         1          1

Elapsed: 00:00:00.27

Execution Plan
----------------------------------------------------------
Plan hash value: 1217113573

-------------------------------------------------------------------------------------------
| Id  | Operation              | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                  |     1 |     8 |  6388   (1)| 00:01:17 |
|   1 |  VIEW                  | index$_join$_001 |     1 |     8 |  6388   (1)| 00:01:17 |
|*  2 |   HASH JOIN            |                  |       |       |            |          |
|*  3 |    INDEX FAST FULL SCAN| IND_NOTEQUAL_C2  |     1 |     8 |  2276   (1)| 00:00:28 |
|   4 |    INDEX FAST FULL SCAN| IND_NOTEQUAL_C1  |     1 |     8 |  2621   (1)| 00:00:32 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(ROWID=ROWID)
   3 - filter("C2" !=0)

The dsame plan as before. So your point “Do you find that the optimizer switches from an index hash join to a full table scan?”. No it does not!

P.S. I tried setting MBRC to 8 but got the same plan.

13 03 2012
Charles Hooper

Mich,

Sorry – I should have stated that if your MBRC system statistic is already greater than 32, then my suggestion is not helpful. I am sure that you are aware of this, but the DB_FILE_MULTIBLOCK_READ_COUNT parameter and the MBRC system statistic have two different effects on costing (Randolf Geist has a couple of good write ups about the effects of the two on costing calculations) – with WORKLOAD system statistics the optimizer will use the MBRC system statistic rather than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value for cost calculations.

What version of Oracle Database are you running? What do you have for the SREADTIM and MREADTIM system statistic:

COLUMN PNAME FORMAT A15
SET PAGESIZE 1000
 
SELECT
  PNAME,
  PVAL1
FROM
  SYS.AUX_STATS$;

There is a bug in 11.2.0.1 and 11.2.0.2 related to the automatic calculations of the SREADTIM and MREADTIM values (mentioned here http://hoopercharles.wordpress.com/2011/05/04/how-to-collect-statistics/ ), and that bug might be affecting your results.

13 03 2012
Mich talebzadeh

Charles,

Thanks for clarification.

Your points
“What version of Oracle Database are you running? What do you have for the SREADTIM and MREADTIM system statistic:”

select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 – Production

SELECT
  PNAME,
  PVAL1
FROM
  SYS.AUX_STATS$;

PNAME                               PVAL1
------------------------------ ----------
STATUS
DSTART
DSTOP
FLAGS                                   1
CPUSPEEDNW                      2657.0122
IOSEEKTIM                              10
IOTFRSPEED                           4096
SREADTIM
MREADTIM
CPUSPEED
MBRC                                   32   That is 32 in my case
MAXTHR
SLAVETHR

13 rows selected.

Next I ran as default with trace enabled. Rebooted the server before run

SQL ID: cfsuq8472rpky
Plan Hash: 1217113573
select c1, c2
from
 notequal where c2 !=0


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.15       0.35       3911       3930          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.15       0.35       3911       3930          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 97

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  VIEW  index$_join$_001 (cr=3930 pr=3911 pw=0 time=0 us cost=6388 size=8 card=1)
      1   HASH JOIN  (cr=3930 pr=3911 pw=0 time=0 us)
      1    INDEX FAST FULL SCAN IND_NOTEQUAL_C2 (cr=1826 pr=1818 pw=0 time=0 us cost=2276 size=8 card=1)(object id 87116)
1000000    INDEX FAST FULL SCAN IND_NOTEQUAL_C1 (cr=2104 pr=2093 pw=0 time=639656 us cost=2621 size=8 card=1)(object id 87115)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  Disk file operations I/O                        3        0.00          0.00
  db file sequential read                         2        0.00          0.00
  db file scattered read                         62        0.00          0.12
  SQL*Net message from client                     2        0.00          0.00

So that is 3910 pio and 3939 lio in 350 ms

Now let me try with tablescan hint after reboot

SQL ID: 0w5tn3q1d0bmg
Plan Hash: 94487321
select /*+ FULL(notequal) */ c1, c2
from
 notequal where c2 !=0


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00       1.14      38463      38468          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.01       1.14      38463      38468          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 97

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  TABLE ACCESS FULL NOTEQUAL (cr=38468 pr=38463 pw=0 time=0 us cost=7479 size=8 card=1)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  Disk file operations I/O                        3        0.00          0.00
  db file sequential read                         1        0.00          0.00
  direct path read                              308        0.00          1.06
  SQL*Net message from client                     2        0.00          0.00

So this one does it with 38,463 pio, 38,468 lio in 1,140 ms

On the face of it the default behaviour (using two indices and hash join) looks a better choice.

13 03 2012
Charles Hooper

Mich,

Thanks for posting the output of SYS.AUX_STATS$. The optimizer for your database is currently using noworkload statistics, and if I am remembering Randolf Geist’s articles correctly, the optimizer will use an effective MBRC system statistic value of 8 if your DB_FILE_MULTIBLOCK_READ_COUNT parameter IS NOT set in the init.ora (or spfile), or a value based on the DB_FILE_MULTIBLOCK_READ_COUNT parameter if that parameter IS set in the init.ora (or spfile). So, if your DB_FILE_MULTIBLOCK_READ_COUNT parameter was autoset to a value of 128, your results should be similar to those indicated by Jonathan above even though you adjusted the MBRC system statistic to a value of 32 using the DBMS_STATS.SET_SYSTEM_STATS procedure.

If you want to see the same results that I observed, you can manually set the same workload system statistics that I used for my test case in this article:

EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC',16)
EXEC DBMS_STATS.SET_SYSTEM_STATS('CPUSPEED',2664)
EXEC DBMS_STATS.SET_SYSTEM_STATS('SREADTIM',8)
EXEC DBMS_STATS.SET_SYSTEM_STATS('MREADTIM',10)
EXEC DBMS_STATS.SET_SYSTEM_STATS('MAXTHR',19181568)

Oddly, changing the system statistics do not change the “optimizer environment”, and thus you will likely need to force a hard parse of the SQL statement by flushing the shared pool, enabling a 10046 trace, adding a comment/whitespace change, or re-collecting the table statistics with NO_INVALIDATE set to FALSE.

This is one of the interesting problems with test cases – not disclosing all of the influencing parameters/variables that are necessary to reproduce the test results could result in people producing entirely different results. But without the test case script, you might not have any clue that a different result could be achieved than what was described, and thus the need for test case scripts.

13 03 2012
Mich talebzadeh

Great stuff Charles

Going back to the otiginal question. “Does the Oracle Optimizer use an index scan with NOTEQUAL in the predicate”, then the answer would be yes it can use index scan if conditions are satisfied. I understood from Jonathan’s response “– Oracle can use indexes to satisfy “not equals”. as being *it can use index with a hint*. However, here in uses it without a hint and it chooses over a tablescan.Is this a coprrect deduction?

Thanks

13 03 2012
Charles Hooper

Mich,

Oracle can use indexes to satisfy “not equals”. as being *it can use index with a hint*. However, here in uses it without a hint and it chooses over a tablescan.Is this a coprrect deduction?

Yes – your statement agrees with the finding above.

Honestly, when I set up the test case, I was expecting the first half of what you mention above. When I saw Jonathan’s comment, I realized that the test case had a second possible outcome that I had not expected. I could have (probably should have/intended to) also selected column C3 to eliminate the possibility of not needing to visit the table blocks for the query. Sometimes mistakes work out in one’s favor… in this case showing another possible reason why the book author’s (absolutely stated) statement might at times be incorrect.

27 02 2012
Mich talebzadeh

Hi

My understanding in general is that are not optimizable. In other words with NOT EQUAL, we are basically suggesting we are interested in the vast majority of possible values with the exception of the value specified in the NOT EQUAL condition. The CBO has that costing in to ignore the index (I believe)

Now I can naively rewrite Charles query as (apologies table is now called T6)

select c1,c2 from t6 where C2 < 0  OR c2 > 0;

        C1         C2
---------- ----------
         1          1

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 1695387567

------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |     2 |    32 |     8   (0)| 00:00:01 |
|   1 |  CONCATENATION               |           |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T6        |     1 |    16 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | IND_T6_C2 |     1 |       |     3   (0)| 00:00:01 |
|   4 |   TABLE ACCESS BY INDEX ROWID| T6        |     1 |    16 |     4   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN          | IND_T6_C2 |     1 |       |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("C2"0)
       filter(LNNVL("C2"<0))

Obviously it can only do two index range scans and pretty fast. Now if I compare this with select c1,c2 from t6 where c2 <> 0 Iget

  1* select c1,c2 from t6 where c2 <> 0
scratchpad@MYDB.MICH.LOCAL> /

        C1         C2
---------- ----------
         1          1

Elapsed: 00:00:00.19

Execution Plan
----------------------------------------------------------
Plan hash value: 1930642322

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    16 | 10504   (1)| 00:02:07 |
|*  1 |  TABLE ACCESS FULL| T6   |     1 |    16 | 10504   (1)| 00:02:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C2" <> 0)

The result with c2 <> 0 involving two index range scans is much faster.

Mich

Modified the first query per Mich’s request below. CH Feb 27, 2012

27 02 2012
Mich talebzadeh

my query above should read

 select c1,c2 from t6 where c2  0;  where C2 is less than 0  OR c2 is greater than 0

        C1         C2
---------- ----------
         1          1

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 1695387567

------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |     2 |    32 |     8   (0)| 00:00:01 |
|   1 |  CONCATENATION               |           |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T6        |     1 |    16 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | IND_T6_C2 |     1 |       |     3   (0)| 00:00:01 |
|   4 |   TABLE ACCESS BY INDEX ROWID| T6        |     1 |    16 |     4   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN          | IND_T6_C2 |     1 |       |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("C2"0)
       filter(LNNVL("C2"<0))
28 02 2012
Charles Hooper

Jonathan’s comment is important – just because my test case results demonstrate one particular behavior – that single result is often insufficient to generalize the result to an unconditional statement of what will happen when provide a particular input. Simply altering the MBRC system statistic (and flushing the execution plan from memory) was sufficient in this case to change the execution plan from a full table scan to an index hash join.

Below are my review notes for the two quotes (these review notes were written before this blog article was posted):
Part 1:
The book states, “If we do an FTS [full table scan], database buffers are used to read all the table data, and this situation may lead to flushing the buffer cache data to make room for the FTS data. To avoid this situation and to limit the consequences on the database buffer cache, the database blocks from FTS operations are put on the top of the LRU (Least Recently Used) list.” This statement requires significant adjustment before it is an accurate statement. Is the author describing the behavior of Oracle Database 8.1 or Oracle Database 11.2 as indicated on the front cover of the book? What is meant by the “top of the LRU list” – is that the MRU (most recently used) end? If the author meant that the blocks were placed on the least recently used end of the LRU list, then the author agrees with the Oracle Database 10.2 documentation (http://docs.oracle.com/cd/B19306_01/server.102/b14220/memory.htm#sthref1280), but that documentation is incorrect (the behavior changed around the time 9.0.1 was released, if my notes are correct). The Oracle Database 11.2 documentation (http://docs.oracle.com/cd/E11882_01/server.112/e25789/memory.htm#BABCCJCB) states that blocks read by full table scan are placed at the mid-point of the LRU list (if the CACHE keyword is specified when the table is created or altered, then the table blocks will be placed on the MRU end of the list). Since the book is specifically about Oracle Database 11.2, it is worth pointing out that since the release of Oracle Database 11.1, the Oracle RDBMS often makes use of serial direct path read when performing full table scans, and that type of access completely avoids reading the table blocks into the buffer cache (the blocks are read into the PGA). Oracle event 10949 may be used to disable serial direct path read. What about parallel full table scans of larger tables? Those too will avoid flooding the buffer cache with blocks that may only be accessed once. Smaller tables will certainly have their blocks read into the buffer cache, but the touch count associated with each of the table’s blocks will limit the problems that those blocks will cause in the buffer cache (reference https://forums.oracle.com/forums/thread.jspa?threadID=2189618).

Part 2:
The book states, “The answer is simple – indexes cannot be used when we compare values with a not equal operator.” To clarify, the book is describing a situation involving a single table, when the only predicate on the indexed column is the inequality comparison. Adding an INDEX hint to the author’s sample SQL statement results in an INDEX FULL SCAN operation – the hint allows the SQL statement containing the not equal predicate to use the MY_CUSTOMERS_IXVALID index. It is not the case that the index cannot be used with an inequality comparison, however, the query optimizer does not automatically consider an access path involving the index due to the general assumption that many rows will be returned when all except one value is selected. To avoid the INDEX FULL SCAN operation, where the index structure is read one block at a time, the inequality could be converted to a less than predicate and a greater than predicate with a logical OR between the two predicates (reference http://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/).

28 02 2012
Mich talebzadeh

Just to address part 1, I think was the author is mentioning is rather confused. Even before 9.2 there was no such thing as “are put on the top of the LRU (Least Recently Used) list”. I did some drawings on this for pre 9.2 and post 9.2 but cannot load it. The behaviour pre 9.2 can be stated as:

Until release 9.2, Oracle used an MRU/LRU algorithm. Blocks within the buffer cache were ordered from MRU (most recently used) blocks to LRU (least recently used) blocks. Whenever a block was accessed, the block went to the MRU end of the list, thereby shifting the other blocks down towards the LRU end. When a block was read from disk and when there was no buffer available in the db buffer cache, one block in the buffer cache had to “leave”. It was the block on the LRU end in the list. Blocks read during a full table scan were placed on the LRU side of the list instead of the MRU side. As an example if the buffer cache consisted of 100,000 blocks and a table of size 200,000 blocks was full table scanned, accessing every block, the blocks would be read into memory in a batch size defined by the parameter DB_FILE_MULTIBLOCK_READ_COUNT. These blocks would go immediately to the least used end of the MRU/LRU chain thereby preventing a large table scan from overwriting the entire buffer cache. Instead, only a small number of blocks (i.e. 2 to 32) would be overwritten leaving the remaining cached table blocks from other segments still in cache. Each pool’s LRU is divided into a hot area and a cold area. Accordingly, buffers within the hot area are hot buffers and the ones in the cold are called cold buffers. By default, 50% of the buffers belong to the cold area and the other 50% belong to the hot area. A newly read single data block will be inserted between the cold and the hot area such that it belongs to the hot area. This is called midpoint insertion. However, this is only true for single block reads. Multi block reads such as in table scan, were placed at the LRU end, read and discarded without being promoted.

29 02 2012
Mich talebzadeh

With regard to part 2, what the book states “The answer is simple – indexes cannot be used when we compare values with a not equal operator.”

It is NOT that indexes cannot be used. It is more to do with the fact that the optimizer costing does not consider index with a NOT EQUAL operator. However, if the data is not evenly spread (skewed) for the predicate and the NOT EQUAL condition only satisfies a small proportion of the rows then an index with an index hint can be successfully used.

1 03 2012
Charles Hooper

Mich,

Well stated. Richard Foote’s blog agrees with your comment:

http://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/

1 03 2012
Mich talebzadeh

I know at least two engines (Orace and Sybase) that have optimizer costing ignoring NOT EQUAL condition and I did both tests In Oracle and Sybase using the same table and index ensuring that only 10% of data say satified the condition you will get with != sign.

Just to highlight the point

1> select n1,count(n1) from t5 group by n1
2> go
 
 n1
 --- -----------
   0       98400
   1         390
   2         369
   3         417
   4         424

Now if I try a default optimizer behaviour Iget

 

4> select padding from t5 where n1 != 0 (not equal to 0)
5> go
   
STEP 1
        The type of query is SELECT.

        1 operator(s) under root

       |ROOT:EMIT Operator (VA = 1)
       |
       |   |SCAN Operator (VA = 0)
       |   |  FROM TABLE
       |   |  t5
       |   |  Table Scan.
       |   |  Forward Scan.
       |   |  Positioning at start of table.
       |   |  Using I/O Size 64 Kbytes for data pages.
       |   |  With LRU Buffer Replacement Strategy for data pages.


Total estimated I/O cost for statement 2 (at line 4): 28704.

Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Table: t5 scan count 1, logical reads: (regular=3592 apf=0 total=3592), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total actual I/O cost for this command: 7184.
Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 6 ms.

However, if I force it to use index t5_i1 with columns n1, ind_pad, n2

select padding from t5 (index t5_i1) where n1 != 0

it will come back

    STEP 1
        The type of query is SELECT.

        1 operator(s) under root

       |ROOT:EMIT Operator (VA = 1)
       |
       |   |SCAN Operator (VA = 0)
       |   |  FROM TABLE
       |   |  t5
       |   |  Index : t5_i1
       |   |  Forward Scan.
       |   |  Positioning at index start.   <-- using index here
       |   |  Using I/O Size 64 Kbytes for index leaf pages.
       |   |  With LRU Buffer Replacement Strategy for index leaf pages.
       |   |  Using I/O Size 64 Kbytes for data pages.
       |   |  With LRU Buffer Replacement Strategy for data pages.


Total estimated I/O cost for statement 2 (at line 4): 37111.

Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Table: t5 scan count 1, logical reads: (regular=1515 apf=0 total=1515), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total actual I/O cost for this command: 3030.
Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 1 ms.

Apologies for using non Oracle in this case and I trust I have not been pedantic. Just wanted to emphasise a general optimizer costing shared between these two engines. The clue lies in the difference between “Total estimated I/O cost” of 37,111 with index versus 28,704 with table scan. So the CBO will always assume that using index is going to be more expensive than table scan (in this case). In terms of actual I/O costing, we have 3,030 with index versus 7,184 with table scans.That only happened because we told the optimizer to use the index and we knew about the spread of the data.

Cheers,

Mich

2 03 2012
Charles Hooper

MIch,

Thank you for running the test on Sybase.

Now you have me curious about something regarding Sybase. Without a histogram, Oracle’s optimizer would not know that roughly 10% of the rows were not equal to 0 – is there an equivalent to Oracle’s histograms on Sybase?

As a person will (almost) no knowledge of Sybase, the estimated cost of the index access path at 37111 is a bit higher than the estimated cost of the full table scan at 28704 – is there a way to artifiically lower the calculated cost of an index access path on Sybase, similar to the effects of the OPTIMIZER_INDEX_COST_ADJ parameter on Oracle Database (Sybase seems to be counting individual blocks when calculating costs, while Oracle’s optimizer costs multi-block reads), so that you could see if Sybase might actually select the index access path without a hint?

6 03 2012
Mich talebzadeh

Hi Charles,

I wanted to do some further tests to ensure the reason why after having 90% of data in my table != 0, Sybase was not using the index. Of course Sybase using CBO has to rely on histogram (table level and column level stats, ) and various cluster ratios to give an estimate of what it expects.

So first we have the following composition of distrubution of data in t5

1> select n1, count(n1) Occurance from t5 group by n1
2> go
 n1  Occurance
 --- -----------
   0       92011
   1        2063
   2        1985
   3        1973
   4        1968

(5 rows affected)

OK I updated stats as usual for table and all indexed columns

1> update index statistics t5
2> go

Now the problem i(as you also alluded to it) was the estimate of physical I/Os for using the index including large I/O and the asociated CPU overhead! The index scan with the (n1 != 0) predicate eliminated majority of the need to read the data pages However, the optimizer’s cost estimate could not be 100% accurate here due to the “random” mapping between index rows and data pages (blocks). Also it had to take into account another parameter called “data page cluster ratio” (dpcr)
For an index, dpcr measures the effectiveness of large I/O for accessing the table using this index. In this case after 90% data update, this ratio went down (1 is good 0 is bad, see below). So the optimizer overestimated the index plan (IO and CPU overhead) and favourd a table scan! I updated 90% of data but we did not coalesce and rebuild index. Much like Oracle an index rebuild should reduce index range scan I/O: So I checked the cluster ratio before and after I did “reorg rebuild” in sybase on that table

WITHOUT reorg rebuild after 90% of row updates, I had:

dpcr
—————————
0.461519

With REORG REBUILD this was improved to

dpcr
—————————
0.944368

Then I ran the query without any hint or anything


select padding from t5 (index 0) where n1 != 0
go

    STEP 1
        The type of query is SELECT.

        1 operator(s) under root

       |ROOT:EMIT Operator (VA = 1)
       |
       |   |SCAN Operator (VA = 0)
       |   |  FROM TABLE
       |   |  t5
       |   |  Index : t5_i1
       |   |  Forward Scan.
       |   |  Positioning at index start.
       |   |  Using I/O Size 64 Kbytes for index leaf pages.
       |   |  With LRU Buffer Replacement Strategy for index leaf pages.
       |   |  Using I/O Size 64 Kbytes for data pages.
       |   |  With LRU Buffer Replacement Strategy for data pages.


Total estimated I/O cost for statement 1 (at line 1): 17485.

Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.

==================== Lava Operator Tree ====================


            Emit
            (VA = 1)
            r:7989 er:7990
            cpu: 100


 /
IndexScan
t5_i1
(VA = 0)
r:7989 er:7990
l:973 el:929
p:124 ep:130

============================================================
Table: t5 scan count 1, logical reads: (regular=973 apf=0 total=973), physical reads: (regular=8 apf=116 total=124), apf IOs used=115
Total actual I/O cost for this command: 5046.
Total writes for this command: 0

Execution Time 1.
Adaptive Server cpu time: 100 ms.  Adaptive Server elapsed time: 41 ms.
 padding
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(7989 rows affected)

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 43 ms.

It just used the index. In actual fact it did 973 logical reads. There were 124 physical reads of which 8 were 8k reads and 116 were (apf) Asynchronous Pre Fetch (multi block reads). In this case these were 64k index page reads. Sybase uses a unit-less wighting factor IO cost = All physical IOs X 25 + All logical IOs X 2 + CPU cost X 0.1. Essentially it is a weighting factor applied to the (estimated) logical and physical IO done by the query plus CPU overhead. In this case 2 units for a logical IO, 25 units for a physical IO and 0.1 unit for CPU. The assumption is that data has to be read in from disk and physical IO may include large page reads.

So the estimate was (from trace file, yes it is as exciting as Oracle’s raw trace file!)

( PopRidJoin cost:17485.48 T(L928.5442,P129.2677,C123967) O(L263.5442,P45.13951,C23967) props: [{}] ( PopIndScan cost:13433.21 T(L665,P84.12823,C100000) O(L665,P84.12823,C100000) props: [{}] Gti1( t5_i1 ) Gtt0( t5 ) ) cost:13433.21 T(L665,P84.12823,C100000) O(L665,P84.12823,C100000) order: none
) cost:17485.48 T(L928.5442,P129.2677,C123967) O(L263.5442,P45.13951,C23967) order: none

PopRidJoin refers to the physical operator (Pop) joining the index to the table through the Row ID. This follows the index scan operator PopIndScan. Much like Oracle, each operator’s output is the input to its parent operator. Here PopRidJoin is the parent operator of PopIndScan that provides the RID. You use the index to get to the rows via rowID and then in each row you get the column “padding”. The 0(L.., P.., C..) is the logical I/O, Physical I/O and CPU numbers for the top operator under serial costing.

This model seems to overestimate the CPU costing! In total it puts cost at 17485.48. With 928 logical I/O, 129 Physical I/O and 123967 CPU usage. That is IO cost = 928 * 2 + 129 * 25 + 123967 * 0.1 = 17477.7, roughly what it gets (17485). The optimizer considers table scan as well. But the cost is higher now

        ( PopTabScan t5 ) cost:28174.8 T(L3226,P405,C115978) O(L3226,P405,C115978) order: none

So table scan estimsated cost is now 28174.8 compared to index usage of 17485). Table scan estimate costing is 3226 logical I/O with 405 Physical I/O plus 115978 CPU usage

Back to index usage, in practice it uses no CPU time at all. It says

Table: t5 scan count 1, logical reads: (regular=973 apf=0 total=973), physical reads: (regular=8 apf=116 total=124), apf IOs used=115
Total actual I/O cost for this command: 5046.
Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 43 ms.

Use that formula again I get 973 * 2 + 124 * 25 + 0 *.1 = 5046.0 which is what it says.

Rather long winded but worth exploring. Now the challenge for me is to make my Oracle t5 table to use index without any hint or costing adjustment.

Cheers,

Mich

7 03 2012
Mich talebzadeh

I am trying to do the same with the same base table in Oracle. That is making it to use the index once 90% of n1 values are set to 0. just to catchup

CREATE TABLE T5
                (
                        n1 NUMBER(5)    NOT NULL,
                        ind_pad VARCHAR2(40) NOT NULL,
                        n2 NUMBER(5)    NOT NULL,
                        small_vc VARCHAR2(10) NOT NULL,
                        padding  VARCHAR2(200) NOT NULL
                 );

create index t5_i1 on t5(n1, ind_pad, n2)
nologging
pctfree 91;

Now use the following proc to populate t5 with 100,000 rows:

CREATE OR REPLACE PROCEDURE populate_T5_sp
(
        i_rows  IN              NUMBER
)
AS
  v_rowcount                    NUMBER(12) := 0;
  v_T5Id                        NUMBER(37) := NULL;
  e_rows                        EXCEPTION;
  e_insert_T5                   EXCEPTION;
BEGIN
  BEGIN                 -- Input parameter validation
    -- Check for inputs
    IF i_rows <= 0
    THEN
      RAISE e_rows;
    END IF;
  EXCEPTION
    WHEN e_rows THEN
      DBMS_OUTPUT.PUT_LINE('Cannot have null value for no of records to generate!');
  END;
  BEGIN
    dbms_random.seed(0);
    SELECT NVL(MAX(small_vc),0)+1 INTO v_T5Id from T5;
    for v_records IN 1..i_rows LOOP
      INSERT INTO T5
      (
        n1,
        ind_pad,
        n2,
        small_vc,
        padding
      )
      VALUES
      (
        --trunc(dbms_random.value(0,25)),
        trunc(dbms_random.value(0,5)),
        rpad('x',39)||'x',
        trunc(dbms_random.value(0,20)),
        lpad(v_T5Id,10,'0'),
        rpad('x',199)||'x'
      );
      IF SQL%ROWCOUNT = 0 OR SQLCODE  0
      THEN
        RAISE e_insert_T5;
      END IF;
      v_rowcount := v_rowcount + 1;
      v_T5Id := v_T5Id + 1;
      --dbms_output.put_line(chr(10)||'v_T5Id = '|| v_T5Id);
      IF MOD(v_rowcount, 10000) = 0 THEN
        --dbms_output.put_line('v_rowcount = '|| v_rowcount ||' ,committing!');
        COMMIT;
      END IF;
    END LOOP;
    COMMIT;
  EXCEPTION
    WHEN e_insert_T5 THEN
      DBMS_OUTPUT.PUT_LINE('Could not insert into T5');
    WHEN OTHERS THEN
      RAISE_APPLICATION_ERROR(SQLCODE,SQLERRM,true);
  END;
END populate_T5_sp;
/
-- Populate the table with 100,000 rows
exec populate_T5_sp (100000)

PL/SQL procedure successfully completed.

 exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'T5', cascade=>true)

PL/SQL procedure successfully completed.

select n1, count(n1) occurance from t5 group by n1 order by n1;

        N1  OCCURANCE
---------- ----------
         0      19788
         1      19899
         2      20099
         3      20187
         4      20027

Now set 90% of records with n1 = 0

 update t5 set n1 = 0 where rownum  exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'T5', cascade=>true)

PL/SQL procedure successfully completed.

scratchpad@MYDB.MICH.LOCAL> select n1, count(n1) occurance from t5 group by n1 order by n1;

        N1  OCCURANCE
---------- ----------
         0      91939
         1       2016
         2       2052
         3       2024
         4       1969

OK now use trace to see what is going on.

First do the query with no modification

select padding
from
 t5 where n1 != 0


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      539      0.01       0.16       3715       4238          0        8060
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      541      0.01       0.17       3715       4238          0        8060

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 97

Rows     Row Source Operation
-------  ---------------------------------------------------
   8060  TABLE ACCESS FULL T5 (cr=4238 pr=3715 pw=0 time=7799 us cost=1025 size=16320000 card=80000)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                     539        0.00          0.00
  Disk file operations I/O                        3        0.00          0.00
  db file sequential read                         3        0.00          0.00
  db file scattered read                         46        0.00          0.11
  SQL*Net message from client                   539        0.04          0.63
********************************************************************************

Ok 170 ms with 3715 block reads. With 46 pio that roughly works out around 80 multi-block read from table for each pio

Now what if I hint index.

select /*+ index (t5, t5_i1) */ padding from t5 where n1 != 0

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      539      0.02       0.19       2599       9843          0        8061
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      541      0.02       0.19       2599       9843          0        8061

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 97

Rows     Row Source Operation
-------  ---------------------------------------------------
   8061  TABLE ACCESS BY INDEX ROWID T5 (cr=9843 pr=2599 pw=0 time=210730 us cost=6264 size=1694016 card=8304)
   8061   INDEX FULL SCAN T5_I1 (cr=2654 pr=2224 pw=0 time=12610 us cost=1368 size=0 card=8304)(object id 86623)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                     539        0.00          0.00
  Disk file operations I/O                        3        0.00          0.00
  db file scattered read                        294        0.00          0.10
  SQL*Net message from client                   539        0.04          0.50
  db file parallel read                          51        0.00          0.01
  db file sequential read                        35        0.00          0.00

That is almost on par with table scan. So not very useful as table scan outweights index usage

But I know that the leaf level of this index is highly disordered after 90,000 row updates. A rebuild, drop/create of the index may be all that is needed, but not a guarantee. The key value here is efficiency of this index. That is a measure of how many block jumps are taken reading the base table while scanning this index. I guess this is a highly simplified explanation. The point is that these block jumps will be minimal if the ordering nature of the index is closely the same as the base table. Ok I drop and recreate that index again followed by stats update. Then run the quert again with index hint

select /*+ index (t5, t5_i1) */ padding from t5 where n1 != 0

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      539      0.03       0.10       1206       8520          0        8060
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      541      0.03       0.10       1206       8520          0        8060

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 97

Rows     Row Source Operation
-------  ---------------------------------------------------
   8060  TABLE ACCESS BY INDEX ROWID T5 (cr=8520 pr=1206 pw=0 time=82409 us cost=5521 size=1632000 card=8000)
   8060   INDEX FULL SCAN T5_I1 (cr=1334 pr=832 pw=0 time=59532 us cost=804 size=0 card=8000)(object id 86630)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                     539        0.00          0.00
  Disk file operations I/O                        3        0.00          0.00
  db file scattered read                        119        0.00          0.03
  SQL*Net message from client                   539        0.03          0.43
  db file parallel read                          49        0.00          0.01
  db file sequential read                        36        0.00          0.00
********************************************************************************

Ok that is now better. The number of disk reads has gone down from 2,599 blocks (with fragmented index) to 1,206 blocks after index rebuild. With the index hint after rebuil. the elapsed time is down from 170 ms for table scan to 100 ms for index scan.

11 03 2012
Mich talebzadeh

As Charles suggested I tried to adjust the costing to favour index and run my query. This was done as before after updating 90% of rows for n1=0 and dropping and recreating the index t5_i1

ALTER SESSION SET OPTIMIZER_INDEX_COST_ADJ=10;

Session altered.

select padding from t5 where n1 != 0;

8060 rows selected.

Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
Plan hash value: 2002323537

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  8000 |  1593K|  1024   (1)| 00:00:13 |
|*  1 |  TABLE ACCESS FULL| T5   |  8000 |  1593K|  1024   (1)| 00:00:13 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("N1"!=0)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       4238  consistent gets
          0  physical reads
          0  redo size
     139536  bytes sent via SQL*Net to client
       6430  bytes received via SQL*Net from client
        539  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       8060  rows processed

So even at 10% cost adjustment, the optimizer still favours index scan.

So far (AFAIK) we have seen examples where index hint can provide faster results. However, does that mean that without an index hint the optimizer is going to choose FTS with != predicate under all conditions?

Thanks,

Mich

11 03 2012
Charles Hooper

Mich,

Thank you for providing the experiment where you compared Sybase with Oracle Database – a case where an index access is the obvious choice only for Sybase. If I recall correctly, Oracle’s optimizer does not even calculate the cost of the index access path for an inequality – the optimizer just assumes that the index access path will be inefficient.

I believe that Richard Foote covered the inequality scenario here:

http://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/

3 03 2012
Ashok Kumar

Guys you rock here.

Very interesting stuff for learners like me.

Can you please clarify if CBO always makes estimated costing on the basis that the data has to be brought in from persistent storage (disk). This could be single or multi-block reads and would ignore whether some of the data or all of the data is already in the buffer?

Regards

Ashok

3 03 2012
Charles Hooper

Ashok,

The simplified answer to your question is that the query optimizer’s calculated costing assumes that all blocks accessed will require physical reads from disk. Oracle Database provides the OPTIMIZER_INDEX_CACHING parameter to allow the optimizer to make assumptions about index block caching for in-list operations and nested loops operations (see http://docs.oracle.com/cd/E14072_01/server.112/e10820/initparams161.htm ).

Randolf Geist has a couple of articles that may be of interest to you:

http://oracle-randolf.blogspot.com/2011/07/cost-is-time-next-generation.html

http://oracle-randolf.blogspot.com/2009/04/understanding-different-modes-of-system.html

12 03 2012
Mich talebzadeh

Charles,

Thanks for the opportunity to experiment with some additional stuff and learn more.

I made a statement to the effect that “..I know at least two engines (Orace and Sybase) that have optimizer costing ignoring NOT EQUAL condition..”. Not true. One of them honors the index when conditions favour the use of an index. So I was wrong here and I was wrong to state that in my Transact SQL book. Also I tried your original test case “CREATE TABLE T1 AS SELECT ROWNUM C1 ..” on Sybase and that used the index as well. Obviously with Oracle the optimizer just ignores the index by default.

Mich

18 03 2012
talebzadeh

Guys,

I have something that I like someone to shed some lights on it for me.

Just to recap the broad agrrement is that the optimizer can use an index with not equals, but it would not in general and it would not make sense in general unless the data was so massively skewed. Perhaps it would be easier to point out that the optimizer will not use an index range scan for a not equals.

So Optimizer can use an index fast full scan. That is read the index as if it were a table using multiblock IO, does not read it in sorted order, just reads the blocks as it hits them.

If we have above does Oracle do some prefetching of the blocks at all to speed up the procressing? I am coming from this angle that in Sybase we see:

       The type of query is SELECT.

        1 operator(s) under root

       |ROOT:EMIT Operator (VA = 1)
       |
       |   |SCAN Operator (VA = 0)
       |   |  FROM TABLE
       |   |  notequal
       |   |  Index : ind_notequal_c2
       |   |  Forward Scan.
       |   |  Positioning at index start.
       |   |  Index contains all needed columns. Base table will not be read.
       |   |  Using I/O Size 64 Kbytes for index leaf pages.
       |   |  With LRU Buffer Replacement Strategy for index leaf pages.

Total estimated I/O cost for statement 1 (at line 2): 167164.

Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
 c2
 -----------
           1

==================== Lava Operator Tree ====================


            Emit
            (VA = 1)
            r:1 er:2
            cpu: 100


 /
IndexScan
ind_notequal_c2
(VA = 0)
r:1 er:2
l:2569 el:2571
p:381 ep:2481

============================================================
Table: notequal scan count 1, logical reads: (regular=2569 apf=0 total=2569), physical reads: (regular=8 apf=373 total=381), apf IOs used=373

So we have the equivalent of index fast full scan starting at index start and using 64K reads for each physivcal IO. In this case my database is 8k page (think block) and in one IO it does 8 index pages. Index pages are denser than data pages. However, this large IO read serves another purpose. There is a term APF for Asynchronous Pre Fetch or read ahead pages. At the bottom it say physical reads total = 381 of which there were 373 APF and 8 regular. That means that Sybase dispatches the IO request *ahead* of when it is actually needed. So the only time it has to wait for an APF is when it gets to a page that should have been there for an APF and it had not been prefetched yet. So the whole idea of an APF is to asynchronously prefetch pages so that when the execution engine gets to that page, it is a logical IO vs. a physical IO. This is supposed to speed up the process.

I am pretty sure Oracle can do the same. However, I am looking around in the raw trace file for some signal. Jonathan in his blog http://jonathanlewis.wordpress.com/2006/12/15/index-operations/ mentioned “Check the 10053 trace and you will probably see a line like “index prefetching enabled” somewhere. However, I am afraid I had little luck with this.

Aprreciate if there is an input on this.

Mich

18 03 2012
Charles Hooper

Mich,

I believe that your statement is correct with a small adjustment:

it would be easier to point out that the optimizer will not use an index range scan for a not equals [unless hinted to use the index].

So Optimizer can use an index fast full scan. That is read the index as if it were a table using multiblock IO, does not read it in sorted order, just reads the blocks as it hits them.

That statement agrees with my understanding.

If we have above does Oracle do some prefetching of the blocks at all to speed up the procressing?

I am not sure that I understand this statement. During an index fast full scan, the runtime engine will read multiple blocks at a time up to the number of blocks specified for the DB_FILE_MULTIBLOCK_READ_COUNT (the actual number of blocks read in a single read request could be smaller based on extent sizes and blocks that are already in the buffer cache). As far as I am aware, this multi-block read is not considered “prefetching”, which seems to be more commonly associated with batching multiple block reads (for instance during an INDEX FULL SCAN) that would otherwise result in db file sequential reads into either db file scattered reads (logically adjacent blocks) or db file parallel reads (logically non-adjacent blocks) during an index range scan. The number of logically non-adjacent blocks read using a db file parallel read is controlled by the _DB_FILE_NONCONTIG_MBLOCK_READ_COUNT hidden parameter, which seems to default to a value of 11 blocks in 11.2.0.2. If I remember correctly, _DISABLE_INDEX_BLOCK_PREFETCHING may be set to TRUE to disable this optimization, and there may also be an Oracle EVENT that also disables the combining of multiple non-adjacent block reads into a single db file parallel read request. See the following OTN thread for a discussion between Timur Akhmadeev and me regarding index prefetching:

https://forums.oracle.com/forums/thread.jspa?threadID=941864

Jonathan Lewis also nicely summarized prefetching on page 82 of this article:

http://jonathanlewis.files.wordpress.com/2008/07/ls2.pdf

We are able to see prefetching with this example (using the same table and index created at the start of this article):

DROP TABLE T1 PURGE;
 
CREATE TABLE T1 AS
SELECT
  ROWNUM C1,
  DECODE(ROWNUM,1,1,0) C2,
  LPAD('A',255,'A') C3
FROM
  DUAL
CONNECT BY
  LEVEL<=1000000;
 
CREATE UNIQUE INDEX IND_T1_C1 ON T1(C1);
CREATE INDEX IND_T1_C2 ON T1(C2);
 
ALTER TABLE T1 MODIFY (C1 NOT NULL, C2 NOT NULL);
 
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 254')
 
ALTER SYSTEM FLUSH BUFFER_CACHE;

I will process the following query in the DBMS Xplan viewer in my Hyper-Extended Oracle Performance Monitor (http://hoopercharles.wordpress.com/2012/03/12/thoughts-on-a-hyper-extended-oracle-performance-monitor-beta/ ) with an enabled 10046 and 10053 trace:

SELECT /*+ INDEX(T1 IND_T1_C1) */
  C1,
  C2,
  SUBSTR(C3,1,10) C3
FROM
  T1
WHERE
  C1 BETWEEN 1 AND 10000;

The execution plan output follows:

SELECT /*+ INDEX(T1 IND_T1_C1) */
  C1,
  C2,
  SUBSTR(C3,1,10) C3
FROM
  T1
WHERE
  C1 BETWEEN 1 AND 10000;
 
SQL_ID  5fg6g9cyg1hht, child number 0
-------------------------------------
SELECT /*+ INDEX(T1 IND_T1_C1) */    C1,    C2,    SUBSTR(C3,1,10) C3  
FROM    T1  WHERE    C1 BETWEEN 1 AND 10000
 
Plan hash value: 683303157
 
------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  10000 |00:00:00.06 |     599 |    448 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |      1 |   9998 |  10000 |00:00:00.06 |     599 |    448 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C1 |      1 |   9998 |  10000 |00:00:00.01 |     122 |     40 |
------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
--------------------------------------------------- 
   2 - access("C1">=1 AND "C1"<=10000)

The 10046 portion of the trace file includes the following, which demonstrates prefetching of index and table blocks:

PARSING IN CURSOR #424214864 len=114 dep=0 uid=64 oct=3 lid=64 tim=1009009341711 hv=1022411289 ad='3eb8c0770' sqlid='5fg6g9cyg1hht'
SELECT /*+ INDEX(T1 IND_T1_C1) */ 
  C1, 
  C2, 
  SUBSTR(C3,1,10) C3 
FROM 
  T1 
WHERE 
  C1 BETWEEN 1 AND 10000
END OF STMT
PARSE #424214864:c=31200,e=27003,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=683303157,tim=1009009341710
EXEC #424214864:c=0,e=36,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=683303157,tim=1009009341844
WAIT #424214864: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=523 tim=1009009341900
WAIT #424214864: nam='db file scattered read' ela= 634 file#=8 block#=33056 blocks=8 obj#=103840 tim=1009009342618    --- index
WAIT #424214864: nam='db file scattered read' ela= 639 file#=8 block#=68720 blocks=8 obj#=103840 tim=1009009343432    --- index
WAIT #424214864: nam='db file scattered read' ela= 616 file#=7 block#=99176 blocks=8 obj#=103839 tim=1009009344176    --- table
FETCH #424214864:c=0,e=2442,p=24,cr=7,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009344373
WAIT #424214864: nam='SQL*Net message from client' ela= 792 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009345209
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009345270
WAIT #424214864: nam='db file scattered read' ela= 499 file#=8 block#=216 blocks=8 obj#=103839 tim=1009009345865
FETCH #424214864:c=0,e=770,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009346013
WAIT #424214864: nam='SQL*Net message from client' ela= 496 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009346539
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009346591
FETCH #424214864:c=0,e=212,p=0,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009346778
WAIT #424214864: nam='SQL*Net message from client' ela= 583 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009347388
WAIT #424214864: nam='SQL*Net message to client' ela= 0 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009347426
WAIT #424214864: nam='db file scattered read' ela= 544 file#=7 block#=99184 blocks=8 obj#=103839 tim=1009009348050    --- table
FETCH #424214864:c=0,e=741,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009348155
WAIT #424214864: nam='SQL*Net message from client' ela= 545 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009348729
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009348785
FETCH #424214864:c=0,e=127,p=0,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009348898
WAIT #424214864: nam='SQL*Net message from client' ela= 484 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009349402
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009349453
WAIT #424214864: nam='db file scattered read' ela= 520 file#=8 block#=224 blocks=8 obj#=103839 tim=1009009350076      --- table
FETCH #424214864:c=0,e=749,p=8,cr=7,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009350179
WAIT #424214864: nam='SQL*Net message from client' ela= 515 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009350717
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009350761
FETCH #424214864:c=0,e=130,p=0,cr=5,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009350878
WAIT #424214864: nam='SQL*Net message from client' ela= 484 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009351383
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009351438
WAIT #424214864: nam='db file scattered read' ela= 481 file#=7 block#=99192 blocks=8 obj#=103839 tim=1009009352017    --- table
FETCH #424214864:c=0,e=761,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009352176
WAIT #424214864: nam='SQL*Net message from client' ela= 555 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009352760
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009352822
FETCH #424214864:c=0,e=188,p=0,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009352986
WAIT #424214864: nam='SQL*Net message from client' ela= 648 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009353661
WAIT #424214864: nam='SQL*Net message to client' ela= 0 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009353738
WAIT #424214864: nam='db file scattered read' ela= 448 file#=8 block#=15816 blocks=8 obj#=103839 tim=1009009354243    --- table
FETCH #424214864:c=0,e=667,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009354389
WAIT #424214864: nam='SQL*Net message from client' ela= 583 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009354995
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009355052
FETCH #424214864:c=0,e=162,p=0,cr=7,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009355189
WAIT #424214864: nam='SQL*Net message from client' ela= 519 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009355730
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009355799
WAIT #424214864: nam='db file scattered read' ela= 620 file#=8 block#=15824 blocks=8 obj#=103839 tim=1009009356510    --- table
FETCH #424214864:c=0,e=990,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009356763
WAIT #424214864: nam='SQL*Net message from client' ela= 636 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009357433
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009357514
FETCH #424214864:c=0,e=233,p=0,cr=5,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009357724
WAIT #424214864: nam='SQL*Net message from client' ela= 633 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009358392
WAIT #424214864: nam='db file scattered read' ela= 507 file#=8 block#=15832 blocks=8 obj#=103839 tim=1009009359053    --- table
WAIT #424214864: nam='SQL*Net message to client' ela= 0 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009359096
FETCH #424214864:c=0,e=702,p=8,cr=5,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009359217
WAIT #424214864: nam='SQL*Net message from client' ela= 617 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009359864
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009359910
FETCH #424214864:c=0,e=114,p=0,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009360008
WAIT #424214864: nam='SQL*Net message from client' ela= 551 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009360580
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009360640
WAIT #424214864: nam='db file scattered read' ela= 522 file#=8 block#=15840 blocks=8 obj#=103839 tim=1009009361276    --- table
FETCH #424214864:c=0,e=819,p=8,cr=7,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009361433
WAIT #424214864: nam='SQL*Net message from client' ela= 724 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009362181
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009362263
WAIT #424214864: nam='db file scattered read' ela= 420 file#=8 block#=15848 blocks=8 obj#=103839 tim=1009009362900    --- table
FETCH #424214864:c=0,e=707,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009362946
WAIT #424214864: nam='SQL*Net message from client' ela= 510 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009363479
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009363529
FETCH #424214864:c=0,e=169,p=0,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009363671
WAIT #424214864: nam='SQL*Net message from client' ela= 548 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009364300
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009364357
WAIT #424214864: nam='db file scattered read' ela= 487 file#=8 block#=15856 blocks=8 obj#=103839 tim=1009009364997    --- table
FETCH #424214864:c=0,e=708,p=8,cr=6,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009365040
WAIT #424214864: nam='SQL*Net message from client' ela= 525 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009365591
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009365650
FETCH #424214864:c=0,e=207,p=0,cr=5,cu=0,mis=0,r=100,dep=0,og=1,plh=683303157,tim=1009009365831
WAIT #424214864: nam='SQL*Net message from client' ela= 506 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009366369
WAIT #424214864: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=103839 tim=1009009366423
WAIT #424214864: nam='db file scattered read' ela= 330 file#=8 block#=33064 blocks=8 obj#=103840 tim=1009009366871    --- index
WAIT #424214864: nam='db file scattered read' ela= 440 file#=8 block#=15864 blocks=8 obj#=103839 tim=1009009367425    --- table

I did not find the words “index prefetching”, or “prefetch” in the 10053 portion of the trace file, so I suspect that notation has been removed from recent versions of 10053 trace files.

25 03 2012
talebzadeh

Hi Charles,

Thank you for your explanation and example. I think multi-block IO *is* prefetching by definition. So I still have this point if I may

Now wiyh multi-block IO by time the query engine gets to that block it is a kind of logical IO as opposed to physical IO?

That means that Oracle dispatches the IO request *ahead* of when it is actually needed. So the only time it has to wait for a prefetched block is when it gets to a block that should have been there for multi-block IO and it had not been prefetched yet. So the whole idea of a multi-block IO is to asynchronously prefetch blocks so that when the query engine gets to that block, it is a logical IO vs. a physical IO?

Mich

27 03 2012
Charles Hooper

Mich,

Sorry for the delay in response.

After reading the following blog article, I am a bit more inclined to agree with your statement that multi-block IO is prefetching by definition:

http://oracle-randolf.blogspot.com/2011/07/logical-io-evolution-part-2-9i-10g.html

However, by extension of the same definition (“multi-block IO is prefetching by definition”), reading an entire 8KB database block (when only the first row in the block is currently of interest) could also be considered “prefetching by definition”. ;-)

Your second point might need a little clarification because logical IOs may or may not result in physical IOs. It might be more accurate to state that multi-block reads help reduce the average time to access each block when physical block reads are required – this is especially true if hard drive heads must be repositioned (or the required delay of waiting for the hard drive platter to spin around again – 0.004 seconds for a 15,000 RPM drive). With a multi-block read (when physical hard drives are used, with no built-in read ahead cache), most of the latency in the multi-block physical read (assuming a db file scattered read or direct path read) occurs when accessing the first block, while the rest of the blocks accessed in the single read request are obtained *mostly* with very little additional latency.

Some hard drives (and drive controllers) have special algorithms to reduce hard drive head movement (and rotational) latency when multiple block requests are in the read/write queue (sometimes called native command queuing). It seems that non-logically-adjacent blocks that are prefetched with DB FILE PARALLEL READ through a prefetch operation are designed to leverage the read/write command queuing algorithms to reduce the average physical block read time: http://docs.oracle.com/cd/E11882_01/server.112/e17110/waitevents003.htm#autoId30

There are probably better (and more accurate) ways to state the above.

28 03 2012
Mich Talebzadeh

Charles,

Many thanks as usual for your analysis.

It seems that I am hearing more gurus using the term multi-block IO is prefetching and I believe every DBMS that does prefetching does so asynchronously. So I wish we could use asynchronous prefetch as the naming convention :-)
So we have now established that with full table scans and fast full index scans, Oracle will deploy prefetching where it can. So there is obviously a part of execution engine that does the prefetching and there is another part that does the query work?

When the query part of the execution engine (is this correct?) gets to the block needed and it is not there, then it has to wait. That block could already be a dirty buffer. In that case, I assume that it will have to go to undo to get that block which will incur an additional physical io?

Mich

2 04 2012
Charles Hooper

Mich,

Sorry again for the delayed response – I have been preoccupied with non-Oracle Database items lately.

I am not sure that I would classify prefetching as an asynchronous operation – that could cause a bit of confusion. Asynchronous I/O is controlled by the DISK_ASYNCH_IO parameter (and the FILESYSTEMIO_OPTIONS parameter):

http://docs.oracle.com/cd/E11882_01/server.112/e10803/config_db.htm#BGBIHHBD

As shown at the above web page, the V$IOSTAT_FILE view may be queried to determine if asynchronous I/O is enabled for various file types. A slightly modified version of the SQL statement executed on a 11.2.0.2 Windows database instance:

SELECT DISTINCT
  FILETYPE_NAME,
  ASYNCH_IO
FROM
  V$IOSTAT_FILE
ORDER BY
  FILETYPE_NAME;
 
FILETYPE_NAME                ASYNCH_IO
---------------------------- ---------
Archive Log                  ASYNC_OFF
Archive Log Backup           ASYNC_OFF
Control File                 ASYNC_OFF
Data File                    ASYNC_ON
Data File Backup             ASYNC_OFF
Data File Copy               ASYNC_OFF
Data File Incremental Backup ASYNC_OFF
Data Pump Dump File          ASYNC_OFF
Flashback Log                ASYNC_OFF
Log File                     ASYNC_OFF
Other                        ASYNC_OFF
Temp File                    ASYNC_ON

The primary reason why I do not consider prefetching to be an asynchronous operation is because the Oracle process requires one or more of the requested blocks from disk in order to continue processing (unless there are exceptions that I am not considering) – processing for the process halts until the required block is read from disk. Some people will argue that reading a block from memory is 1000 times faster than reading the same block from disk (others will argue that it is closer to 100 times faster due to all of the code that executes during a logical read) – this speed difference, coupled with the fact that something provoked the Oracle process to request the pre-fetched blocks from disk (maybe one of the 16 blocks read is known to be required), means that the session’s process likely is not number crunching asynchronously while a block read request completes.

It is not sufficient to state that dirty buffers require undo to be applied – a dirty buffer is a block that was modified since being read from disk. The data block could have been written to the datafiles when there were uncommitted changes in the block. Undo is stored in a tablespace, just like the regular table and index blocks, and the undo blocks also share the buffer cache with the regular table and index blocks. The requirement to apply undo to logically roll a block back to a specific SCN does not necessarily require an additional physical I/O.

9 04 2012
Mich Talebzadeh

Hi Charles,

Many thanks for your explanation. I have come to conclusion that your point about prefetch not being asynchronous is valid, although I still think with full table scan and index fast full scan where Oracle reads the index as if it were a table using multiblock IO, does not read it in sorted order there is some asynchronous prefteching going on.

I was actually looking at an issue of deadlock on indexes with two procs running at the same time and browsing Steve Adams internal book on latches etc and came across an statement that with STATISTICS_LEVEL = ALL, there are far more latches that Oracle takes. In contrast, it appears that when the parameter is set to TYPICAL (the default), the rowsource/plan execution statistics are disabled. With statistics level = all, a typical query runs many times slower and I gather it is due to system overheads. Although I have not seen any particular explanation for it except that do not use STATISTICS_LEVEL = ALL unless you have to!

Cheers,

Mich

9 04 2012
Charles Hooper

Mich,

I had heard previously that most of the overhead associated with setting the STATISTICS_LEVEL to all is associated with system calls to obtain the current time of day (gettimeofday). This particular issue is mentioned in Metalink (MOS) Doc ID 436797.1, and seems to be more severe prior to Oracle Database 11.1. There is a summary in the “Expert Oracle Practices” book on page 313 that demonstates the impact of the TYPICAL and ALL values of STATISTICS_LEVEL for a SQL statement that performs a large number of nested loops joins:

http://books.google.com/books?id=tUNtyMCwDWQC&pg=PA313

I will have to look into the increase of of the number of latch gets with STATISTICS_LEVEL=ALL – I had not considered latching as a potential threat.

The deadlock on indexes that you mention – might that just be a sign of two sessions trying to insert rows with the same primary key value into two different tables?

9 04 2012
Mich Talebzadeh

Charles,

Your point on “The deadlock on indexes that you mention – might that just be a sign of two sessions trying to insert rows with the same primary key value into two different tables?”

Slighlly more interesting than that. The same proc running with two different sets of parameters trying to do update on the same table with one normal b-tree index. The issue is that the proc was doing multiple updates and depending on the set size (chosen by parameters) at part of the code was doing a table scan and other parts were using the index. THe deadlock was one instance of proc doing a table scan and the other was using the index (much smaller result set).

Steve Adams in his book p 28-29 mentions to the effect that

“Oracle expects latches to be held only briefly and intermittently. Otherwise if the use of latch is not brief or intermittent then there can be contention (remember they do not allow multiple processes to inspect the protected data concurrently). It makes sense to expect that a single Oracle process to hold a number of latches concurrently. Therefore there is a possibility of latching deadlock occurring. One process holding latch A and another thread holding latch B and both processes spinning [on their semaphore] and waiting for the alternate latch. Oracle ensures that this cannot happen by ensuring that latches are always taken in a defined order when more than one latch is needed. To support this, Oracle must maintain different latch levels. When a process attempts to get a latch, a check is made to ensure that it is not already holding a latch of the same level or latch of higher level. A process that needs a higher level latch has to sleep while holding a lower level latch. In this scenario, the lower level latches are held much longer than normal. In general if this happens, it can result in a deadlock.”

Interesting point but still investigating it.

Mich

12 04 2012
Charles Hooper

Mich,

Sorry again for the delay in my response. It seems that I am having a bit of difficulty finding time to respond to comments lately due to a number of non-Oracle related activities. Windows Update is broken on a relative’s computer, and as a side effect, I cannot install the latest version of the virus scanner brand that I have used since 1995. I have been fighting that problem for almost a week. Lots of other non-Oracle computer related things too.

Hanging head in shame I have not yet had a chance to read the book that Steve Adams wrote. I did manage to find his book in the Google Books library:

http://books.google.com/books?id=KG4j7NuHbRsC&pg=PA28

What is shown in the Google Books library seems to indicate that your quote above is a paraphrase of the book – not necessarily a bad paraphrase. For instance, the last sentence on page 29 states ( http://books.google.com/books?id=KG4j7NuHbRsC&pg=PA29 ):

“In general, if this rule is broken, an ORA-600 [504] internal error is raised.”

The description of your problem regarding deadlocking on indexes seems to point a little more directly at enqueues (locks) rather than latches (I could of course be completely wrong). Jonathan Lewis took an advanced look at the differences between enqueues and latches in his recent “Oracle Core” book:

http://books.google.com/books?id=G9AJA91PL54C&pg=PA59

The AskTom site has also covered the differences between latches and enqueues several times (you may not be able to click this link):

http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:10899066672589

In Oracle Database 10.1 some types of latches started to be replaced with mutexes (I think that Tanel Poder mentioned specific latches on his blog), are more were replaced in 10.2, 11.1, and 11.2. What was absolutely true about Oracle Database latching back in 1999, may have changed over the course of the last 13 years.

10 04 2012
Mich Talebzadeh

Index hash join

I am referring to Charles table at the beginning of this thread (table t1). A while back I was testing it. The table I have is practically the same as below

CREATE TABLE t1 AS
SELECT
ROWNUM k1,
DECODE(ROWNUM,1,1,0) c,
LPAD('A',255,'A') c3
FROM
DUAL
CONNECT BY
LEVELUSER,TABNAME=>'T1',CASCADE=>TRUE)

I run in as

ALTER SESSION SET TRACEFILE_IDENTIFIER = 'test_t1_with_trace';
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE ( waits=>true, plan_stat=>'ALL_EXECUTIONS' );
SELECT k1, c FROM t1 WHERE c  0;

tkprof output shows:

SELECT k1, c
FROM
 t1 WHERE c != 0


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.08          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.17       0.35       3911       3930          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.18       0.44       3911       3930          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 97

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  VIEW  index$_join$_001 (cr=3930 pr=3911 pw=0 time=0 us cost=6388 size=8 card=1)
      1   HASH JOIN  (cr=3930 pr=3911 pw=0 time=0 us)
      1    INDEX FAST FULL SCAN IND_T1_C (cr=1826 pr=1818 pw=0 time=0 us cost=2276 size=8 card=1)(object id 87301)
1000000    INDEX FAST FULL SCAN IND_T1_K1 (cr=2104 pr=2093 pw=0 time=622503 us cost=2621 size=8 card=1)(object id 87300)

Now that index IND_T1_K1 is a unique index and we know that IND_T1_C has one row that qualifies for c != 0. However, it chooses the 1,000,000 row resultset as the build stream and 1 row resultset as the probe stream. This dores not look correct unless some role reversal is happening after?

Mich

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

Join 141 other followers

%d bloggers like this: