February 26, 2012 (Modified February 27, 2012)
I found another interesting quote in the “Oracle Database 11gR2 Performance Tuning Cookbook”, this time related to tables and full table scans. This quote is found on page 170 of the book:
“If we do an FTS [full table scan], database buffers are used to read all the table data, and this situation may lead to flushing the buffer cache data to make room for the FTS data. To avoid this situation and to limit the consequences on the database buffer cache, the database blocks from FTS operations are put on the top of the LRU (Least Recently Used) list.”
What, if anything, is wrong (and/or right) with the above quote from the book?
—
Added February 27, 2012:
Part 2:
An additional interesting quote is found on page 176 related to full table scans when indexes are present. The test case that follows is slightly different than what is presented in the book, however the outcome is the same. Consider the following table and indexes (note that histograms will be created on columns C1 and C2, and that column C2 will have a single row with a 1 value and 999,999 rows with a 0 value):
CREATE TABLE T1 AS
SELECT
  ROWNUM C1,
  DECODE(ROWNUM,1,1,0) C2,
  LPAD('A',255,'A') C3
FROM
  DUAL
CONNECT BY
  LEVEL<=1000000;

CREATE UNIQUE INDEX IND_T1_C1 ON T1(C1);
CREATE INDEX IND_T1_C2 ON T1(C2);

ALTER TABLE T1 MODIFY (C1 NOT NULL, C2 NOT NULL);

EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE,ESTIMATE_PERCENT=>100,METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 254')
The test case script:
SET LINESIZE 120
SET PAGESIZE 1000

SELECT C1, C2 FROM T1 WHERE C2=1;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));

SELECT C1, C2 FROM T1 WHERE C2<>0;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));
The (slightly reformatted) output from my execution of the above script on Oracle Database 11.2.0.2:
SQL> SELECT
  2    C1,
  3    C2
  4  FROM
  5    T1
  6  WHERE
  7    C2=1;

        C1         C2
---------- ----------
         1          1

SQL_ID  8fv30tbr8jdds, child number 0
-------------------------------------
SELECT C1, C2 FROM T1 WHERE C2=1

Plan hash value: 236868917

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1        |     1 |     8 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_T1_C2 |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C2"=1)

---

SQL> SELECT
  2    C1,
  3    C2
  4  FROM
  5    T1
  6  WHERE
  7    C2<>0;

        C1         C2
---------- ----------
         1          1

SQL_ID  bu17044puyhkx, child number 0
-------------------------------------
SELECT C1, C2 FROM T1 WHERE C2<>0

Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |  3049 (100)|          |
|*  1 |  TABLE ACCESS FULL| T1   |     1 |     8 |  3049   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C2"<>0)
Notice in the above execution plans that the Oracle query optimizer correctly determined that only 1 row would be returned in both cases, yet in the first case an index was used, and in the second case a full table scan. The book states the following:
“Why did the database optimizer switch back to a long-running FTS operation, instead of the previous Index Range Scan? The answer is simple – indexes cannot be used when we compare values with a not equal operator.”
I have seen the above answer, with slight variations, provided in at least two other books. What, if anything, is wrong (and/or right) with the above quote from the book?
The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.
Hi Charles,
Glad you brought this up.
Starting with 9.2, specifically for table scans, the small window at the end of the LRU ceased to exist. Blocks from full table scans (FTS) now go to the midpoint insertion point. Oracle has a parameter called the small table threshold that allows special treatment of tables smaller than two percent of the size of the buffer cache, although from my own testing I much doubt that the small table threshold holds true.
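For reference, the small table threshold Mich mentions can be inspected through the hidden parameter _small_table_threshold. A sketch only: querying X$ tables requires a SYS connection, and hidden parameter names are version-dependent and unsupported.

```sql
-- Hypothetical inspection query; run as SYS on a test system
SELECT
  I.KSPPINM NAME,
  V.KSPPSTVL VALUE
FROM
  X$KSPPI I,
  X$KSPPCV V
WHERE
  I.INDX = V.INDX
  AND I.KSPPINM = '_small_table_threshold';
```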
I believe dealing with FTS depends on two factors:
1. The default behaviour of the Oracle 11g optimizer, which favours direct path reads when dealing with FTS. Direct path reads are full scans that bypass the buffer cache; data is read directly from disk into the PGA.
2. Turning off direct path reads. This forces conventional path reads (db file scattered reads), and in that case the way Oracle decides to promote table blocks to the hot area differs.
This is a summary of my findings:
• With direct path reads, full table scans of tables below 10% of the buffer cache will have their blocks promoted to the hot area, and the tables will be fully cached. This is around 5 times the small table threshold limit.
• With conventional path reads, and assuming the availability of free buffers, a table of around 95% of the buffer cache size can be cached.
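As Charles's review notes later in the thread mention, Oracle event 10949 may be used to disable serial direct path reads in 11g. A sketch of the session-level toggle for testing conventional path behaviour (unsupported events should only be set on test systems):

```sql
-- Disable serial direct path reads for this session (11g; test systems only)
ALTER SESSION SET EVENTS '10949 TRACE NAME CONTEXT FOREVER, LEVEL 1';

-- Re-enable the default behaviour
ALTER SESSION SET EVENTS '10949 TRACE NAME CONTEXT OFF';
```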
Regards,
Mich
Just to clarify my statement above: with conventional settings and default optimizer behaviour, with FTS, blocks from any table smaller than 10% of the buffer cache will move to the hot area. Anything larger will be aged out (so-called fetch and discard).
Mich
Charles,
For the follow-up point:
With a hint in place I got a plan that used the index (with the necessary full scan, of course).
Interestingly the cost of this plan was lower than the cost of the default plan which, in my case, was an index hash join between the two available indexes. So – the book was right that there was an “obvious” index path that was not taken when it should have been, but it was wrong in that (a) an index path was still taken, and (b) the “obvious” index path was hintable – Oracle can use indexes to satisfy “not equals”.
(The example above actually had ‘less than / greater than’ rather than ‘!=’, but I changed them while copying the text into this comment in case WordPress took the angle brackets as a format command.)
(Just for clarification, the book selected all columns from the table, and only a single index was present on the equivalent of column C2. However, the statement in the book is written as an absolute.)
Jonathan,
It is interesting that the default plan in your case was an index hash join. If I try to force an index hash join, the calculated cost is a bit higher than that of the full table scan:
My test was performed with the following (mostly manually set) system statistics:
Let’s alter that MBRC system statistic to 8, gather statistics again with NO_INVALIDATE set to FALSE and try the unhinted version of the query again:
(<> manually changed to != for posting)
Interesting result… does that count as using the index when an inequality is specified on an indexed column? Richard Foote has an explanation for the FULL TABLE SCAN result.
Hi Jonathan, Charles,
I ran Charles’s original test as follows (note I called the table notequal):
And ran the second query as follows:
The result showed
So it sounds like it uses a hash join between the two indexes to cover the query without touching the base table. However, if we had added c3, it would have had to go to the table.
So it begs the question: are my test results valid?
Mich,
My MBRC system statistic was set to 16 when I put together this blog article, where “TABLE ACCESS FULL” appeared in the execution plan for the != condition. Jonathan typically uses noworkload system statistics (so that his test case results are easier to reproduce) – the MBRC statistic has an effective default value of 8 when using noworkload system statistics. So, I changed the value of my MBRC to see what would happen:
The above change was sufficient to result in the index hash join that Jonathan mentioned.
If you flush the shared pool (or collect object statistics again with NO_INVALIDATE set to FALSE) and then execute the following (my MBRC was set to 16, but I am suggesting that you try 32 instead – I am not sure that this will work):
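The suggestion above might be scripted as follows. A sketch only: the MBRC value of 32 is the trial value mentioned, and the statistics-gathering parameters would need to match the earlier runs.

```sql
ALTER SYSTEM FLUSH SHARED_POOL;

EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC', 32)

-- Alternatively, force invalidation of the cached cursor:
EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER, TABNAME=>'T1', CASCADE=>TRUE, ESTIMATE_PERCENT=>100, NO_INVALIDATE=>FALSE)

SELECT C1, C2 FROM T1 WHERE C2<>0;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'TYPICAL'));
```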
Do you find that the optimizer switches from an index hash join to a full table scan?
Charles
My current MBRC is set to 128
OK, I flushed the shared pool and ran DBMS_STATS.SET_SYSTEM_STATS(‘MBRC’,32).
I get the same!
The same plan as before. So, to your point “Do you find that the optimizer switches from an index hash join to a full table scan?” – no, it does not!
P.S. I tried setting MBRC to 8 but got the same plan.
Mich,
Sorry – I should have stated that if your MBRC system statistic is already greater than 32, then my suggestion is not helpful. I am sure that you are aware of this, but the DB_FILE_MULTIBLOCK_READ_COUNT parameter and the MBRC system statistic have two different effects on costing (Randolf Geist has a couple of good write ups about the effects of the two on costing calculations) – with WORKLOAD system statistics the optimizer will use the MBRC system statistic rather than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value for cost calculations.
What version of Oracle Database are you running? What do you have for the SREADTIM and MREADTIM system statistic:
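A query such as the following shows the current system statistics, including SREADTIM and MREADTIM (a sketch; it assumes SELECT access on SYS.AUX_STATS$, the view Mich posts from later in the thread):

```sql
SELECT
  PNAME,
  PVAL1
FROM
  SYS.AUX_STATS$
WHERE
  SNAME = 'SYSSTATS_MAIN';
```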
There is a bug in 11.2.0.1 and 11.2.0.2 related to the automatic calculations of the SREADTIM and MREADTIM values (mentioned here https://hoopercharles.wordpress.com/2011/05/04/how-to-collect-statistics/ ), and that bug might be affecting your results.
Charles,
Thanks for clarification.
Your points
“What version of Oracle Database are you running? What do you have for the SREADTIM and MREADTIM system statistic:”
Next I ran with the default settings and trace enabled, rebooting the server before the run.
So that is 3,910 physical I/Os and 3,939 logical I/Os in 350 ms.
Now let me try with a tablescan hint after a reboot:
So this one does it with 38,463 physical I/Os and 38,468 logical I/Os in 1,140 ms.
On the face of it the default behaviour (using two indices and hash join) looks a better choice.
Mich,
Thanks for posting the output of SYS.AUX_STATS$. The optimizer for your database is currently using noworkload statistics, and if I am remembering Randolf Geist’s articles correctly, the optimizer will use an effective MBRC system statistic value of 8 if your DB_FILE_MULTIBLOCK_READ_COUNT parameter IS NOT set in the init.ora (or spfile), or a value based on the DB_FILE_MULTIBLOCK_READ_COUNT parameter if that parameter IS set in the init.ora (or spfile). So, if your DB_FILE_MULTIBLOCK_READ_COUNT parameter was autoset to a value of 128, your results should be similar to those indicated by Jonathan above even though you adjusted the MBRC system statistic to a value of 32 using the DBMS_STATS.SET_SYSTEM_STATS procedure.
If you want to see the same results that I observed, you can manually set the same workload system statistics that I used for my test case in this article:
Oddly, changing the system statistics does not change the “optimizer environment”, and thus you will likely need to force a hard parse of the SQL statement by flushing the shared pool, enabling a 10046 trace, adding a comment/whitespace change, or re-collecting the table statistics with NO_INVALIDATE set to FALSE.
This is one of the interesting problems with test cases – not disclosing all of the influencing parameters/variables that are necessary to reproduce the test results could result in people producing entirely different results. But without the test case script, you might not have any clue that a different result could be achieved than what was described, and thus the need for test case scripts.
Great stuff Charles
Going back to the original question, “Does the Oracle optimizer use an index scan with NOT EQUAL in the predicate?”, the answer would be: yes, it can use an index scan if conditions are satisfied. I understood Jonathan’s response “– Oracle can use indexes to satisfy ‘not equals’” as meaning *it can use an index with a hint*. However, here it uses the index without a hint, choosing it over a tablescan. Is this a correct deduction?
Thanks
Mich,
Yes – your statement agrees with the finding above.
Honestly, when I set up the test case, I was expecting the first half of what you mention above. When I saw Jonathan’s comment, I realized that the test case had a second possible outcome that I had not expected. I could have (probably should have/intended to) also selected column C3 to eliminate the possibility of not needing to visit the table blocks for the query. Sometimes mistakes work out in one’s favor… in this case showing another possible reason why the book author’s (absolutely stated) statement might at times be incorrect.
Hi
My understanding in general is that NOT EQUAL predicates are not optimizable. In other words, with NOT EQUAL we are basically saying that we are interested in the vast majority of possible values, with the exception of the value specified in the NOT EQUAL condition. The CBO has that costing built in to ignore the index (I believe).
Now I can naively rewrite Charles’s query as follows (apologies, the table is now called t6):
Obviously it can only do two index range scans, and it is pretty fast. Now if I compare this with select c1,c2 from t6 where c2 <> 0 I get:
The result with c2 <> 0, involving two index range scans, is much faster.
Mich
Modified the first query per Mich’s request below. CH Feb 27, 2012
my query above should read
Jonathan’s comment is important – just because my test case results demonstrate one particular behavior, that single result is often insufficient to generalize into an unconditional statement of what will happen when given a particular input. Simply altering the MBRC system statistic (and flushing the execution plan from memory) was sufficient in this case to change the execution plan from a full table scan to an index hash join.
Below are my review notes for the two quotes (these review notes were written before this blog article was posted):
Part 1:
The book states, “If we do an FTS [full table scan], database buffers are used to read all the table data, and this situation may lead to flushing the buffer cache data to make room for the FTS data. To avoid this situation and to limit the consequences on the database buffer cache, the database blocks from FTS operations are put on the top of the LRU (Least Recently Used) list.” This statement requires significant adjustment before it is an accurate statement. Is the author describing the behavior of Oracle Database 8.1 or Oracle Database 11.2 as indicated on the front cover of the book? What is meant by the “top of the LRU list” – is that the MRU (most recently used) end? If the author meant that the blocks were placed on the least recently used end of the LRU list, then the author agrees with the Oracle Database 10.2 documentation (http://docs.oracle.com/cd/B19306_01/server.102/b14220/memory.htm#sthref1280), but that documentation is incorrect (the behavior changed around the time 9.0.1 was released, if my notes are correct). The Oracle Database 11.2 documentation (http://docs.oracle.com/cd/E11882_01/server.112/e25789/memory.htm#BABCCJCB) states that blocks read by full table scan are placed at the mid-point of the LRU list (if the CACHE keyword is specified when the table is created or altered, then the table blocks will be placed on the MRU end of the list). Since the book is specifically about Oracle Database 11.2, it is worth pointing out that since the release of Oracle Database 11.1, the Oracle RDBMS often makes use of serial direct path read when performing full table scans, and that type of access completely avoids reading the table blocks into the buffer cache (the blocks are read into the PGA). Oracle event 10949 may be used to disable serial direct path read. What about parallel full table scans of larger tables? Those too will avoid flooding the buffer cache with blocks that may only be accessed once. 
Smaller tables will certainly have their blocks read into the buffer cache, but the touch count associated with each of the table’s blocks will limit the problems that those blocks will cause in the buffer cache (reference https://forums.oracle.com/forums/thread.jspa?threadID=2189618).
Part 2:
The book states, “The answer is simple – indexes cannot be used when we compare values with a not equal operator.” To clarify, the book is describing a situation involving a single table, when the only predicate on the indexed column is the inequality comparison. Adding an INDEX hint to the author’s sample SQL statement results in an INDEX FULL SCAN operation – the hint allows the SQL statement containing the not equal predicate to use the MY_CUSTOMERS_IXVALID index. It is not the case that the index cannot be used with an inequality comparison, however, the query optimizer does not automatically consider an access path involving the index due to the general assumption that many rows will be returned when all except one value is selected. To avoid the INDEX FULL SCAN operation, where the index structure is read one block at a time, the inequality could be converted to a less than predicate and a greater than predicate with a logical OR between the two predicates (reference http://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/).
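The two workarounds described above can be sketched against this article’s T1 table (the hint and the rewrite are illustrations against the test schema, not the book’s own MY_CUSTOMERS example):

```sql
-- Force the index despite the inequality; the access path
-- chosen (full scan vs. range scan of the index) depends on statistics:
SELECT /*+ INDEX(T1 IND_T1_C2) */ C1, C2 FROM T1 WHERE C2<>0;

-- Rewrite the inequality as two ranges joined by OR, allowing
-- index range scans without a hint (valid because C2 is NOT NULL):
SELECT C1, C2 FROM T1 WHERE C2<0 OR C2>0;
```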
Just to address part 1: I think what the author is describing is rather confused. Even before 9.2 there was no such thing as blocks being “put on the top of the LRU (Least Recently Used) list”. I did some drawings on this for pre-9.2 and post-9.2 but cannot load them. The behaviour pre-9.2 can be stated as follows:
Until release 9.2, Oracle used an MRU/LRU algorithm. Blocks within the buffer cache were ordered from MRU (most recently used) blocks to LRU (least recently used) blocks. Whenever a block was accessed, the block went to the MRU end of the list, thereby shifting the other blocks down towards the LRU end. When a block was read from disk and there was no buffer available in the db buffer cache, one block in the buffer cache had to “leave” – the block on the LRU end of the list. Blocks read during a full table scan were placed on the LRU side of the list instead of the MRU side. As an example, if the buffer cache consisted of 100,000 blocks and a table of 200,000 blocks was full table scanned, accessing every block, the blocks would be read into memory in a batch size defined by the parameter DB_FILE_MULTIBLOCK_READ_COUNT. These blocks would go immediately to the least recently used end of the MRU/LRU chain, thereby preventing a large table scan from overwriting the entire buffer cache. Instead, only a small number of blocks (i.e. 2 to 32) would be overwritten, leaving the remaining cached blocks from other segments still in the cache.

Each pool’s LRU is divided into a hot area and a cold area. Accordingly, buffers within the hot area are hot buffers and the ones in the cold area are called cold buffers. By default, 50% of the buffers belong to the cold area and the other 50% belong to the hot area. A newly read single data block is inserted between the cold and the hot area such that it belongs to the hot area. This is called midpoint insertion. However, this is only true for single block reads. Multi-block reads, such as in a table scan, were placed at the LRU end, read and discarded without being promoted.
With regard to part 2, what the book states “The answer is simple – indexes cannot be used when we compare values with a not equal operator.”
It is NOT that indexes cannot be used. It is more that the optimizer costing does not consider an index with a NOT EQUAL operator. However, if the data is unevenly spread (skewed) for the predicate, and the NOT EQUAL condition only satisfies a small proportion of the rows, then an index, with an index hint, can be successfully used.
Mich,
Well stated. Richard Foote’s blog agrees with your comment:
http://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/
I know of at least two engines (Oracle and Sybase) whose optimizer costing ignores the NOT EQUAL condition. I performed tests in both Oracle and Sybase using the same table and index, ensuring that only 10% of the data, say, satisfied the condition you would get with the != sign.
Just to highlight the point
Now if I try the default optimizer behaviour I get:
However, if I force it to use index t5_i1, with columns n1, ind_pad, n2, it comes back:
Apologies for using a non-Oracle engine in this case, and I trust I have not been pedantic. I just wanted to emphasise a general optimizer costing approach shared between these two engines. The clue lies in the difference between the “Total estimated I/O cost” of 37,111 with the index versus 28,704 with the table scan. So the CBO will always assume that using the index is going to be more expensive than a table scan (in this case). In terms of actual I/O, we have 3,030 with the index versus 7,184 with the table scan. That only happened because we told the optimizer to use the index and we knew about the spread of the data.
Cheers,
Mich
Mich,
Thank you for running the test on Sybase.
Now you have me curious about something regarding Sybase. Without a histogram, Oracle’s optimizer would not know that roughly 10% of the rows were not equal to 0 – is there an equivalent of Oracle’s histograms in Sybase?
As a person with (almost) no knowledge of Sybase: the estimated cost of the index access path at 37,111 is a bit higher than the estimated cost of the full table scan at 28,704 – is there a way to artificially lower the calculated cost of an index access path in Sybase, similar to the effect of the OPTIMIZER_INDEX_COST_ADJ parameter in Oracle Database (Sybase seems to count individual blocks when calculating costs, while Oracle’s optimizer costs multi-block reads), so that you could see whether Sybase might actually select the index access path without a hint?
Hi Charles,
I wanted to do some further tests to establish the reason why, after updating 90% of the data in my table for the != 0 predicate, Sybase was not using the index. Of course Sybase, using a CBO, has to rely on histograms (table-level and column-level stats) and various cluster ratios to estimate what it expects.
So first we have the following composition of the distribution of data in t5:
OK, I updated stats as usual for the table and all indexed columns.
Now the problem (as you also alluded to) was the estimate of physical I/Os for using the index, including large I/O and the associated CPU overhead! The index scan with the (n1 != 0) predicate eliminated the majority of the need to read the data pages. However, the optimizer’s cost estimate could not be 100% accurate here due to the “random” mapping between index rows and data pages (blocks). It also had to take into account another parameter called the “data page cluster ratio” (dpcr).
For an index, dpcr measures the effectiveness of large I/O for accessing the table through that index. In this case, after the 90% data update, this ratio went down (1 is good, 0 is bad; see below), so the optimizer overestimated the cost of the index plan (I/O and CPU overhead) and favoured a table scan. I had updated 90% of the data, but we did not coalesce and rebuild the index. Much like in Oracle, an index rebuild should reduce index range scan I/O, so I checked the cluster ratio before and after running “reorg rebuild” in Sybase on that table.
WITHOUT reorg rebuild, after the 90% row update, I had:
dpcr
—————————
0.461519
With REORG REBUILD this improved to:
dpcr
—————————
0.944368
Then I ran the query without any hint:
It just used the index. In actual fact it did 973 logical reads. There were 124 physical reads, of which 8 were regular 8K reads and 116 were APF (Asynchronous Pre-Fetch) multi-block reads – in this case 64K index page reads. Sybase uses a unit-less weighting factor: IO cost = (physical IOs × 25) + (logical IOs × 2) + (CPU cost × 0.1). Essentially it is a weighting applied to the (estimated) logical and physical I/O done by the query, plus CPU overhead: 2 units for a logical I/O, 25 units for a physical I/O, and 0.1 unit per CPU unit. The assumption is that data has to be read in from disk, and physical I/O may include large page reads.
So the estimate was (from the trace file – yes, it is as exciting as Oracle’s raw trace file!):
PopRidJoin refers to the physical operator (Pop) joining the index to the table through the Row ID. This follows the index scan operator PopIndScan. Much like Oracle, each operator’s output is the input to its parent operator. Here PopRidJoin is the parent operator of PopIndScan that provides the RID. You use the index to get to the rows via rowID and then in each row you get the column “padding”. The 0(L.., P.., C..) is the logical I/O, Physical I/O and CPU numbers for the top operator under serial costing.
This model seems to overestimate the CPU costing! In total it puts the cost at 17,485.48, with 928 logical I/Os, 129 physical I/Os and 123,967 CPU units. That is IO cost = 928 × 2 + 129 × 25 + 123,967 × 0.1 = 17,477.7, roughly what it reports (17,485). The optimizer considers a table scan as well, but that cost is higher now.
So the table scan’s estimated cost is now 28,174.8, compared to 17,485 for index usage. The table scan estimate is 3,226 logical I/Os and 405 physical I/Os plus 115,978 CPU units.
Back to index usage: in practice it uses no CPU time at all. It says:
Using that formula again, I get 973 × 2 + 124 × 25 + 0 × 0.1 = 5,046.0, which is what it reports.
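Mich’s weighting formula is easy to check. A minimal sketch (the function name is mine; the weights and figures are the ones quoted above):

```python
def sybase_io_cost(logical_io, physical_io, cpu_units):
    # Unit-less Sybase cost weighting quoted above:
    # 2 per logical IO, 25 per physical IO, 0.1 per CPU unit.
    return logical_io * 2 + physical_io * 25 + cpu_units * 0.1

# Optimizer estimate from the trace: 928 lio, 129 pio, 123,967 CPU units
print(round(sybase_io_cost(928, 129, 123967), 1))   # 17477.7
# Actual execution: 973 lio, 124 pio, no measurable CPU
print(round(sybase_io_cost(973, 124, 0), 1))        # 5046.0
```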
Rather long-winded, but worth exploring. Now the challenge for me is to make my Oracle t5 table use the index without any hint or costing adjustment.
Cheers,
Mich
I am trying to do the same with the same base table in Oracle – that is, making it use the index once 90% of the n1 values are set to 0. Just to catch up:
Now use the following proc to populate t5 with 100,000 rows:
Now set 90% of records with n1 = 0
OK now use trace to see what is going on.
First do the query with no modification
OK, 170 ms with 3,715 block reads. With 46 physical I/Os, that works out to roughly 80 blocks read from the table per multi-block I/O.
Now, what if I hint the index?
That is almost on par with the table scan, so not very useful, as the table scan outweighs index usage.
But I know that the leaf level of this index is highly disordered after 90,000 row updates. A rebuild or drop/create of the index may be all that is needed, but it is not a guarantee. The key value here is the efficiency of this index – a measure of how many block jumps are taken reading the base table while scanning the index. I guess this is a highly simplified explanation. The point is that these block jumps will be minimal if the ordering of the index closely matches that of the base table. OK, I drop and recreate the index, followed by a stats update, then run the query again with the index hint:
OK, that is now better. The number of disk reads has gone down from 2,599 blocks (with the fragmented index) to 1,206 blocks after the index rebuild. With the index hint after the rebuild, the elapsed time is down from 170 ms for the table scan to 100 ms for the index scan.
As Charles suggested, I tried to adjust the costing to favour the index and ran my query. This was done, as before, after updating 90% of rows to n1=0 and dropping and recreating the index t5_i1.
So even at a 10% cost adjustment, the optimizer still favours the full table scan.
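The cost adjustment experiment might look like this. A sketch only: the value 10 is the adjustment mentioned above, and the table/column names follow Mich’s t5 example rather than a verified script.

```sql
-- Scale the calculated cost of index access paths down to 10%
ALTER SESSION SET OPTIMIZER_INDEX_COST_ADJ = 10;

-- Force a hard parse so the new environment is used, then re-test
ALTER SYSTEM FLUSH SHARED_POOL;
SELECT * FROM T5 WHERE N1 <> 0;
```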
So far (AFAIK) we have seen examples where an index hint can provide faster results. However, does that mean that without an index hint the optimizer is going to choose an FTS with a != predicate under all conditions?
Thanks,
Mich
Mich,
Thank you for providing the experiment where you compared Sybase with Oracle Database – a case where an index access is the obvious choice only for Sybase. If I recall correctly, Oracle’s optimizer does not even calculate the cost of the index access path for an inequality – the optimizer just assumes that the index access path will be inefficient.
I believe that Richard Foote covered the inequality scenario here:
http://richardfoote.wordpress.com/2008/08/13/indexes-and-not-equal-not-now-john/
Guys you rock here.
Very interesting stuff for learners like me.
Can you please clarify whether the CBO always calculates estimated costing on the basis that the data has to be brought in from persistent storage (disk)? This could be single or multi-block reads, and it would ignore whether some or all of the data is already in the buffer cache.
Regards
Ashok
Ashok,
The simplified answer to your question is that the query optimizer’s calculated costing assumes that all blocks accessed will require physical reads from disk. Oracle Database provides the OPTIMIZER_INDEX_CACHING parameter to allow the optimizer to make assumptions about index block caching for in-list operations and nested loops operations (see http://docs.oracle.com/cd/E14072_01/server.112/e10820/initparams161.htm ).
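For example, the parameter can be adjusted at the session level (a sketch; the value 90 is purely illustrative, and tells the optimizer to assume 90% of index blocks are cached for nested loop and in-list costing):

```sql
ALTER SESSION SET OPTIMIZER_INDEX_CACHING = 90;
```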
Randolf Geist has a couple of articles that may be of interest to you:
http://oracle-randolf.blogspot.com/2011/07/cost-is-time-next-generation.html
http://oracle-randolf.blogspot.com/2009/04/understanding-different-modes-of-system.html
Charles,
Thanks for the opportunity to experiment with some additional stuff and learn more.
I made a statement to the effect that “..I know at least two engines (Oracle and Sybase) that have optimizer costing ignoring NOT EQUAL condition..”. Not true. One of them honours the index when conditions favour the use of an index. So I was wrong here, and I was wrong to state that in my Transact SQL book. Also, I tried your original test case “CREATE TABLE T1 AS SELECT ROWNUM C1 ..” on Sybase, and that used the index as well. Obviously, with Oracle, the optimizer just ignores the index by default.
Mich
Guys,
I have something that I would like someone to shed some light on for me.
Just to recap, the broad agreement is that the optimizer can use an index with not equals, but in general it will not, and in general it would not make sense unless the data were massively skewed. Perhaps it would be easier to point out that the optimizer will not use an index range scan for a not equals.
So the optimizer can use an index fast full scan – that is, read the index as if it were a table using multiblock I/O; it does not read the index in sorted order, it just reads the blocks as it hits them.
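A hedged illustration against the T1 table from the start of the article (the INDEX_FFS hint requests a fast full scan; this is viable here because C2 is declared NOT NULL, so the index contains an entry for every row):

```sql
SELECT /*+ INDEX_FFS(T1 IND_T1_C2) */ COUNT(*) FROM T1 WHERE C2<>0;
```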
If we have above does Oracle do some prefetching of the blocks at all to speed up the procressing? I am coming from this angle that in Sybase we see:
So we have the equivalent of index fast full scan starting at index start and using 64K reads for each physivcal IO. In this case my database is 8k page (think block) and in one IO it does 8 index pages. Index pages are denser than data pages. However, this large IO read serves another purpose. There is a term APF for Asynchronous Pre Fetch or read ahead pages. At the bottom it say physical reads total = 381 of which there were 373 APF and 8 regular. That means that Sybase dispatches the IO request *ahead* of when it is actually needed. So the only time it has to wait for an APF is when it gets to a page that should have been there for an APF and it had not been prefetched yet. So the whole idea of an APF is to asynchronously prefetch pages so that when the execution engine gets to that page, it is a logical IO vs. a physical IO. This is supposed to speed up the process.
I am pretty sure Oracle can do the same. However, I am looking around in the raw trace file for some signal. Jonathan in his blog http://jonathanlewis.wordpress.com/2006/12/15/index-operations/ mentioned “Check the 10053 trace and you will probably see a line like “index prefetching enabled” somewhere. However, I am afraid I had little luck with this.
I would appreciate any input on this.
Mich
Mich,
I believe that your statement is correct with a small adjustment:
That statement agrees with my understanding.
I am not sure that I understand this statement. During an index fast full scan, the runtime engine will read multiple blocks at a time, up to the number of blocks specified by DB_FILE_MULTIBLOCK_READ_COUNT (the actual number of blocks read in a single read request could be smaller based on extent sizes and blocks that are already in the buffer cache). As far as I am aware, this multi-block read is not considered “prefetching”, which seems to be more commonly associated with batching multiple single-block reads (for instance during an INDEX FULL SCAN) that would otherwise appear as db file sequential reads, converting them into either db file scattered reads (logically adjacent blocks) or db file parallel reads (logically non-adjacent blocks) during an index range scan. The number of logically non-adjacent blocks read using a db file parallel read is controlled by the _DB_FILE_NONCONTIG_MBLOCK_READ_COUNT hidden parameter, which seems to default to a value of 11 blocks in 11.2.0.2. If I remember correctly, _DISABLE_INDEX_BLOCK_PREFETCHING may be set to TRUE to disable this optimization, and there may also be an Oracle EVENT that disables the combining of multiple non-adjacent block reads into a single db file parallel read request. See the following OTN thread for a discussion between Timur Akhmadeev and me regarding index prefetching:
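The distinction between adjacent and non-adjacent block batching can be illustrated with a hypothetical helper (a sketch, not Oracle code; the defaults of 16 and 11 are assumptions echoing DB_FILE_MULTIBLOCK_READ_COUNT and the hidden 11-block default mentioned above):

```python
def plan_reads(blocks, mbrc=16, noncontig=11):
    """Split a set of needed block numbers into multi-block reads of
    adjacent blocks (think: db file scattered read) and batches of
    non-adjacent blocks (think: db file parallel read)."""
    blocks = sorted(blocks)
    # Group the block numbers into runs of consecutive values.
    runs, run = [], [blocks[0]]
    for b in blocks[1:]:
        if b == run[-1] + 1:
            run.append(b)
        else:
            runs.append(run)
            run = [b]
    runs.append(run)

    scattered, singles = [], []
    for r in runs:
        if len(r) > 1:
            # Long adjacent runs are split at the multiblock read count.
            scattered += [r[i:i + mbrc] for i in range(0, len(r), mbrc)]
        else:
            singles += r
    # Isolated blocks are batched into parallel read requests.
    parallel = [singles[i:i + noncontig]
                for i in range(0, len(singles), noncontig)]
    return {"scattered": scattered, "parallel": parallel}
```

For example, `plan_reads([1, 2, 3, 10, 20, 30])` yields one scattered read for the adjacent run `[1, 2, 3]` and one parallel read for the three non-adjacent blocks.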
https://forums.oracle.com/forums/thread.jspa?threadID=941864
Jonathan Lewis also nicely summarized prefetching on page 82 of this article:
Click to access ls2.pdf
We are able to see prefetching with this example (using the same table and index created at the start of this article):
I will process the following query in the DBMS Xplan viewer in my Hyper-Extended Oracle Performance Monitor (https://hoopercharles.wordpress.com/2012/03/12/thoughts-on-a-hyper-extended-oracle-performance-monitor-beta/ ) with an enabled 10046 and 10053 trace:
The execution plan output follows:
The 10046 portion of the trace file includes the following, which demonstrates prefetching of index and table blocks:
I did not find the words “index prefetching”, or “prefetch” in the 10053 portion of the trace file, so I suspect that notation has been removed from recent versions of 10053 trace files.
Hi Charles,
Thank you for your explanation and example. I think multi-block IO *is* prefetching by definition. So I still have this question, if I may:
Now, with multi-block IO, by the time the query engine gets to that block it is a kind of logical IO as opposed to a physical IO?
That means that Oracle dispatches the IO request *ahead* of when it is actually needed. So the only time it has to wait for a prefetched block is when it gets to a block that should have been there for multi-block IO and it had not been prefetched yet. So the whole idea of a multi-block IO is to asynchronously prefetch blocks so that when the query engine gets to that block, it is a logical IO vs. a physical IO?
Mich
Mich,
Sorry for the delay in response.
After reading the following blog article, I am a bit more inclined to agree with your statement that multi-block IO is prefetching by definition:
http://oracle-randolf.blogspot.com/2011/07/logical-io-evolution-part-2-9i-10g.html
However, by extension of the same definition (“multi-block IO is prefetching by definition”), reading an entire 8KB database block (when only the first row in the block is currently of interest) could also be considered “prefetching by definition”. 😉
Your second point might need a little clarification because logical IOs may or may not result in physical IOs. It might be more accurate to state that multi-block reads help reduce the average time to access each block when physical block reads are required – this is especially true if hard drive heads must be repositioned (or the required delay of waiting for the hard drive platter to spin around again – 0.004 seconds for a 15,000 RPM drive). With a multi-block read (when physical hard drives are used, with no built-in read ahead cache), most of the latency in the multi-block physical read (assuming a db file scattered read or direct path read) occurs when accessing the first block, while the rest of the blocks accessed in the single read request are obtained *mostly* with very little additional latency.
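The amortization argument above can be put into rough numbers (all figures are assumptions for illustration, not measurements of any particular drive): the positioning cost of a read request is paid once, so the more blocks each request returns, the lower the average cost per block.

```python
# Rough single-disk latency model. Average rotational latency is half a
# full rotation: at 15,000 RPM one rotation takes 60/15000 = 0.004 s.
rpm = 15000
full_rotation_s = 60.0 / rpm            # 0.004 s per rotation
avg_rotation_s = full_rotation_s / 2.0  # wait half a rotation on average

seek_s = 0.004        # assumed average head-repositioning time
transfer_s = 0.0001   # assumed per-block transfer time once positioned


def avg_time_per_block(blocks_per_read):
    """Seek + rotational latency is paid once per read request, so
    reading more blocks per request amortizes it across all blocks."""
    total = seek_s + avg_rotation_s + blocks_per_read * transfer_s
    return total / blocks_per_read
```

Under these assumed figures, a single-block read costs about 6.1 ms per block, while a 16-block multi-block read costs well under 0.5 ms per block, which is why the first block of a db file scattered read carries most of the latency.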
Some hard drives (and drive controllers) have special algorithms to reduce hard drive head movement (and rotational) latency when multiple block requests are in the read/write queue (sometimes called native command queuing). It seems that non-logically-adjacent blocks that are prefetched with DB FILE PARALLEL READ through a prefetch operation are designed to leverage the read/write command queuing algorithms to reduce the average physical block read time: http://docs.oracle.com/cd/E11882_01/server.112/e17110/waitevents003.htm#autoId30
There are probably better (and more accurate) ways to state the above.
Charles,
Many thanks as usual for your analysis.
It seems that I am hearing more gurus use the term “multi-block IO is prefetching”, and I believe every DBMS that does prefetching does so asynchronously. So I wish we could use asynchronous prefetch as the naming convention 🙂
So we have now established that with full table scans and fast full index scans, Oracle will deploy prefetching where it can. So there is obviously a part of the execution engine that does the prefetching and another part that does the query work?
When the query part of the execution engine (is this correct?) gets to the block it needs and it is not there, then it has to wait. That block could already be a dirty buffer. In that case, I assume that it will have to go to undo to get that block, which will incur an additional physical IO?
Mich
Mich,
Sorry again for the delayed response – I have been preoccupied with non-Oracle Database items lately.
I am not sure that I would classify prefetching as an asynchronous operation – that could cause a bit of confusion. Asynchronous I/O is controlled by the DISK_ASYNCH_IO parameter (and the FILESYSTEMIO_OPTIONS parameter):
http://docs.oracle.com/cd/E11882_01/server.112/e10803/config_db.htm#BGBIHHBD
As shown at the above web page, the V$IOSTAT_FILE view may be queried to determine if asynchronous I/O is enabled for various file types. A slightly modified version of the SQL statement executed on a 11.2.0.2 Windows database instance:
The primary reason why I do not consider prefetching to be an asynchronous operation is because the Oracle process requires one or more of the requested blocks from disk in order to continue processing (unless there are exceptions that I am not considering) – processing for the process halts until the required block is read from disk. Some people will argue that reading a block from memory is 1000 times faster than reading the same block from disk (others will argue that it is closer to 100 times faster due to all of the code that executes during a logical read) – this speed difference, coupled with the fact that something provoked the Oracle process to request the pre-fetched blocks from disk (maybe one of the 16 blocks read is known to be required), means that the session’s process likely is not number crunching asynchronously while a block read request completes.
It is not sufficient to state that dirty buffers require undo to be applied – a dirty buffer is a block that was modified since being read from disk. The data block could have been written to the datafiles when there were uncommitted changes in the block. Undo is stored in a tablespace, just like the regular table and index blocks, and the undo blocks also share the buffer cache with the regular table and index blocks. The requirement to apply undo to logically roll a block back to a specific SCN does not necessarily require an additional physical I/O.
Hi Charles,
Many thanks for your explanation. I have come to the conclusion that your point about prefetch not being asynchronous is valid, although I still think that with a full table scan or index fast full scan, where Oracle reads the index as if it were a table using multiblock IO and does not read it in sorted order, there is some asynchronous prefetching going on.
I was actually looking at an issue of deadlock on indexes with two procs running at the same time, and while browsing Steve Adams’ internals book on latches etc. I came across a statement that with STATISTICS_LEVEL = ALL, Oracle takes far more latches. In contrast, it appears that when the parameter is set to TYPICAL (the default), the rowsource/plan execution statistics are disabled. With STATISTICS_LEVEL = ALL, a typical query runs many times slower, and I gather it is due to system overheads, although I have not seen any particular explanation for it other than “do not use STATISTICS_LEVEL = ALL unless you have to!”
Cheers,
Mich
Mich,
I had heard previously that most of the overhead associated with setting the STATISTICS_LEVEL to ALL is associated with system calls to obtain the current time of day (gettimeofday). This particular issue is mentioned in Metalink (MOS) Doc ID 436797.1, and seems to be more severe prior to Oracle Database 11.1. There is a summary in the “Expert Oracle Practices” book on page 313 that demonstrates the impact of the TYPICAL and ALL values of STATISTICS_LEVEL for a SQL statement that performs a large number of nested loops joins:
http://books.google.com/books?id=tUNtyMCwDWQC&pg=PA313
I will have to look into the increase in the number of latch gets with STATISTICS_LEVEL=ALL – I had not considered latching as a potential threat.
The deadlock on indexes that you mention – might that just be a sign of two sessions trying to insert rows with the same primary key value into two different tables?
Charles,
Your point on “The deadlock on indexes that you mention – might that just be a sign of two sessions trying to insert rows with the same primary key value into two different tables?”
Slightly more interesting than that. The same proc running with two different sets of parameters was trying to update the same table, which has one normal b-tree index. The issue is that the proc was doing multiple updates, and depending on the set size (chosen by parameters), one part of the code was doing a table scan while other parts were using the index. The deadlock was one instance of the proc doing a table scan while the other was using the index (much smaller result set).
Steve Adams in his book p 28-29 mentions to the effect that
“Oracle expects latches to be held only briefly and intermittently. Otherwise, if the use of a latch is not brief or intermittent, there can be contention (remember they do not allow multiple processes to inspect the protected data concurrently). It makes sense to expect that a single Oracle process may hold a number of latches concurrently. Therefore there is a possibility of a latching deadlock occurring: one process holding latch A and another process holding latch B, and both processes spinning [on their semaphore] and waiting for the alternate latch. Oracle ensures that this cannot happen by ensuring that latches are always taken in a defined order when more than one latch is needed. To support this, Oracle must maintain different latch levels. When a process attempts to get a latch, a check is made to ensure that it is not already holding a latch of the same level or a latch of a higher level. A process that needs a higher level latch has to sleep while holding a lower level latch. In this scenario, the lower level latches are held much longer than normal. In general, if this happens, it can result in a deadlock.”
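The ordering rule described in that passage is a general concurrency technique, and it can be sketched with ordinary Python locks (a toy model of latch levels, not Oracle internals): both workers need both "latches" and request them in opposite orders, the classic deadlock recipe, but acquiring in ascending level order defuses it.

```python
import threading

# Two "latches" with assigned levels. The ordering rule: always acquire
# in ascending level order, so no circular wait can ever form.
latch_a, latch_b = threading.Lock(), threading.Lock()
LEVEL = {id(latch_a): 1, id(latch_b): 2}

completed = []


def worker(name, latches):
    ordered = sorted(latches, key=lambda l: LEVEL[id(l)])
    for l in ordered:           # always level 1 then level 2,
        l.acquire()             # regardless of the caller's order
    completed.append(name)      # critical section
    for l in reversed(ordered):
        l.release()


# One caller asks for (A, B), the other for (B, A).
t1 = threading.Thread(target=worker, args=("p1", [latch_a, latch_b]))
t2 = threading.Thread(target=worker, args=("p2", [latch_b, latch_a]))
t1.start(); t2.start()
t1.join(); t2.join()
```

Without the `sorted` step, this interleaving could leave p1 holding A waiting for B while p2 holds B waiting for A; with it, both threads always complete.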
Interesting point but still investigating it.
Mich
Mich,
Sorry again for the delay in my response. It seems that I am having a bit of difficulty finding time to respond to comments lately due to a number of non-Oracle related activities. Windows Update is broken on a relative’s computer, and as a side effect, I cannot install the latest version of the virus scanner brand that I have used since 1995. I have been fighting that problem for almost a week. Lots of other non-Oracle computer related things too.
—
Hanging head in shame I have not yet had a chance to read the book that Steve Adams wrote. I did manage to find his book in the Google Books library:
http://books.google.com/books?id=KG4j7NuHbRsC&pg=PA28
What is shown in the Google Books library seems to indicate that your quote above is a paraphrase of the book – not necessarily a bad paraphrase. For instance, the last sentence on page 29 states ( http://books.google.com/books?id=KG4j7NuHbRsC&pg=PA29 ):
The description of your problem regarding deadlocking on indexes seems to point a little more directly at enqueues (locks) rather than latches (I could of course be completely wrong). Jonathan Lewis took an advanced look at the differences between enqueues and latches in his recent “Oracle Core” book:
http://books.google.com/books?id=G9AJA91PL54C&pg=PA59
The AskTom site has also covered the differences between latches and enqueues several times (you may not be able to click this link):
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:10899066672589
In Oracle Database 10.1 some types of latches started to be replaced with mutexes (I think that Tanel Poder mentioned specific latches on his blog), and more were replaced in 10.2, 11.1, and 11.2. What was absolutely true about Oracle Database latching back in 1999 may have changed over the course of the last 13 years.
Index hash join
I am referring to Charles’ table at the beginning of this thread (table T1). A while back I was testing it. The table I have is practically the same as below
I ran it as
tkprof output shows:
Now, the index IND_T1_C1 is a unique index, and we know that IND_T1_C2 has one row that qualifies for C2 != 0. However, it chooses the 1,000,000-row result set as the build stream and the 1-row result set as the probe stream. This does not look correct unless some role reversal is happening afterwards?
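The build/probe distinction can be illustrated with a toy hash join in Python (a conceptual sketch, not the Oracle runtime): a reasonable implementation builds the in-memory hash table on the smaller input and probes it with the larger one, which is exactly the swap that run-time role reversal performs when the plan's nominal build side turns out to be the bigger stream.

```python
def hash_join(left, right, key):
    """Toy hash join: build a hash table on the smaller input and probe
    with the larger, regardless of which side was passed first."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)

    # Build phase: hash every build-side row by its join key.
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)

    # Probe phase: look up each probe-side row; emit merged matches.
    out = []
    for row in probe:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out


small = [{"id": 1, "x": "a"}]               # the 1-row stream
big = [{"id": i} for i in range(1000)]      # the large stream
joined = hash_join(big, small, "id")        # big passed first; roles swap
```

Here the 1-row side is hashed even though the large side is passed first, so only one row's worth of memory is used for the build table.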
Mich