July 4, 2012
It has been several months since I posted my review of the first half of the “Oracle Database 11gR2 Performance Tuning Cookbook” book. I had planned to post the review for the second half of the book a couple of weeks later, however four months have passed by since that time, and the second half of the review has yet to materialize. Someone recently left the following comment attached to one of my reviews on Amazon:
“This is the most comprehensive feedback I have ever read!!”
I appeciate the comment. Keep in mind that particular review was only roughly 9.5 typewritten pages in length; it takes a lot of time to put together an 18, 24, or 35 typewritten page review that is just as comprehensive as the one that elicited the above comment.
On to the topic of this article…
While reading the “Oracle Database 11gR2 Performance Tuning Cookbook” book, I noticed an interesting set of statements about the use of histograms (bolded sections appeared bolded in the book):
“Histograms are useful only on indexed columns containing skewed values, because they help the optimizer to choose whether to use the index or not to access values. Obviously, if the frequency for a specific value is very high, using the index won’t be the best choice….”
“Don’t use histograms in situations where:
- The column is not used in the WHERE clauses of the queries
- The data in the column is uniformly distributed
- Bind variables are used when comparing against the column”
What, if anything, is wrong with the above quote from the book? If possible, provide test cases to support or refute the above quote from the book (see the tips at the bottom of the blue section at the right of this blog page regarding how to post code sections).
The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.
Well, there are several things here. First, histogram can also be used on a non-indexed column, in a nester loop type join, to determine the access path to the inner table. Also, it may be used to decide between hash join and sort/merge join,
Second, the phrase “do not use histograms in the situations where columns is not used in the where clauses” strikes me as odd. Histograms are not used in the situations, they’re objects, created on columns or, in version 11G, groups of columns or column expressions.
Third, the advice about the bind variables. Oracle has bind variable peeking since version 9i and the first value used is put through the CBO paces, including histograms, to determine the execution plan. Some newer versions have something called “adaptive cursor sharing”, which will cause the statement to be re-parsed if the bind variable values have different characteristics. Again, ACS and bind aware cursors rely on histograms.
About the only thing I concur with is the statement about histograms being unnecessary in case of uniform distribution. That is also my pet peeve with Oracle: it should not create histograms on columns with unique constraints and declared not null. Single column primary keys should not have histograms, period.
Finally, happy 4th of July to everybody. Charles, you should ignore histograms today and devote some attention to BBQ. Ah yes: histograms should not be used on 4th of July.
Mladen,
Histograms on the 4th of July, probably not the best combination. 😉
Nice answers to start the dialog regarding this quote from the book. In the book, the third bullet point item mentioned that adaptive cursor sharing with bind peeking would be covered in a later recipe, but the bullet point item did not mention whether or not adaptive cursor sharing makes the bullet pointed statement stronger or weaker.
I found something related to the book quote, from the Oracle Database 8.1.7 documentation:
http://docs.oracle.com/cd/A87860_01/doc/server.817/a76992/stats.htm#27069
Equality predicates… that must be the tie in with the 4th of July.
Since “skewed values” is in bold in the book, I wonder if the definition of “skewed” is important. To me, that term means that some of the distinct values in a column have significant more (or less) rows with that value, when compared to other distinct values in the same column. Uniformly distributed could mean that all of the distinct values in a column have roughly the same number of rows sharing those distinct values.
Let’s set up a simple test case. Since I read somewhere that NULL values cause problems, I will use an “out of bound” value of 10,000 in place of any NULL that might naturally appear in a column. The table built will contain the numbers 0 through 100 in column C2 and 1 through 100 in column C3 with every 101st row considered to have a NULL value in that column (thus the value 10,000):
In the above, statistics were collected on the table without histograms. Let’s try a couple of tests. First, I want to see the rows with either the maximum value in column C3, or the “out of bound” value representing a NULL.
That worked well without a histogram, the cardinality estimate is exactly correct. If table T2 were then joined to another table, the accurate cardinality estimate could help the optimizer determine if table T2 should be the driving table, and which type of join method should be most efficient.
Let’s try again with a SQL statement that produces the same resultset as the above SQL statement:
That did not work so well. There are only 100,000 rows in the table, and the optimizer is predicting that roughly 100,000 rows will be returned by the query. If the resultset from table T2 were then joined with another table, the performance could be less than optimal due to the error in the cardinality calculation.
Let’s try again, this time trying to retrieve the values without a “NULL” in column C3:
The optimizer also had difficulty with the cardinality calculation for this query. Not a big deal for this query because there is no join, and no index on column C3. But, what if there were a join or an index on column C3?
Let’s try again with a histogram on column C3:
The first query:
The cardinality estimate is less accurate than before the histogram was built, but the estimate is still reasonably close.
The second query:
The cardinality estimate is significantly better than we saw before the histogram was built.
The third query:
Again, the cardinality estimate is significantly better than we saw before the histogram was built.
—
Sorry, I might have skewed too far off topic. Any other comments? “What, if anything, is wrong with the above quote from the book?” I will withhold my comments in this article for at least 24 hours.
“About the only thing I concur with is the statement about histograms being unnecessary in case of uniform distribution. That is also my pet peeve with Oracle: it should not create histograms on columns with unique constraints and declared not null. Single column primary keys should not have histograms, period.”
Oracle 11 does not gather a histogram for such columns – unless the column values have huge outliers and in that case it makes sense to have a histogram if you have a range predicate on that column. Where it still gathers a (frequency ) histogram is for columns with a single value for all rows and there I can not see any rationale for having a histogram. I’m not sure right now, would have to test it, but it may be that the optimizer ignores the statistics ( density ) in this case and uses a more sensible density calculated at run time.
Thanks Wolfgang, this is very good to know.
On the other hand, I just checked and here is my result:
SQL> analyze table emp delete statistics;
Table analyzed.
Elapsed: 00:00:00.03
SQL> select column_name,count(*) from user_tab_histograms
2 where table_name=’EMP’
3 group by column_name;
no rows selected
SQL> get /tmp/1
1 begin
2 dbms_stats.gather_table_stats(
3 ownname => user,
4 tabname => ‘EMP’,
5 method_opt => ‘FOR ALL INDEXED COLUMNS SIZE 254′);
6* end;
SQL> /
colum column_name format a15
SQL> select column_name,count(*) from user_tab_histograms
2 where table_name=’EMP’
3 group by column_name;
COLUMN_NAME COUNT(*)
————— ———-
SYS_NC00009$ 12
ENAME 14
EMPNO 14
Elapsed: 00:00:00.02
SQL>
In other words, Oracle has duly collected stats on the single-column primary key (EMPNO). The version is 11G:
SQL> select * from v$version;
BANNER
——————————————————————————–
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – 64bit Production
PL/SQL Release 11.2.0.3.0 – Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 – Production
NLSRTL Version 11.2.0.3.0 – Production
Elapsed: 00:00:00.01
SQL>
January and April PSU’s are applied. This is as new as it gets.
But you asked it to do that with your method_opt of ‘… size 254′. Then, of course it does what you asked. It will do the right thing if you use method_opt=>’for all columns size auto’
Wolfgang, why would I do “size auto” when I don’t know what size of a histogram will Oracle create? I am usually using size 254 or size 127 but never “size auto”. In some releases, SIZE AUTO means that there will be only 2 buckets: minimum and maximum, which is about the same as no histograms at all.
This is a reply to Mladen’s last reply. For some reason there is no reply “button” on his reply.
If you do a size nnn then that is exactly what gather_stats will do. It could be fewer buckets if there are fewer than nnn distinct values of the columns. If there are more distinct values than nnn then you end up with a height-balanced histogram. In any case, yo WILL get a histogram because that is what you requested. If you ask for a histogram with method_opt=>’… size nnn’ then I there is no reason to go with anything but size 254 – unless you deliberately want to force a height-balanced histogram for a column with fewer than 255 distinct values. But in that case you better know exactly what you are doing and you may actually be better off to not gather the histogram in that case but construct it the way you want manually.
If “size auto” gives you only two buckets then that is either because of one of the cases we discussed, e.g. the column of a single-column unique or primary key, where gather_stats determined that a histogram makes no sense. Or the column has never been used in a where clause.
I agree with Mladen: The first statement seems definitely wrong: “Histograms are useful only on indexed columns…” That’s at least what I have learned from Wolfgang Breitling in HISTOGRAMS – MYTHS AND FACTS (http://www.centrexcc.com/Histograms%20-%20Myths%20and%20Facts.pdf). Wolfgang shows clearly that Histograms on non-indexed columns are very useful when evaluating the cost of different join methods and orders.
The second point refers to bind variables and histograms: It’s true that in the past histograms and bind variables on the same column have been a very dangerous combination. With Adaptive Cursor Sharing (ACS) I would recommend on having a histogram on such columns. In other words: I would regard upon the “Don’t use histogram” for columns with Bind Variables as misleading to say the least.
There are a lot of excellent comments attached to this blog article. The linked article written by Wolfgang Breitling is excellent – I am sure that I read the article at least once a couple of years ago (a few changes were made to the query optimizer in Oracle Database over the years, so there are a couple of minor items in the article that no longer work as described in the latest Oracle Database releases). I thought that I would share my review comments for this section of the book (the following review comments echo many of the comments provided by other readers of this blog article).