June 28, 2011
While reading the alpha edition of the “Oracle Database 11g Performance Tuning Recipes” book, I noticed a couple of interesting SQL statements in Recipe 5-17 that just did not look right. One of those SQL statements, slightly reformatted, appears below:
SELECT A.CURRENT_OBJ#, O.OBJECT_NAME, O.OBJECT_TYPE, A.EVENT, SUM(A.WAIT_TIME + A.TIME_WAITED) TOTAL_WAIT_TIME FROM V$ACTIVE_SESSION_HISTORY A, DBA_OBJECTS D WHERE A.SAMPLE_TIME BETWEEN SYSDATE – 30/2880 AND SYSDATE AND A.CURRENT_OBJ# = D.OBJECT_ID GROUP BY A.CURRENT_OBJ#, D.OBJECT_NAME, D.OBJECT_TYPE, A.EVENT ORDER BY TOTAL_WAIT_TIME;
The above SQL statement is supposed to retrieve a list of object names that experienced the most wait time in the last 15 minutes. What is wrong with the query? Consider that a slight variation of the above query is found in the following locations:
- Page 1,170 of the book “Expert Oracle Database 11g Administration“
- An OTN thread where the original poster claimed that the query required too much time to execute.
- An undated news article.
- Page 213 of the book “Oracle Tuning: The Definitive Reference, Second Edition“
- Page 160 of the book “Oracle Tuning Power Scripts: With 100+ High Performance SQL Scripts”
So, what is wrong with the query? OK, the table aliases are inconsistent, but that is too easy. Let’s refer to the Oracle Database 11.2 documentation for the V$ACTIVE_SESSION_HISTORY view.
“V$ACTIVE_SESSION_HISTORY displays sampled session activity in the database. It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn’t belong to the Idle wait class. Refer to the V$EVENT_NAME view for more information on wait classes.
This view contains one row for each active session per sample and returns the latest session sample rows first. A majority of the columns describing the session in the active session history are present in the V$SESSION view.”
Still not seeing the problem? If we believe the documentation, assume that we are only interested in a 10 second time interval, and one of the sessions started waiting on a wait event at the start of the time interval, and was still waiting in the same wait event at the end of the 10 second time interval. SUM(A.WAIT_TIME + A.TIME_WAITED) for the session (assuming that rounding does not happen due to view update frequency) would show 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55 seconds of total wait time for the result of the SUM function. Not too bad, right? Now, assume that the session had already waited 240 seconds and continued to wait for an additional 10 seconds – the SUM(A.WAIT_TIME + A.TIME_WAITED) for the session would show that the session waited 240 + 241 + 242 + 243 + 244 + 245 + 246 + 247 + 248 + 249 + 250 = 2695 seconds (44.92 minutes) in that 10 second time interval.
—
A fun little test that will require two sessions. In Session 1:
CREATE TABLE T1 ( C1 NUMBER, PRIMARY KEY(C1)); INSERT INTO T1 VALUES ( 1);
Now, in Session 1, determine the SID:
COLUMN SID FORMAT 999 SELECT SID FROM V$MYSTAT WHERE ROWNUM=1; SID ---- 189
In Session 2, determine the SID:
COLUMN SID FORMAT 999 SELECT SID FROM V$MYSTAT WHERE ROWNUM=1; SID ---- 4
Back in Session 1, insert a row into the table:
INSERT INTO T1 VALUES ( 1);
In Session 2, attempt to insert the same row into the table (a primary key violation if the first session commits):
INSERT INTO T1 VALUES ( 1);
Session 2 is now hung. Back in Session 1, wait 20 minutes and check AWR ASH using a slightly fixed version of the query (mainly the table aliases are corrected, and adjusted to examine only the last 10 seconds of entries in V$ACTIVE_SESSION_HISTORY for the two sessions of interest):
SELECT A.CURRENT_OBJ#, D.OBJECT_NAME, D.OBJECT_TYPE, A.EVENT, SUM(A.WAIT_TIME + A.TIME_WAITED) TOTAL_WAIT_TIME FROM V$ACTIVE_SESSION_HISTORY A, DBA_OBJECTS D WHERE A.SAMPLE_TIME BETWEEN SYSDATE – 10/(24*60*60) AND SYSDATE AND A.CURRENT_OBJ# = D.OBJECT_ID AND A.SESSION_ID IN (4,189) GROUP BY A.CURRENT_OBJ#, D.OBJECT_NAME, D.OBJECT_TYPE, A.EVENT ORDER BY TOTAL_WAIT_TIME; no rows selected
That can’t be the expected result. Let’s try running the following query in Microsoft Excel:
SELECT SESSION_ID, EVENT, CURRENT_OBJ#, SAMPLE_TIME, WAIT_TIME, TIME_WAITED FROM V$ACTIVE_SESSION_HISTORY WHERE SESSION_ID IN (4,189) ORDER BY SESSION_ID, SAMPLE_TIME;
After scrolling down a couple of thousand lines in the spreadsheet:
Well, that throws off the results of the query found in the book a bit. The CURRENT_OBJ# is -1 and the session has not waited any time (SUM(A.WAIT_TIME + A.TIME_WAITED) = 0 for SID 4, but would be 116,972,415 for SID 189 for the same CURRENT_OBJ# = -1).
Maybe the query found in the book just does not work for potential primary key violations? What about a case where one session updates a row and another session attempts to update the same row? Issue a COMMIT in Session 1 and Session 2. In Session 1:
COMMIT; UPDATE T1 SET C1=C1;
In Session 2, attempt to update the same row:
UPDATE T1 SET C1=C1;
Session 2 is hung while waiting for Session 1 to COMMIT or ROLLBACK. Now wait for a while. While you are waiting, let’s take a quick look at the documentation for the V$ACTIVE_SESSION_HISTORY view from the 11.2 Oracle Documentation library:
“TIME_WAITED: If SESSION_STATE = WAITING, then the time that the session actually spent waiting for that event (in hundredths of a second). This column is set for waits that were in progress at the time the sample was taken.
If a wait event lasted for more than a second and was caught waiting in more than one session sample row, then the actual time spent waiting for that wait event will be populated in the last of those session sample rows.”
OK, must check the SESSION_STATE column, only the last of those sample rows in a string of rows for the same wait event, and the time is in 1/100th (0.01) of a second. Let’s re-run the query in Excel to see the progress:
That 2,471,728,088 number in the TIME_WAITED column appears to be in conflict with the description in the Oracle Documentation library – the time appears to be reported in 1/1,000,000 (0.000001) of a second, not 1/100 of a second.
—
Back to the original topic of this article. I have yet to determine why the query subtracts 30/2880 from the SYSDATE to retrieve the activity in the last 15 minutes – wouldn’t it be more clear to subtract 15 / (24 * 60) or 15 / 1440? Considering that the SAMPLE_TIME column in V$ACTIVE_SESSION_HISTORY is a TIMESTAMP datatype, would it make more sense to use SYSTIMESTAMP rather than SYSDATE?
Finally, What is Wrong with the Quote Found in the Book?
The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.
Errm isn’t what’s wrong with it that objects don’t experience waits, sessions (or sql statements) do.
Niall,
After seeing your response I thought that I made a mistake in the title of this article, so I double-checked. The book states just prior to the query: “Execute the following query to find out the objects with the highest waits”
By the way, excellent point – even if I was looking for a bit of a different answer. This chapter of the book needs a complete re-write – I have close to 8 pages of review notes (accuracy adjustments, so to speak) just for chapter 5.
If I place your first 20 rows from your initial query, without time clause (I have 30 days AWR repository window), and order by “TOTAL_WAIT_TIME DESC” I get most of waits without “EVENT” value ….
Any explanation why?
🙂
Oracle,
Just a guess:
http://download.oracle.com/docs/cd/E14072_01/server.112/e10820/dynviews_1007.htm
I think that you may have found… possibly… another problem with the SQL statement. 🙂
THX!
Charles hit upon the main point, which can be easily missed, that the event fields are only relevant for rows that indicate waiting. ASH doesn’t always clear out the values in other fields, which leads to another confusing issue. The object fields
CURRENT_OBJ#
CURRENT_FILE#
CURRENT_BLOCK#
are only valid for certain waits, which are AFAIK
IO waits
buffer busy waits
enqueue tx waits
and even for these waits the values may be missing, -1, or 0 and in those cases the only way to get the objects is from the file and block #s, using p1 and p2 for most of the user I/O waits:
And small digestion…
With RAC (gv) query result looks like:
Hi,
Using sum for v$active_session_history is a bad thing – you can’t summarize samples it has to be counted instead – that’s first issue.
TIME_WAITED column has different meaning depend on session_state (ON_CPU or WAITING) and if one event (wait) is longer than sample period (1s)
waited time of one event will be added more than once.
All queries based on ASH or DBA_HIST_ASH should use COUNT instead of SUM – see very good presentation here – http://www.oracle.com/technetwork/database/focus-areas/manageability/ppt-active-session-history-129612.pdf.
Marcin,
Very interesting document that you linked to – if I remember correctly, the author of that document was also heavily involved in the development of ASH at Oracle Corp. I probably learned a few things from reading that document. A couple of items that stood out, which are related to this blog article:
Slide: “ASH: Top IO SQL”
Slide: “ASH data gotcha’s”: Beware of Obj#, File#, Block# (not cleared); Wait time vs Time waited
Slide: “ASH: Bad SQL”: select sum(a.time_waited) total_time
Slide: “ASH: WAIT_TIME vs TIME_WAITED”
The “ASH: WAIT_TIME vs TIME_WAITED” slide certainly explains the Excel screen captures that appear in the blog article (except for the unit of measure inconsistency).
Using COUNT rather than SUM… I should have thought of that, but then that is part of the reason why reader comments are requested for my blog articles 🙂
Charles,
Yes, Graham Wood is one of God Fathers of ASH. Another one John Beresniewicz prepared nice paper as well – http://ashmasters.com/jb-zone/average-active-sessions/ – take a look on DB Time calculation based on ASH or take a look on Kyle’s Hailey scripts http://ashmasters.com/ash-queries/
ASH is generating sample every second so if your session was locked for let say 10 s with any event event you will see 10 rows in ASH with same session_id and sql_id – of course we can’t be sure if all samples are related to same execution of SQL_ID but from statistic perspective we can ignore it
select decode(session_state,’WAITING’,event,’ON CPU’), count(*) cnt from v$active_session_history where session_id = x and sample_time > sysdate – 1/60/24 group by decode(session_state,’WAITING’,event,’ON CPU’) order by cnt
This query will show you most session utilization profile for last minute – wait time (cnt column) are not exact but if you will sample it for quite long period measure error related to sampling frequency will be decreasing over time.
regards,
Marcin
IMO Macin hit the nail on the head. ASH is good for counting and statistical approximation.
I basically stay away from timings in ASH, though they can be of occasional use.
I especially stay away from wait_time, which, as I understand it , should have been left out of ASH and is basically the basis for “On CPU” or “Waiting”.
TIME_WAITED can be used but keep in mind, wait times will be skewed towards the longer waits as shorter waits will be caught less frequently.
If one is concerned with the top objects then look at the counts of waits on those objects. Waits on objects are only going to be instrumented for 3 types of waits AFAIK
io waits – db file sequential read, db file scattered read\
buffer busy waits
enqueue tx waits
You also have to target not only the wait events, but actual waits as opposed to ASH rows that were for CPU but still have left over data in other columns from previous waits.
For IO, you could do something like
giving something like
For Object I/O by SQL how about
giving something like
(CH: Edited July 1, 2011 to adjust the ORDER BY columns per Kyle)
Kyle,
Wow, great answers and examples. In the second of the above SQL statements you have a column tcnt in the ORDER BY clause – is it OK to replace that column with aas?
Martin,
Thanks for putting together the summary. For some reason WordPress required me to authorize Kyle’s comments and your comment, thus delaying the posting of the comments by an hour, so that explains why your summary does not include any of the extra points made by Kyle.
—
No comments yet from anyone about SAMPLE_TIME being a TIMESTAMP column and any potential problems with comparing it to SYSDATE in the SQL statement at the start of this article? Are there potential issue… what about a remote DBA in another time zone?
>>is it OK to replace that column with aas?
yep, error in script , thanks for the catch. I think tcnt (total count) is a vestige from an original version before I shifted it to AAS
>>SAMPLE_TIME being a TIMESTAMP column and any potential problems with comparing it to SYSDATE
One trick that has help me in queries is
cast(sample_time as date )
not that it makes dealing with timezones any easier but it makes timestamp calculations easier
Hi Kyle,
Could you post the whole script with your “changes” according mentioned ?
Thank you
Charles,
I’ll ry a little summary so far:
* you showed how a _single_ wait can be SUMmed multiple times. Do you think the max WAIT_TIME or TIME WAITED in a direct series of ( equal EVENT _AND_ [SEQ# constant increasing] ) can solve that particular issue? (or at least reduce the risk)
* we know now, “ON CPU” is not an object
* the document refuses any information about current_obj#, it the wait is not part of “application, cluster, concurrency, and user I/O wait events”
* the documentation seems to be wrong in TIME WAITED as “hundredths of seconds”
* Niall is right, “objects don’t experience waits” – but let’s keept it similar to “hot blocks”?
* Marcin is right (even not stated directly): “the most wait time” can never be provided ba a sampling mechanism as ASH. Even unlikely, there could be a crazy high number of other waits on a single object just between the samples.
Do you expect any more comments?
Hi there
nice post to say the least …
I do have a website of mine on http://www.zion-dba.com with some charts done using AWR data
best to all
patrick boulay
(Edit C. Hooper August 3, 2011: Please note that comments on this blog MUST refrain from advertising commercial products. It does not appear that the intention of the above link is to advertise a commecial product, but instead to share a data analysis concept that could be implemented in Microsoft Excel. If it is determined that this comment is in fact a commercial advertisement, this comment will be treated as spam, and will disappear.)