Battling the Symptoms or Addressing the Root Cause

April 3, 2010

A non-Oracle specific question arrived in an email from an ERP mailing list – I think that the user who wrote the email was probably running SQL Server, but that probably does not imply much other than potential differences in read-consistency and trigger code when compared to a user running the same ERP package with Oracle Database.

Paraphrasing the question:

I need to be able to automate the running of a utility (VMFIXOHQ) on a Windows client computer.  The utility does not offer a command line interface for specifying parameters, so the method of automation must be able to enter text into screen fields, click program buttons, and activate menu items so that the utility will automatically run on a nightly basis.  A program named Automate is able to accomplish this task, but is too expensive for this specific task.

I have in the past written task schedulers that would do exactly what the author of the email requested, but I did not offer one to the original poster.  Why?  That is hard to describe, so, with a lead-in, I offered the following analogy:

IT Guy: “Doctor, I have a splitting headache and I seem to be having trouble remembering things.”
Doctor1: “Let me write you a prescription for a new desk with a padded writing surface.”
IT Guy: “OK, but I don’t see how that will help my headache.”
Doctor1: “You just told me that you only have splitting headaches while sitting at your desk.”
IT Guy: “That new desk works great.  The headaches still happen, but don’t last quite as long.”
Doctor1: “Now let me prescribe something to cure the imprint of the paperclip and the stapler on your forehead.”
Doctor2: “Wouldn’t it be easier to find out why he keeps banging his head on his desk?”

In the above analogy, Doctor1 was treating the symptoms of the problem.  Maybe he noticed the IT guy’s red forehead, and thought that if the IT guy must bang his head on his desk, he really should have a softer surface for his forehead to hit.  Once the original problem was mitigated, a secondary, related problem remained – obviously, the IT guy should make certain to clear his desk before banging his head.

Doctor2, on the other hand, suggested a root cause analysis.  If the IT guy is banging his head on his desk, determine what triggers the IT guy to bang his head on his desk.  Maybe he can’t find the flyswatter.  Maybe he once hit his head and then by coincidence found a solution to a perplexing problem.  Maybe he is frustrated (he might be wearing shoes that are too small, causing his feet to hurt).  Maybe someone is forcing him to bang his head.  Wouldn’t it be better to find out why, rather than just trying a number of things that might make the problem less severe, but never actually fix the problem?

An Oracle database example of this is simply throwing hardware at a performance problem because a root cause analysis is perceived as requiring too much time and being too expensive (computer hardware costs are decreasing while at the same time IT labor costs are increasing).  Sure, replace the server with one having 4 times as many CPUs and 4 times as much memory – after all, hardware is cheap compared to the perceived cost of a root cause analysis (at least that is what it says on the news).  Forget that such a cheap upgrade will require 4 times as many Oracle Database CPU licenses, accompanied by 4 times as much in annual Oracle support/maintenance fees.  On second thought, maybe a root cause analysis is really a much better and less costly approach, whether the performance problem is caused by a change to daylight saving time, someone verbally abusing the SAN, an upgrade of the Oracle Database version, or something else.

It might seem that I drifted a bit from the topic of the email that arrived from the ERP mailing list about scheduling the execution of the VMFIXOHQ utility.  That utility is not one that should be run daily, not one that should be run weekly, not one that should be run monthly, and not even one that should be run yearly (this doesn’t sound like anything in the Oracle Database universe, does it?).  That utility has a very specific purpose – it fixes the results of application and/or database trigger bugs that caused the stored on hand inventory counts for a specific part to differ from what is sitting on the shelf.  More accurately, a transaction is recorded in the database whenever parts are added to inventory, removed from inventory, or moved from one warehouse location to another, and the on hand inventory counts for the parts are adjusted accordingly.  This VMFIXOHQ utility runs through these transactions from day 1 and effectively determines how many of the part should be sitting on the shelf based on the supporting inventory transactions.  Scheduling the running of the VMFIXOHQ utility does not address the real reason for the inventory counts being inaccurate; rather it is a band-aid (a padded desk, if you will) for the real problem – a code bug, missing trigger, improperly handled deadlock, or multi-session read-consistency issues.
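
To make the idea of replaying inventory transactions concrete, here is a minimal sketch in Python.  It is not the VMFIXOHQ utility's actual logic, which I have not seen; the function names, the tuple layout (part ID, warehouse location, quantity change), and the sample data are all hypothetical.  Rather than overwriting the stored counts, the sketch only reports which parts disagree, which is the sort of evidence a root cause analysis would start from.

```python
from collections import defaultdict


def recompute_on_hand(transactions):
    """Replay every transaction from day 1 and total the quantity per part.

    Each transaction is a hypothetical (part_id, location, qty_change) tuple.
    A move between warehouse locations is assumed to be recorded as two rows
    (a negative change at the source, a positive change at the destination),
    so it nets to zero for the part as a whole.
    """
    totals = defaultdict(int)
    for part_id, _location, qty_change in transactions:
        totals[part_id] += qty_change
    return dict(totals)


def find_discrepancies(stored_counts, transactions):
    """Return {part_id: (stored, recomputed)} for parts whose counts differ."""
    recomputed = recompute_on_hand(transactions)
    parts = set(stored_counts) | set(recomputed)
    return {
        part: (stored_counts.get(part, 0), recomputed.get(part, 0))
        for part in sorted(parts)
        if stored_counts.get(part, 0) != recomputed.get(part, 0)
    }


if __name__ == "__main__":
    # Made-up sample data: BOLT's stored count has drifted by 5 units.
    sample_transactions = [
        ("BOLT", "A", 100),  # receipt into location A
        ("BOLT", "A", -30),  # issue from location A
        ("BOLT", "A", -5),   # move out of location A ...
        ("BOLT", "B", 5),    # ... and into location B
        ("NUT", "A", 50),    # receipt into location A
    ]
    sample_stored = {"BOLT": 75, "NUT": 50}
    print(find_discrepancies(sample_stored, sample_transactions))
```

Logging the discrepancies each time they appear, instead of silently correcting them every night, at least shows which parts drift and when, and that narrows the search for the code bug, missing trigger, or locking problem behind the drift.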

Was I wrong not to tell the original poster how to schedule the running of this utility?  :-)


3 responses

7 04 2010
joel garry

Were you wrong? Quite possibly :-)

I have an analogous problem in an AR system, bug “fixed in the next release” for several major releases now. Still cheaper to run the special utility on the weekend than fix a problem that is fairly difficult, obscure and most importantly, not much business impact. Sometimes “good enough” actually is. To continue your analogy, Dr. 2 finds evidence of a major inoperable stroke… better to find out, but only because it could perhaps have found a less severe issue.

I also have inventory issues quite similar to what you describe, but they tend to be like squeezing a balloon as versions are increased, slowly deflating and changing in shape, so we don’t have to run the utility fixes any more, but sometimes have to go in and fix things manually – and most come from user error (trying to fix things) piling on user error (not following directions). A smaller number come from forcing MS style locking code on an Oracle db.

As far as the Oracle performance, I totally agree, except to note that sometimes the system is well tuned and actually benefits from more horsepower. This tends to make the “throw hardware at it” people look like geniuses, and if the presupposition is reasonable (as in, someone competent has actually been keeping the system tuned as the transaction volume increases, and the capacity planning was accurate in the first place), that look isn’t unreasonable.

8 04 2010
Charles Hooper

Joel,

Interesting comments you provided. Yes, there could be much more going on in the analogy that I provided. Maybe the IT guy is not beating his head on his desk out of frustration, but Doctor1 would not know about that until he writes the IT guy a prescription for a wide-brimmed hat to prevent the guy’s forehead from slamming into the desk, writes a prescription for a comb to deal with “hat head”, and then discovers that the comb did not help solve anything related to the original problem.

I had a chance to play Doctor1 yesterday on a BlackBerry phone that not only stopped receiving emails from the BlackBerry Enterprise Server (BES), but stopped displaying the 50+ emails that were on the user’s desktop computer. Not a problem (even though there is no simple resync option THAT WORKS), I will just decommission/lock the BlackBerry, delete the user from the BES server, re-add the user to the BES server, reassign the BlackBerry to the user using BES, set the default IT policy for the BlackBerry, and we are in business again.

To the user: “Look, it received the test email I just sent.” The user: “What about the 50+ emails that are on my desktop?” Wash, rinse, repeat 5 or more times. To the user: “Look, it now shows 50+ emails and your calendar.” The user: “What about the blank task list, the blank contact list, the pictures that were on the phone, the saved password stored in the phone’s password wallet, the speed dials, the special font size, the reprogrammed buttons, the relocated icons, the …” To the user: “But at least you are able to view your emails (the original problem), and the phone still works as a PHONE.”

So, what might Doctor1 do next?  You guessed it, shoot in the dark and disable synchronization of the user’s task list for 12 hours to see if it helps, or maybe see how the user would look wearing a wide-brimmed hat. :-)
