Welcome to Business Management


Wednesday, October 10, 2007

 

Occam's Razor at 2am - Help Desk Escalation

Occam’s Razor at 2am (Incident Management)
The Principle:
Occam’s Razor is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory. Many people have heard it phrased more commonly this way: “All things being equal, the simplest solution tends to be the right one,” or alternately, “we should not assert that for which we do not have some proof.” In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam’s Razor is usually understood.
Now for the story:
It happens when you least expect it, usually while you are sleeping (true for most Problem Escalation Teams).
You get the call at 2am that something is not working, and you have to dial in and join a call.
By this time, a series of people have already attempted to resolve the problem. It is very likely that they have tried the simple things and that, if the application has been down for more than several hours, they have moved on to more complex solutions. I have found that on these calls we typically fail to answer 4 questions.
1. When was the last time this was working correctly?
a. In my line of work, it was usually working within the last 12-24 hours.
b. It’s relevant because things don’t break for “no reason”. The cause may not be known, but it usually comes from an action, or the omission of an action.
c. Is it working correctly in some locations and not others (e.g., the Web-based architecture is broken, but local networks are up)?
2. What Incidents were opened today (check all resources)?
a. We had several different queues, and people who helped in different locations.
b. Call ANY resolver and ask them if they touched anything today.
3. What upgrades or implementations occurred or were ATTEMPTED?
a. These can contribute to problems that were missed in testing.
b. Attempts can cause breaks on their own, and a change that is not rolled back, or not rolled back correctly, can cause unknown issues.
4. When was the last time this server was rebooted?
a. Windows patching can cause issues, since the testing on these patches is not rigorous.
These 4 questions usually lend themselves to resolution. At one 2am call, the IT team had been working for an extended time (13 hours) and was getting ready to roll back patches from 2 weeks earlier when I joined the call. I asked the four questions mentioned above and found some compelling information.
It was at question 2 that we took a step toward resolution. Earlier that day, someone had opened a ticket where the root cause of the incident was a missing .exe. The Resolver did nothing wrong by replacing the missing .exe; he resolved the incident as he should have. But if one file had gone missing, it was worth checking whether others had too.
I asked our IT guys to run a directory compare of the .exes and .dlls between a working app (at another site) and the broken app. We found 3 files missing. We copied them back in, and magically things started working again.
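For anyone who wants to script that kind of directory compare, here is a minimal sketch in Python. It is not the tool we used that night; the function name `missing_binaries` and the example paths are made up for illustration, and it only checks whether a file exists, not whether its contents or version match.

```python
from pathlib import Path


def missing_binaries(working_dir, broken_dir, extensions=(".exe", ".dll")):
    """List binaries found under working_dir that are absent under broken_dir.

    Hypothetical helper for illustration. Paths are compared relative to
    each root and lowercased, since Windows file names are case-insensitive.
    """
    def binaries(root):
        root = Path(root)
        return {
            p.relative_to(root).as_posix().lower()
            for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() in extensions
        }

    # Set difference: present in the working install, missing from the broken one.
    return sorted(binaries(working_dir) - binaries(broken_dir))
```

Calling something like `missing_binaries(r"\\site-a\apps\myapp", r"\\site-b\apps\myapp")` (both paths are placeholders) would return the relative paths of any .exe or .dll files that the working install has and the broken one lacks.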
These 4 questions have helped me immensely, and they also help focus where to start looking. In effect, everyone is looking for what changed. The questions refine the search and bring folks into the loop on what occurred. It is my contention that after a few hours of working a problem we tend to go deeper, when in reality we might want to go shallower and get back to basics.
