Tuesday, July 28, 2009

Keeping notes (AKA your operational run book)

This morning I was reminded of the importance of keeping notes on the problems I encounter and the solutions that fixed them. A co-worker researching a problem came across my blog post through a Google search. It seems their environment is suffering from the same symptoms. Had I not kept a note about that issue, they would most likely have gone through the same level of effort I originally did to research and find the solution. Hopefully applying the fix resolves their problem (I think it will) and saves them the several hours or days it would otherwise take to get to resolution.

This brought to mind the importance of a run book. If you are not familiar with the term, a run book is basically an operations manual. It spells out what needs to be done for various tasks. For example, if we need to deploy a new application into production, the run book has a full set of instructions (a recipe, if you will) on what to do. Likewise, when a troubleshooter debugs a problem, they record in the run book what the problem was, the steps they took to determine root cause, the fixes they tried, and which one(s) finally worked. That way, should the problem recur on a different shift, the people who see it can refer to the run book and go through the same steps.

The run book does not have to address only technical details. It can also provide operational response instructions, like how to run a war room, who needs to be involved, when various organizations/teams are engaged, what triggers an engagement, etc. Every operational detail can be recorded in the run book.

Does your production environment have a run book?

Monday, July 20, 2009

Finding information (like tuning guides)

I frequently get queries about finding information on one topic or another (generally around performance or application monitoring). A co-worker in Australia was looking for some general WebSphere Web services tuning guides. IBM.com has a wealth of information on a variety of topics so I thought I would put together this post on how to search for information.

This particular Google search is one I used to find information for Web services tuning.

You'll note the key in the search string is the beginning "site:ibm.com" which restricts the search to only those items found at ibm.com.
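If you run this kind of search often, the query can be built in code. The following is a minimal sketch, not the exact search from this post: the class name `SiteSearch` and the search terms are my own placeholders, and the only parts taken from above are the `site:ibm.com` prefix and the fact that the search goes through Google.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class SiteSearch {

    /** Builds the raw query text, e.g. "site:ibm.com some search terms". */
    static String query(String site, String terms) {
        return "site:" + site + " " + terms;
    }

    /** Builds a Google search URL with the query form-encoded (Java 10+). */
    static String googleUrl(String site, String terms) {
        return "https://www.google.com/search?q="
                + URLEncoder.encode(query(site, terms), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The search terms here are an assumed example, not the original search string.
        System.out.println(googleUrl("ibm.com", "WebSphere Web services tuning"));
    }
}
```

Swapping the first argument restricts the search to a different site, which is handy when the information you want lives on another vendor's domain.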

Latest article

I forgot to post when my dW article came out. This is a new series I'm going to write on defensive architectures. Part 2 is being reviewed by my colleagues right now.

Thursday, July 9, 2009

Problem Determination - High CPU strategies on Windows OS

I've been quiet for the past couple of months, but that is because I have an article coming out on developerWorks soon. I'll post a link to that when it is ready.

A colleague called me today to talk about a high CPU scenario and the steps to take to try to resolve it. My first recommendation was to get thread dumps of the JVM when it hits the high CPU scenario. Then I found out the JVM is running on a Windows OS. Since it is not possible to execute a kill -3 on Windows, one has to use the script methodology instead, and with no CPU available it is tough to get a thread dump that way.

I suggested he collect thread dumps as the CPU starts to climb. This works if (a) they can predict when the CPU will start to climb and (b) it climbs slowly enough to collect javacores along the trajectory. It sounds like the CPU can spike in a matter of seconds, even at steady load volumes. Ugh.
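One way to take snapshots along the way without kill -3 is to ask the JVM for its own thread stacks through the standard `java.lang.management` API. The sketch below is not the javacore that kill -3 produces and not the script approach mentioned above; it is a swapped-in technique that assumes you can run a little code inside the affected JVM. The class name `InProcessThreadDump`, the snapshot count, and the five-second interval are all hypothetical choices of mine.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of WebSphere or any IBM tooling.
final class InProcessThreadDump {

    /** Returns a text dump of all live threads, highest cumulative CPU time first. */
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // A JVM may not support per-thread CPU timing, so check before relying on it.
        boolean cpuTimes = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();

        // false, false: skip monitor/synchronizer details to keep the snapshot cheap.
        ThreadInfo[] threads = mx.dumpAllThreads(false, false);

        // Record each thread's CPU time (-1 when unavailable) so we can sort by it.
        Map<Long, Long> cpuNanos = new HashMap<>();
        for (ThreadInfo t : threads) {
            long nanos = cpuTimes ? mx.getThreadCpuTime(t.getThreadId()) : -1L;
            cpuNanos.put(t.getThreadId(), nanos);
        }
        Arrays.sort(threads, Comparator.comparingLong(
                (ThreadInfo t) -> cpuNanos.get(t.getThreadId())).reversed());

        StringBuilder out = new StringBuilder();
        for (ThreadInfo t : threads) {
            long nanos = cpuNanos.get(t.getThreadId());
            out.append('"').append(t.getThreadName()).append("\" id=").append(t.getThreadId())
               .append(' ').append(t.getThreadState())
               .append(" cpuMs=").append(nanos < 0 ? "n/a" : String.valueOf(nanos / 1_000_000))
               .append('\n');
            for (StackTraceElement frame : t.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Take three snapshots five seconds apart (both values are arbitrary).
        for (int i = 0; i < 3; i++) {
            System.out.println("=== snapshot " + (i + 1) + " ===");
            System.out.print(dump());
            Thread.sleep(5_000);
        }
    }
}
```

Comparing the cpuMs values between consecutive snapshots shows which threads are burning CPU, and their stacks show where. The same caveat applies as with any other method: if the CPU spikes in seconds, the snapshots have to start before the spike does.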

My final thought is to cut the thread pool maximum in half and rerun the test. Perhaps there are simply too many threads executing, and nothing will prevent the CPU from going to 100%. Then again, I have seen code that can drive even an 8-way to 100% with just a few threads. I'm waiting to hear back, but I think reducing the thread pool maximum will be the best course of action for them. I expect an email in the morning.