Tuesday, August 18, 2009

Logs, log identification, log monitoring

Logs are the ultimate source of information about the health of an application. Logging data helps you understand whether something has gone wrong and what might have gone wrong. The problem is that there are two sides to logging.

Developers provide the first level: the log statements themselves. I have always found a major/minor log code (major = error/warn/info/trace; minor = module/method/logic) to be the most straightforward way to provide log information. It ties each log message being recorded to a specific category.
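To make that concrete, here is a rough sketch in Java of what coded logging could look like. The code format (a severity word plus a minor code like ORD-0042) and the logger name are made up for illustration; any consistent scheme works as long as it is documented:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class CodedLogger {
        private static final Logger LOG = Logger.getLogger("orders");

        // Major code maps to severity, minor code identifies module/method/logic.
        static void log(Level major, String minorCode, String message) {
            // Prefix every entry with a machine-readable code, e.g. "SEVERE ORD-0042: ..."
            LOG.log(major, major.getName() + " " + minorCode + ": " + message);
        }

        public static void main(String[] args) {
            log(Level.INFO, "ORD-0001", "Order service started");
            log(Level.SEVERE, "ORD-0042", "Unable to reach payment gateway");
        }
    }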

The operations side of the house can then monitor the logs and, based on the known major log codes, decide whether they need to take action. A major code that indicates an error can be dispatched immediately to a field technician, while other levels can be monitored, or collected when in troubleshooting mode.
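On the monitoring side, a simple scan keyed off the major code is enough to decide what happens to each entry. Another rough sketch, reusing the made-up code format from above (the log file name is also just an assumption):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LogWatcher {
        public static void main(String[] args) throws IOException {
            // Scan a log file and route entries by their major code.
            try (BufferedReader in = new BufferedReader(new FileReader("orders.log"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains("SEVERE ")) {
                        System.out.println("DISPATCH: " + line);   // page the field technician
                    } else if (line.contains("WARNING ")) {
                        System.out.println("WATCH: " + line);      // keep an eye on it
                    }
                    // INFO/trace-level entries are only collected in troubleshooting mode.
                }
            }
        }
    }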

However, if an application development team does not provide well-defined codes for log messages and provides only free-form text, then the operations/runtime side has difficulty determining whether a specific log entry is a problem or not. Unfortunately, most application development teams choose free-form text logging, which makes runtime operations that much more difficult and, in many cases, confusing.

Tuesday, July 28, 2009

Keeping notes (AKA your operational run book)

This morning I was reminded of the importance of keeping notes about the problems I encounter and the solutions that provided the fixes. A co-worker of mine researching a problem came across my blog post through a google search. It seems their environment is suffering from the same symptoms. Had I not kept a note about that issue, they would most likely have gone through the same level of effort I originally did to research and find the solution. Hopefully the fix resolves their problem when they apply it (I think it will), and they won't have had to spend several hours or days getting to resolution.

This brought to mind the importance of a run book. If you are not familiar with the term, a run book is basically an operations manual. It spells out what needs to be done for various tasks. For example, if we need to deploy a new application into production, the run book has a full set of instructions (a recipe, if you will) on what to do. Likewise, when the troubleshooter debugs a problem, they record in the run book what the problem was, the steps they took to determine root cause, the fixes they tried, and which one(s) finally worked. This way, should the problem recur and a different shift of people see the same symptoms, they can refer to the run book and go through the same steps.

The run book does not have to address only technical details. It can also provide operational response instructions: how to run a war room, who needs to be involved, when various organizations/teams are engaged, what triggers an engagement, and so on. Every operational detail can be recorded in the run book.

Does your production environment have a run book?

Monday, July 20, 2009

Finding information (like tuning guides)

I frequently get queries about finding information on one topic or another (generally around performance or application monitoring). A co-worker in Australia was looking for some general WebSphere Web services tuning guides. IBM.com has a wealth of information on a variety of topics so I thought I would put together this post on how to search for information.

This particular google search is one I used to find information for Web services tuning:

site:ibm.com websphere web services tuning

You'll note the key in the search string is the beginning "site:ibm.com", which restricts the search to only those items found at ibm.com.

Latest article

I forgot to post when my dW article came out. It is the first in a new series I'm writing on defensive architectures. Part 2 is being reviewed by my colleagues right now.

Thursday, July 9, 2009

Problem Determination - High CPU strategies on Windows OS

I've been quiet the past couple of months, but that is because I have an article coming out on developerWorks soon. I'll post a link to it when it is ready.

A colleague called me today to talk about a high CPU scenario and the steps to take to try to resolve it. My first recommendation was to get thread dumps of the JVM when it hits the high CPU condition. Then I found out the JVM is running on a Windows OS. Since it is not possible to execute a kill -3 on Windows, one has to use the script methodology, and with no CPU available it is tough to get a thread dump.

I suggested he collect thread dumps as the CPU starts to climb. This works if they can (a) predict when the CPU will start to climb and (b) it climbs slowly enough to collect javacores along the trajectory. It sounds like the CPU can spike in a matter of seconds even at steady load volumes. Ugh.
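One way to approximate this from inside the JVM is a background thread that writes every thread's stack to a timestamped file every few seconds, so the stacks are already on disk when the spike hits. This is only a sketch of the idea (the interval and file naming are arbitrary), not what we actually ran:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Map;

    public class ThreadDumper implements Runnable {
        public void run() {
            try {
                while (true) {
                    dump("threaddump." + System.currentTimeMillis() + ".txt");
                    Thread.sleep(5000);   // every 5 seconds; tune to the spike behaviour
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Write the stack of every live thread to a timestamped file.
        static void dump(String fileName) throws Exception {
            PrintWriter out = new PrintWriter(new FileWriter(fileName));
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                out.println("\"" + e.getKey().getName() + "\" state=" + e.getKey().getState());
                for (StackTraceElement frame : e.getValue()) {
                    out.println("    at " + frame);
                }
                out.println();
            }
            out.close();
        }

        public static void main(String[] args) {
            Thread t = new Thread(new ThreadDumper(), "thread-dumper");
            t.setDaemon(true);
            t.start();
            // ... the rest of the application runs here ...
        }
    }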

My final thought was to cut the thread pool max in half and rerun the test. Perhaps there are simply too many threads executing, in which case nothing will prevent it from going to 100% CPU. Though I have seen code that can drive even an 8-way to 100% with just a few threads. I'm waiting to hear back, but I think the latter strategy of reducing the thread pool max will be the best course of action for them. I expect an email in the morning.
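For them the change is in the server's thread pool settings, but the principle is the same as capping any worker pool: fewer runnable threads means less contention for the CPUs. A toy Java illustration (the pool sizes and job counts are made-up numbers):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class CappedPool {
        public static void main(String[] args) throws InterruptedException {
            // Suppose the pool previously allowed 50 concurrent worker threads;
            // halving it to 25 caps how much runnable work competes for the CPUs.
            ExecutorService pool = Executors.newFixedThreadPool(25);

            for (int i = 0; i < 1000; i++) {
                pool.execute(new Runnable() {
                    public void run() {
                        long x = 0;                       // simulate a CPU-bound unit of work
                        for (int j = 0; j < 1000000; j++) x += j;
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
        }
    }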

Thursday, March 26, 2009

Predicting Capacity

I have had a number of calls around "We have an application, not built yet, but we want to know if we have enough capacity and/or will the application scale/perform in our environment."

My analogy to this is the following. Someone shows me the blueprint to a Formula 1 (or NASCAR) race car. They ask me "will this car win the race?"

While the car being proposed may have all the right components (wheels, engine, gearbox, etc.), there are other unknown variables that determine whether the car can win a race. Without extensive testing there's no way to know if all the components are assembled correctly or if they are configured to work together in an optimal manner (performance tuned).

Related to this is the ability of the car to both get off the starting line and finish the race (a prerequisite for winning). Again, there are a number of variables that impact this. First is a catastrophic failure in one of the components. Then there is the skill of the car's driver. Then there are the other drivers on the same racetrack who might crash into our car.

There is simply no way to predict if the car can win much less finish a race.

The same attitude has to be taken when planning your production IT environment. You can't predict behaviour. You can, on the other hand, conduct stringent system integration and performance testing in order to see if your application will (or will not) perform as expected in production.

This is why everyone should test as early in the development cycle as possible. I was working with one application that had been in production for several months with constant, recurring failures. There were fundamental design problems in how the application was developed. It took us four more months to fix those design problems and get the final, new version of the application into production. Don't put off testing. Do it right and do it early. Test early, test often. That is the key to success in production.

Wednesday, February 18, 2009

Too many testers?

Wow. I never thought I'd see a statement so outrageous. Someone, in this case Microsoft, has too many testers? Can this be true? Is this why Windows needs fix after fix every week: their testing is so good, and they have so many testers, that they can afford to get rid of the excess?

I can't believe that is even remotely true. Most organizations suffer from not having enough testers. This is why applications deployed in production suffer: something had to be cut from the test plan because the team either couldn't test it or didn't have the time or people.

A technology enterprise should not cut technical folks out of their organization. 

I, Cringely - Cringely on technology
"There are too many testers"