Alexandre Polozoff on WebSphere Performance

Backup Strategies

2011-03-30T07:03:00.000-07:00

I was having a discussion recently with one of my colleagues around server backups. He likened it to the spare tire of the car. Yes, you can drive around without the spare tire but does anyone want to do that for long stretches of time? Probably not.

Servers need to be backed up. Simply put, if a server completely tips over, a backup allows it to be duplicated, rebuilt and put back into service.

tprof switches

2011-02-07T12:32:00.001-08:00

I wanted to capture this before I forgot about it ... tprof switches

Cognos

2010-11-30T06:10:00.000-08:00

Looking for information about Cognos? I was. My colleague in the UK, Richard Collins, was able to point me to some informational links about Cognos.

For full details on this product, see its home page here: http://www-01.ibm.com/software/analytics/cognos/business-intelligence/

However, your most useful starting point is probably going to be: http://www-01.ibm.com/support/docview.wss?uid=swg27014432

...and in particular, this one: http://www-01.ibm.com/support/docview.wss?uid=swg27019126

global transaction sharing

2010-11-09T17:49:00.000-08:00

In the developer's resource references there is an attribute for global transaction sharing. The value should be Shareable if the application uses global transactions. But if one is using LTC then one does not need to share the connection. Change the res-sharing-scope to Unshareable as noted above. This eliminates a lot of contention for connection pool threads. For more details on LTC see the link above to understand how it works.
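
For reference, the attribute lives on the resource reference in the deployment descriptor. A sketch of what an unshareable reference might look like in web.xml, assuming a JDBC data source; the JNDI name jdbc/MyDataSource is a placeholder and not from the original post:

```xml
<resource-ref>
  <!-- Placeholder JNDI name; use the name your application looks up -->
  <res-ref-name>jdbc/MyDataSource</res-ref-name>
  <res-type>javax.sql.DataSource</res-type>
  <res-auth>Container</res-auth>
  <!-- Shareable is the default; Unshareable suits LTC-only applications -->
  <res-sharing-scope>Unshareable</res-sharing-scope>
</resource-ref>
```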

The importance of dynamic circuit breakers

2010-09-24T05:24:00.000-07:00

It is rare for anyone to provide details behind the root cause of a production outage. Facebook put out a report about an outage they had. If you're into troubleshooting and problem determination it is an interesting read. It sounds like they could turn off the particular function but had to completely restart the environment to do it. This is why it is important to have circuit breakers that can be activated dynamically.

One also wonders what infrastructural changes could be made in the environment to help. It sounds like the application logic continued to retry requests. This is why I'm not a fan of applications automatically retrying requests: when failure occurs the retries can quickly overwhelm the back-ends. A firewall could have at least helped shut off the pipe to the database, though the consequences to the application would have been no different and would still have required a restart since there seemed to be no way to dynamically shut off that particular function.

Certainly the error logic sounded confusing at best. And error paths through code are the ones least frequently tested so they tend to fail magnificently in production.
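
To make the idea of a dynamically activated circuit breaker concrete, here is a minimal sketch. The class name FeatureBreaker and its methods are hypothetical, and in a real environment the switch would more likely be flipped through an administrative interface than through a direct method call. The point is only that the check happens on every request, so the function can be shut off without a restart and without retrying against a failing back-end.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch: a switch that operations can flip at runtime
// to shut off one function without restarting the environment.
final class FeatureBreaker {
    private final AtomicBoolean open = new AtomicBoolean(false);

    // Open the breaker: calls are short-circuited from now on.
    void trip() {
        open.set(true);
    }

    // Close the breaker: calls flow to the back-end again.
    void reset() {
        open.set(false);
    }

    boolean isOpen() {
        return open.get();
    }

    // Run the back-end call only while the breaker is closed; otherwise
    // return the fallback immediately, with no retry against the back-end.
    <T> T call(Supplier<T> backend, Supplier<T> fallback) {
        return open.get() ? fallback.get() : backend.get();
    }
}
```

Because the flag is read on each call, tripping the breaker takes effect on the next request.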

grep Exception SystemOut.log | wc -l

2010-09-14T15:46:00.000-07:00

I'm reminded today that a simple check exists after performance tests, especially when adding more JVMs to a cluster. Count the number of exceptions in the log files. Of course, clear the logs before running the test. If the counts are not all roughly the same (that is, one JVM is significantly skewed from the other app servers) then it is clear there are issues with that JVM that need to be checked. Sometimes it is configuration or a misplaced JAR file.
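
The same count can be taken across every cluster member's log in one pass. This is a minimal sketch, assuming the log paths are passed on the command line; the class name ExceptionCounts is made up for illustration. It counts lines containing "Exception", which is what the grep in the title does, and prints one count per file so a skewed JVM stands out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class ExceptionCounts {
    // Same result as `grep Exception <file> | wc -l`:
    // the number of lines that contain the text "Exception".
    static long countExceptionLines(Stream<String> lines) {
        return lines.filter(line -> line.contains("Exception")).count();
    }

    // ISO-8859-1 accepts any byte sequence, so odd bytes in a log
    // do not abort the count.
    static long countInFile(Path log) throws IOException {
        try (Stream<String> lines = Files.lines(log, StandardCharsets.ISO_8859_1)) {
            return countExceptionLines(lines);
        }
    }

    // Usage: java ExceptionCounts.java server1/SystemOut.log server2/SystemOut.log
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + "\t" + countInFile(Paths.get(arg)));
        }
    }
}
```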

Been a busy 2010

2010-06-08T04:33:00.000-07:00

I know I haven't kept up with the blog this year. While the blog post I'm linking to today makes no mention of performance it has everything to do with performance. Maybe one day I'll get a chance to sit down and explain my thinking.

Catching up on closing up 2009

2009-12-24T06:53:00.000-08:00

It has been a busy few weeks for me which is one reason I haven't posted in a while. Some things I've been thinking about lately revolve around high availability, switches and logical partitioning of applications and application servers. I'll try to post more on these subjects over the next few weeks.

Hope you all have a great holiday season and a Happy New Year. I'll see you on the other side of 2010!

The people side of performance

2009-11-10T09:18:00.001-08:00

Those of you that read my blog know that I tend to primarily address technical topics when it comes to performance. The latest issue of Communications of the ACM (10/2009 vol 52 No 10) has an article called "The Business of Software: Contagious Craziness, Spreading Sanity" by Philip G Armour. I was intrigued with some of the points Philip brought up in his article. Particularly this tidbit:

"... having a strong conviction that something cannot be done is usually a self-fulfilling prophecy. If people are convinced that something is not achievable, then they usually won't achieve it - if we argue for our limitations, we get to keep them."

Wow! Nail on the head. So how do we convince an organization that something that is currently perceived as impossible actually is quite achievable?

First off, people have to be encouraged to think outside the box. George Patton is quoted as saying "If everyone is thinking alike, someone isn't thinking." Coming from the battlefield I think we can understand why the General said this. He needed his commanders to not only visualize the current plan of operations but also what alternatives and/or competing strategies were available. One of my friends, Keys Botzum, who also happens to be an IBM STSM, once put it this way: "The patient is dying on the table. What can we do, right now, to keep him from dying." It is a different perspective for techies to mull over.

Likewise, a technical team cannot be led by consensus. Technical teams have to be led by dictatorship. Someone, at the top, needs to make technical decisions and take responsibility for those decisions. Right or wrong someone, like General Patton, has to tell the troops which way to charge. He can take input from his lieutenants; however, the decision is ultimately his and his alone to ensure it is the correct decision. Which is why the other side, led by mad men who knew nothing about how to fight a war, lost. The General knew what he was doing and knew how to lead his people to victory.

Which brings us to our next topic. In order to be able to lead a technical team the leader has to be technical. I understand management's philosophy that they are managing people and not the technology. The problem is that technology, by its very nature, is complex. This is simply unavoidable. We're not managing a bunch of lawyers (sorry to pick on lawyers, you're used to it by now, but they are a good example). What lawyers do is not very hard to comprehend. They sue people. They defend people. They argue in court for their client. They draw up contracts, etc. None of this is rocket science. Start running multiple servers in parallel handling thousands of transactions per second and that organization is starting to approach NASA-like technical complexity issues. In order for a technical team to succeed they must be led by a strong technical person. It is unreasonable to expect someone who does not comprehend the subtleties of technical subjects such as "verbose GC" or "JDBC deadlocks" to be capable of making a technical decision, much less the correct one.

In conclusion, back to the "self-fulfilling prophecy" and "that something is not achievable" that started this line of thinking for me: there are two ways to avoid failure. One is to have a strong technical leader who can take the reins as the technical General who makes the decisions. The other is for executive management to support the technical leaders. This means not only vocal agreement but also funding for the project, making people available to do the tasks and pulling any other strings that need to be pulled in order to make things happen.

I'll have more to say on this subject later as this is about all I can muster in the limited time I had for this "lunch & learn" session.

Managing the application threads

2009-11-05T07:52:00.000-08:00

Threads and multi-threaded programming are not trivial subjects. But some application developers find themselves having to start their own threads as part of their application.

I'll start this post off saying that using unmanaged threads is never a good thing. When using threads developers should be looking at how to manage all their threads in their application. Never exit the application without terminating all of your threads. Why would we need to do this?

An interesting problem occurs when an application is stopped (not the JVM/app-server but just the application), a new version is deployed, but the restart fails because there are still threads hanging around from the previous version of the application and older versions of the same classes. I never would have imagined this problem occurring but I can see that if the application lost track of its threads it would not be able to close them when it shut down.

Java provides the capability to manage all the application threads by using thread groups. When the application assigns each thread it creates to a thread group, all it has to do is capture the "application stop event" and use the thread group to stop every thread in the group. Note that the stop method of ThreadGroup is deprecated as unsafe, so interrupting the group and letting each thread exit cooperatively is the safer route.

The beauty of this solution is that the application can reliably manage all of the threads it creates through a single interface. It is also portable which means your application will work with any application server.
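
A minimal sketch of the thread group approach follows. It swaps ThreadGroup.interrupt() in for the stop method, since stop is deprecated as unsafe, and it assumes every worker responds to interruption. The class name ManagedWorkers, the group name and the join timeout are hypothetical.

```java
// Hypothetical helper: every thread the application starts goes through
// start(), so all of them live in one ThreadGroup.
final class ManagedWorkers {
    private final ThreadGroup group = new ThreadGroup("app-workers");

    Thread start(Runnable task) {
        Thread worker = new Thread(group, task);
        worker.start();
        return worker;
    }

    // Call this from the application stop event. It interrupts every
    // thread in the group, then waits up to joinMillis for each to exit.
    void stopAll(long joinMillis) {
        group.interrupt();
        // activeCount() is only an estimate, so leave some slack.
        Thread[] live = new Thread[group.activeCount() + 8];
        int count = group.enumerate(live);
        try {
            for (int i = 0; i < count; i++) {
                live[i].join(joinMillis);
            }
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and stop waiting.
            Thread.currentThread().interrupt();
        }
    }
}
```

In a web application, a ServletContextListener's contextDestroyed method is one plausible place to call stopAll, though where the stop event surfaces depends on the application type.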

High availability through parallel cells

2009-11-05T06:06:00.000-08:00

High availability sites have unique requirements that are not easy to solve. However, one solution does revolve around using multiple cells in production. This allows for easily taking a cell out of rotation to make changes. Then when the cell goes live and if the changes didn't work it is just as easy to take the cell out of rotation again. I've used this strategy at many, many 24 x 7 sites.

Here is the link to my latest article on multiple cells.

The only problem that arises is if things like database schemas change in the backend which also means application code changes. Those kinds of changes require more work and possibly parallel schemas until the transition is complete.

Running out of disk space

2009-10-19T13:50:00.000-07:00

An event occurred this morning that reminds me that performance problems are sometimes environmentally driven. One application went down with various errors related to the file system. After some investigation it was discovered that /tmp had run out of space. Yet no one knew it had happened. It turns out that no monitors were in place to watch for resource exhaustion like low disk space.

One could ask why /tmp filled up if daily (automated) housekeeping tasks were in place (they were not). But that is a pointless argument if someone moved a whole bunch of files into /tmp. They could have moved files in there to install them, intending to delete them when the install was done. Or they could have been log files being transferred to another environment via ftp. In any case, an automated monitoring tool should have alerted the operators that disk space was close to exhaustion and that they needed to take action to remedy the situation.

If automated monitoring is not in place then the production applications that make use of /tmp space will fail causing an outage. And without alerts about low disk space it can take several hours for the operations and troubleshooting teams to figure out that the application failed because it could not write out a temporary file. A several hour outage that could have been completely avoided had the right resource monitors been in place.

Other resource monitors should be looking at CPU utilization, page file activity, etc.
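
A production site would normally rely on its monitoring product for this, but the check itself is small. The sketch below is illustrative only: the class name DiskSpaceCheck and the 10% threshold are assumptions, and the alert is just a line on standard error where a real monitor would page someone.

```java
import java.io.File;

final class DiskSpaceCheck {
    // Percentage of the partition that is still usable, from 0 to 100.
    // A total of 0 (for example, a path that does not exist) counts as 0% free.
    static double percentFree(long usableBytes, long totalBytes) {
        return totalBytes <= 0 ? 0.0 : 100.0 * usableBytes / totalBytes;
    }

    static boolean isLow(File dir, double minPercentFree) {
        return percentFree(dir.getUsableSpace(), dir.getTotalSpace()) < minPercentFree;
    }

    // Usage: java DiskSpaceCheck.java /tmp   (run on a schedule, e.g. from cron)
    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "/tmp");
        double threshold = 10.0; // assumed alert threshold, in percent free
        if (isLow(dir, threshold)) {
            System.err.println("ALERT: " + dir + " has less than " + threshold + "% free");
        }
    }
}
```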

IPv6

2009-09-30T08:36:00.000-07:00

Last year I blogged about handling IPv6 lookups that caused threads to wait, thus slowing them down. Not surprisingly I've come across this a few times since then. What is interesting is the effect it can have on throughput and response time. Testing showed that once -Djava.net.preferIPv4Stack=true was enabled the test had 8% improvements in both throughput and response time. This is truly interesting data. That is a big difference and a good reason to periodically take thread dumps even if everything seems to be running nominally. One never knows when they may uncover a gem of a performance boost like this one.

PMI = all is not a good setting for production

2009-09-24T09:34:00.000-07:00

I am reminded occasionally when debugging production issues that setting the PMI level in WebSphere Application Server to "all" is not a good thing to do. At the "all" setting one can see an application exhibit negative behavior. I am also learning that no two applications necessarily exhibit the same problem. Some applications crash and burn under the weight of PMI=all, incapable of providing a response to any request. Other applications seem to continue to function nominally but the CPU utilization for those processes seems to be considerably higher.

Finally, regardless of the PMI settings configured in the production environment it is imperative that all members of a cluster are set to the same values. Having some processes set to one level and other processes set to another results in higher CPU utilization from some JVMs and not others.

1 minute garbage collection cycles

2009-09-23T20:34:00.001-07:00

Does your application use RMI? Are you seeing garbage collection cycles with exclusiveaccess every minute (60,000 ms)? You may need to add the following two parameters, which set the RMI distributed garbage collection interval in milliseconds, to the JVM command line.

-Dsun.rmi.dgc.client.gcInterval=360000000 -Dsun.rmi.dgc.server.gcInterval=360000000

Essential log attributes to log - duration of a request in access.log

2009-09-22T17:13:00.000-07:00

It is absolutely imperative that the access.log file from your Web server record the total response time as measured by the Web server. This is available in all flavours of Web servers. It provides definitive response time numbers as they leave the Web server that can be used as a second data point to the application monitor's servlet response time. The log entry also allows sysadmins a chance to put their log monitors on it to alert on long response times.
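
As one example, Apache HTTP Server can record the time taken to serve each request, in microseconds, through the %D directive in a LogFormat such as `"%h %l %u %t \"%r\" %>s %b %D"`. The sketch below assumes that format, with the duration as the last field on the line; the class name AccessLogDuration and the threshold are made up for illustration. It shows the kind of check a log monitor could apply to alert on long response times.

```java
final class AccessLogDuration {
    // Assumes the duration in microseconds is the last
    // space-separated field on the line (the %D position above).
    static long durationMicros(String logLine) {
        String trimmed = logLine.trim();
        return Long.parseLong(trimmed.substring(trimmed.lastIndexOf(' ') + 1));
    }

    // True when the request took longer than the threshold.
    static boolean isSlow(String logLine, long thresholdMillis) {
        return durationMicros(logLine) > thresholdMillis * 1000;
    }
}
```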

Default values - they're not for everyone

2009-09-22T16:56:00.000-07:00

I am frequently asked about the significance, or lack thereof, of default values to one or more configuration items. People need to remember that default values are simply starting points so the environment can be brought up. For example, an operating system has the default value for TCP Keep-Alives set to 2 hours. According to RFC 1122 this is an acceptable default value. However, if you look at the acceptable ranges of values it starts at 10 seconds and can be set as high as 10 days. So, obviously, the default value is not going to work for everyone. Some sites may need to set it low to around 10-15 seconds. Other sites might need a 2 or 3 day setting.

Additionally a comment was made in general that the value of TCP Keep-Alives cannot be set lower than 2 hours. I think it was a misreading of the RFC specification that reads "This interval MUST be configurable and MUST default to no less than two hours." Read that as: the DEFAULT must not be less than two hours. This does not imply that the value cannot be lower than two hours. The lesson learned there is to read the specifications to the letter. Unfortunately I think the emphasis in the RFC on the word "MUST" does distract the reader from the word that follows, "default", and can subtly mislead the reader. Erroneous information is consequently passed on as a rule. Unfortunately, if no one backtracks, reads the RFC and verifies the rule then rampant disinformation is spread and becomes written in stone.

Trust but verify every setting in your environment. Every setting should be tested as thoroughly as possible. Documentation should then record what was tested and what values were selected over others and why. This documentation will be valuable to the group of people maintaining the application 5 years from now and I can guarantee it will not be you the reader. It will be whoever is watching the store after you have moved on to new and exciting career opportunities.

Enable verbose GC

2009-09-16T08:31:00.001-07:00

http://www-01.ibm.com/support/docview.wss?rs=180&uid=swg21114927

I really recommend running in production with verbose GC enabled. The use of the term verbose is unfortunate as it is not as verbose as people think it is. And the data collected is invaluable.

If you have any doubts, enable verbose GC in test and see if you can measure any impact from enabling it.

socketread0 timeout on JDBC connections

2009-09-16T08:11:00.000-07:00

Since I run into this problem every once in a while I'm putting up a note here to help others when they run into it. JDBC calls can sometimes get stuck on socket read calls to the database if some rather nasty network problems exist. The only way to determine if network problems exist is to use tcpdump (AIX, Linux) or snoop (Solaris) to capture the packets. One can then use Wireshark (or its predecessor Ethereal) to read the capture files. If you see issues like "unreassembled packets", "lost segments", "duplicate ACK" or checksum errors then most likely the network is exhibiting some aberrant behaviour affecting the server. If random threads hang on socketRead0 calls that never seem to get a response then the only way to deal with this is through timeouts.

On DB2 use this parameter:

http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.apdv.java.doc/doc/r0052038.html

blockingReadConnectionTimeout
The amount of time in seconds before a connection socket read times out. This property applies only to IBM Data Server Driver for JDBC and SQLJ type 4 connectivity, and affects all requests that are sent to the data source after a connection is successfully established. The default is 0. A value of 0 means that there is no timeout.

For Oracle use:

oracle.jdbc.ReadTimeout

http://forums.oracle.com/forums/thread.jspa?messageID=2326985
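
Both settings are driver properties. In WebSphere they would typically be set as custom properties on the data source; outside a container they can be passed as connection properties. A minimal sketch of building those properties, where the class name JdbcReadTimeouts is hypothetical. The DB2 value is in seconds per the description above. The Oracle value is, to my understanding, in milliseconds, which is an assumption to verify against the driver documentation for your version.

```java
import java.util.Properties;

final class JdbcReadTimeouts {
    // DB2 (IBM Data Server Driver for JDBC and SQLJ, type 4): seconds.
    static Properties db2(int seconds) {
        Properties props = new Properties();
        props.setProperty("blockingReadConnectionTimeout", Integer.toString(seconds));
        return props;
    }

    // Oracle thin driver: assumed to be milliseconds, verify for your driver.
    static Properties oracle(int millis) {
        Properties props = new Properties();
        props.setProperty("oracle.jdbc.ReadTimeout", Integer.toString(millis));
        return props;
    }
}
```

After adding user and password entries, the resulting Properties object can be handed to DriverManager.getConnection(url, props).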

Logs, log identification, log monitoring

2009-08-18T21:15:00.001-07:00

Logs are the ultimate source of information on the health of an application. Logging data helps one understand if something has gone wrong and what might have gone wrong. The problem is that there are two sides to logging.

Developers provide the first side, which is the log statements themselves. I have always found a major/minor log code (i.e. major = error/warn/info/trace; minor = module/method/logic) to be the most straightforward way to provide log information. It assigns each log message being recorded to a specific category.

The operations side of the house can monitor logs and, based on known major log codes, decide if they need to take action. A major code that indicates an error can be immediately dispatched to a field technician, while other levels can be monitored or collected if in troubleshooting mode.

However, if an application development team does not provide any well defined codes for log messages and only provides free form text then the operations/runtime side has difficulty determining if a specific log entry is a problem or not. Unfortunately, most application development teams choose free form text logging as their solution, which makes runtime operations that much more difficult and, in many cases, confusing.
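
To illustrate the major/minor idea, here is a minimal sketch. The class name CodedLog, the MAJOR-minor message layout and the example module name are all assumptions made for this example, not a standard. The developer side stamps each message with a code, and the operations side decides on the major code alone, without parsing free form text.

```java
final class CodedLog {
    enum Major { ERROR, WARN, INFO, TRACE }

    // Developer side: prefix every message with MAJOR-minor,
    // e.g. "ERROR-OrderService.submit: database unavailable".
    static String format(Major major, String minor, String text) {
        return major + "-" + minor + ": " + text;
    }

    // Operations side: dispatch a technician based on the major code only.
    static boolean needsDispatch(String logLine) {
        return logLine.startsWith(Major.ERROR + "-");
    }
}
```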

Keeping notes (AKA your operational run book)

2009-07-28T06:37:00.000-07:00

This morning I was reminded about the importance of keeping notes about the problems I encounter and the solution that provided the fix. A co-worker of mine researching a problem came across my blog post through a Google search. It seems their environment is suffering from the same symptoms. Had I not kept a note about that issue they would have most likely gone through the same level of effort I originally did to research and find the solution. Hopefully when they apply the fix it resolves their problem (I think it will) and they will not have had to spend several hours or days to get to resolution.

This brought to mind the importance of a run book. If you are not familiar with the term, a run book is basically an operations manual. It spells out what needs to be done for various tasks. For example, if we need to deploy a new application into production the run book has a full set of instructions (a recipe if you will) on what to do. Likewise, when the troubleshooter debugs a problem they record in the run book what the problem was, the steps they took to determine root cause, the fixes they tried and which one(s) finally worked. This way should the problem reoccur and a different shift of people see the same problem they can refer to the run book and go through the same steps.

The run book does not have to only address technical details. It can also provide operational response instructions like how to run a war room, who needs to be involved, when various organizations/teams are engaged, what triggers an engagement, etc. Every operational detail can be recorded in the run book.

Does your production environment have a run book?

Finding information (like tuning guides)

2009-07-20T08:33:00.000-07:00

I frequently get queries about finding information on one topic or another (generally around performance or application monitoring). A co-worker in Australia was looking for some general WebSphere Web services tuning guides. IBM.com has a wealth of information on a variety of topics so I thought I would put together this post on how to search for information.

This particular google search is one I used to find information for Web services tuning.

You'll note the key in the search string is the beginning "site:ibm.com" which restricts the search to only those items found at ibm.com.

Latest article

2009-07-20T08:31:00.001-07:00

I forgot to post when my dW article came out. This is a new series I'm going to write on defensive architectures. Part 2 is being reviewed by my colleagues right now.

Problem Determination - High CPU strategies on Windows OS

2009-07-09T12:37:00.000-07:00

I've been quiet the past couple of months but that is because I have an article coming out on developerWorks soon. I'll post a link to that when it is ready.

A colleague called me today to talk about a high CPU scenario and steps to take to try and resolve it. My first recommendation was to get thread dumps of the JVM when it hits the high CPU scenario. Then I found out the JVM is running on a Windows OS. Since it is not possible to execute a kill -3 on Windows, one has to use the script methodology, and with no CPU available it is tough to get a thread dump.

I suggested he collect thread dumps as the CPU starts to climb. This works if (a) they can predict when the CPU will start to climb and (b) it climbs slowly enough to be able to collect javacores along the trajectory. It sounds like the CPU can spike in a matter of seconds even at steady load volumes. Ugh.

My final thought is to cut the thread pool max in half and rerun the test. Perhaps there are simply too many threads executing and nothing will prevent it from going 100% CPU. Though I have seen code that can drive even an 8-way to 100% with just a few threads. I'm waiting to hear back. But I think the latter strategy of reducing the thread pool max will be the best course of action for them. I expect an email in the morning.

Predicting Capacity

2009-03-26T15:13:00.001-07:00

I have had a number of calls around "We have an application, not built yet, but we want to know if we have enough capacity and/or will the application scale/perform in our environment."

My analogy to this is the following. Someone shows me the blueprint to a Formula 1 (or NASCAR) race car. They ask me "will this car win the race?"

While the car being proposed may have all the right components (wheels, engine, gearbox, etc) there are other unknown variables that determine if the car can win a race. Without extensive testing there's no way to know if all the components are assembled correctly nor if they are configured to work together in an optimal manner (performance tuned).

Related to this is the ability of the car to both get off the starting line and finish the race (a prerequisite for winning). Again, there are a number of variables that impact this. First is the chance of a catastrophic failure in one of the components. Then there is the skill of the car's driver. Then there are the other drivers on the same racetrack who may crash into our car.

There is simply no way to predict if the car can win much less finish a race.

The same attitude has to be taken when planning your production IT environment. You can't predict behaviour. You can, on the other hand, conduct stringent system integration and performance testing in order to see if your application will (or will not) perform as expected in production.

This is why everyone should test as early in the development cycle as possible. I was working with one application that had been in production for several months with constant, recurring failures. There were fundamental application design problems in how the application was developed. It took us four more months to fix those design problems and get the final, new version of the application into production. Don't put off testing. Do it right. Do it early. Test early. Test often. That is the key to success in production.