Wednesday, September 30, 2009


Last year I blogged about handling IPv6 lookups that caused threads to wait, slowing them down. Not surprisingly, I've come across this a few times since then. What is interesting is the effect it can have on throughput and response time. Testing showed that once the fix was enabled, the test saw 8% improvements in both throughput and response time. That is a big difference and a good reason to periodically take thread dumps even if everything seems to be running normally. One never knows when they may uncover a gem of a performance boost like this one.

Thursday, September 24, 2009

PMI = all is not a good setting for production

I am reminded occasionally when debugging production issues that setting the PMI level in WebSphere Application Server to "all" is not a good thing to do. At the "all" setting, an application can start to misbehave. I am also learning that no two applications necessarily exhibit the same problem. Some applications crash and burn under the weight of PMI=all, unable to respond to any request. Other applications seem to continue to function normally, but the CPU usage for those processes is considerably higher.

Finally, regardless of the PMI settings configured in the production environment, it is imperative that all members of a cluster are set to the same values. Having some processes set to one level and other processes set to another results in higher CPU utilization on some JVMs and not others.

Wednesday, September 23, 2009

1 minute garbage collection cycles

Does your application use RMI? Are you seeing garbage collection cycles with exclusiveaccess every minute (60,000 ms)? You may need to apply the following two parameters to the JVM command line.

-Dsun.rmi.dgc.client.gcInterval=360000000 -Dsun.rmi.dgc.server.gcInterval=360000000
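
Both values are in milliseconds, so 360000000 works out to 100 hours between RMI-triggered collections. Below is a minimal sketch for confirming that a running JVM actually picked up the flags. The class name RmiDgcIntervalCheck is hypothetical and not part of any product; the property names are the two shown above.

```java
import java.time.Duration;

// Hypothetical helper: reports the RMI distributed GC intervals seen by this JVM.
final class RmiDgcIntervalCheck {
    static final String[] PROPERTIES = {
        "sun.rmi.dgc.client.gcInterval",
        "sun.rmi.dgc.server.gcInterval"
    };

    /** Parses a raw property value in milliseconds; returns null if unset or not a number. */
    static Duration parseInterval(String rawMillis) {
        if (rawMillis == null) {
            return null;
        }
        try {
            return Duration.ofMillis(Long.parseLong(rawMillis.trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        for (String name : PROPERTIES) {
            Duration interval = parseInterval(System.getProperty(name));
            String shown = (interval == null)
                    ? "not set (JVM default applies)"
                    : interval.toHours() + " hours";
            System.out.println(name + " = " + shown);
        }
    }
}
```

Run it with the same -D flags as the application server to see what the JVM will use.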

Tuesday, September 22, 2009

Essential attributes to log - duration of a request in access.log

It is absolutely imperative that the access.log file from your Web server record the total response time as measured by the Web server. This is available in all flavours of Web servers. It provides definitive response time numbers as responses leave the Web server, which can be used as a second data point alongside the application monitor's servlet response time. The log entry also gives sysadmins a chance to point their log monitors at it and alert on long response times.
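
As a rough illustration of the kind of log monitor meant here, the sketch below flags slow requests. It assumes the response time is logged in microseconds as the last field of each line (as you would get, for example, by appending Apache's %D to the log format); the class name, sample lines and threshold are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical monitor: assumes the LAST field of each access.log line
// is the request duration in microseconds.
final class SlowRequestFinder {

    /** Returns the duration from the last field, or -1 if it is not a number. */
    static long durationMicros(String line) {
        String trimmed = line.trim();
        String lastField = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        try {
            return Long.parseLong(lastField);
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    /** Returns the lines whose duration exceeds the threshold. */
    static List<String> slowerThan(List<String> lines, long thresholdMicros) {
        List<String> slow = new ArrayList<>();
        for (String line : lines) {
            if (durationMicros(line) > thresholdMicros) {
                slow.add(line);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "10.0.0.1 - - [22/Sep/2009:10:15:01 -0400] \"GET /app/home HTTP/1.1\" 200 5120 184233",
            "10.0.0.2 - - [22/Sep/2009:10:15:02 -0400] \"GET /app/report HTTP/1.1\" 200 9000 7250000");
        // Alert on anything slower than 2 seconds (an arbitrary example threshold).
        for (String line : slowerThan(sample, 2_000_000L)) {
            System.out.println("SLOW: " + line);
        }
    }
}
```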

Default values - they're not for everyone

I am frequently asked about the significance, or lack thereof, of the default values for one or more configuration items. People need to remember that default values are simply starting points so the environment can be brought up. For example, an operating system has the default value for TCP Keep-Alives set to 2 hours. According to RFC 1122 this is an acceptable default value. However, if you look at the acceptable range of values, it starts at 10 seconds and can be set as high as 10 days. So, obviously, the default value is not going to work for everyone. Some sites may need to set it low, to around 10-15 seconds. Other sites might need a 2 or 3 day setting.
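
Note that the keep-alive interval itself is tuned in the operating system; a Java application only opts a socket in to keep-alive probes. A minimal sketch of that opt-in using the standard java.net.Socket API follows (the class and method names are illustrative only):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

// Illustrative only: shows the per-socket opt-in, not the OS-level interval.
final class KeepAliveSketch {

    /** Enables SO_KEEPALIVE on a fresh, unconnected socket and reports the resulting flag. */
    static boolean keepAliveCanBeEnabled() {
        try (Socket socket = new Socket()) {
            socket.setKeepAlive(true);
            // How often probes are sent is decided by the operating system setting.
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + keepAliveCanBeEnabled());
    }
}
```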

Additionally, a general comment was made that the value of TCP Keep-Alives can not be set lower than 2 hours. I think this was a misreading of the RFC, which reads "This interval MUST be configurable and MUST default to no less than two hours." Read that as: the DEFAULT must not be less than two hours. It does not imply that the value can not be set lower than two hours. The lesson learned here is to read specifications to the letter. Unfortunately, I think the emphasis in the RFC on the word "MUST" distracts the reader from the word "default" that follows and can subtly mislead. Erroneous information is consequently passed on as a rule. If no one goes back, reads the RFC and verifies the rule, then disinformation spreads and becomes written in stone.

Trust but verify every setting in your environment. Every setting should be tested as thoroughly as possible. Documentation should then record what was tested, which values were selected over others, and why. This documentation will be valuable to the group of people maintaining the application 5 years from now, and I can guarantee it will not be you, the reader. It will be whoever is watching the store after you have moved on to new and exciting career opportunities.

Wednesday, September 16, 2009

Enable verbose GC

I really recommend running in production with verbose GC enabled. The use of the term verbose is unfortunate as it is not as verbose as people think it is. And the data collected is invaluable.

If you have any doubts, enable verbose GC in test and see if you can measure any impact from enabling it.

socketread0 timeout on JDBC connections

Since I run into this problem every once in a while, I'm putting up a note here to help others when they run into it. JDBC calls can sometimes get stuck on socket read calls to the database if some rather nasty network problems exist. The only way to determine if network problems exist is to use tcpdump (AIX, Linux) or snoop (Solaris) to capture the packets. One can then use Wireshark (or its predecessor Ethereal) to read the capture files. If you see issues like "unreassembled packets", "lost segments", "duplicate ACK" or checksum errors, then most likely the network is exhibiting aberrant behaviour that affects the server. If random threads hang on socketRead0 calls that never seem to get a response, then the only way to deal with this is through timeouts.
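
To make the idea concrete, here is a minimal sketch of what a read timeout does at the plain socket level. It is not JDBC code: it uses java.net.Socket.setSoTimeout against a local server that never sends anything, and the class and method names are made up for the example. The JDBC driver properties play the same role for database connections.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative only: a read that would block forever is cut short by a timeout.
final class SocketReadTimeoutDemo {

    /** Connects to a silent local server and returns true if the read timed out. */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {
            // Without this call the read below would wait indefinitely.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read();
                return false; // data or end-of-stream arrived, so no timeout
            } catch (SocketTimeoutException expected) {
                return true;  // the thread is released instead of hanging
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Read timed out: " + readTimesOut(500));
    }
}
```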

On DB2, use this parameter:

The amount of time in seconds before a connection socket read times out. This property applies only to IBM Data Server Driver for JDBC and SQLJ type 4 connectivity, and affects all requests that are sent to the data source after a connection is successfully established. The default is 0. A value of 0 means that there is no timeout.

For Oracle use: