Thursday, December 24, 2009

Catching up on closing up 2009

It has been a busy few weeks for me, which is one reason I haven't posted in a while. Some things I've been thinking about lately revolve around high availability, switches and logical partitioning of applications and application servers. I'll try to post more on these subjects over the next few weeks.

Hope you all have a great holiday season and a Happy New Year.  I'll see you on the other side of 2010!

Tuesday, November 10, 2009

The people side of performance

Those of you who read my blog know that I tend to primarily address technical topics when it comes to performance. The latest issue of Communications of the ACM (10/2009 vol 52 No 10) has an article called "The Business of Software: Contagious Craziness, Spreading Sanity" by Philip G Armour. I was intrigued by some of the points Philip brought up in his article, particularly this tidbit:

"... having a strong conviction that something cannot be done is usually a self-fulfilling prophecy. If people are convinced that something is not achievable, then they usually won't achieve it - if we argue for our limitations, we get to keep them."

Wow! Nail on the head. So how do we convince an organization that something that is currently perceived as impossible actually is quite achievable?

First off, people have to be encouraged to think outside the box. George Patton is quoted as saying "If everyone is thinking alike, someone isn't thinking." Coming from the battlefield, I think we can understand why the General said this. He needed his commanders to visualize not only the current plan of operations but also what alternatives and/or competing strategies were available. One of my friends, Keys Botzum, who also happens to be an IBM STSM, once put it this way: "The patient is dying on the table. What can we do, right now, to keep him from dying." It is a different perspective for techies to mull over.

Likewise, a technical team cannot be led by consensus. Technical teams have to be led by dictatorship. Someone, at the top, needs to make technical decisions and take responsibility for those decisions. Right or wrong, someone, like General Patton, has to tell the troops which way to charge. He can take input from his lieutenants; however, the decision is ultimately his, and his alone, and it is on him to ensure it is the correct one. Which is why the other side, led by mad men who knew nothing about how to fight a war, lost. The General knew what he was doing and knew how to lead his people to victory.

Which brings us to our next topic. In order to be able to lead a technical team the leader has to be technical. I understand management's philosophy that they are managing people and not the technology. The problem is that technology, by its very nature, is complex. This is simply unavoidable. We're not managing a bunch of lawyers (sorry to pick on lawyers; you're used to it by now, and you make a good example). What lawyers do is not very hard to comprehend. They sue people. They defend people. They argue in court for their client. They draw up contracts, etc. None of this is rocket science. Start running multiple servers in parallel handling thousands of transactions per second and that organization is starting to approach NASA-like technical complexity issues. In order for a technical team to succeed they must be led by a strong technical person. It is unreasonable to expect someone who does not comprehend the subtleties of technical subjects such as "verbose GC" or "JDBC deadlocks" to be capable of making a technical decision, much less the correct one.

In conclusion, let's go back to the "self-fulfilling prophecy" and the idea "that something is not achievable" that started this line of thinking for me. There are two ways to avoid failure. One is to have a strong technical leader who can take the reins as the technical General who makes the decisions. The other is for executive management to support the technical leaders. This means not only vocal agreement but also dollars for the project, making people available to do the tasks and pulling any other strings that need to be pulled in order to make things happen.

I'll have more to say on this subject later as this is about all I can muster in the limited time I had for this "lunch & learn" session.

Thursday, November 5, 2009

Managing the application threads

Threads and multi-threaded programming are not trivial subjects. But some application developers find themselves having to start their own threads as part of their application.

I'll start this post off saying that using unmanaged threads is never a good thing. When using threads developers should be looking at how to manage all their threads in their application. Never exit the application without terminating all of your threads. Why would we need to do this?

An interesting problem occurs when an application is stopped (not the JVM/app-server but just the application) and a new version is deployed, but the restart fails because there are still threads hanging around from the previous version of the application, along with older versions of the same classes. I never would have imagined this problem occurring, but I can see that if the application lost track of its threads it would not be able to close them when it shut down.

Java provides the capability to manage all the application threads by using thread groups. If every thread the application creates is assigned to a thread group, all the application has to do is capture the "application stop event" and use the stop method of the thread group, thereby stopping all the threads in the thread group.

The beauty of this solution is that the application can reliably manage all of the threads it creates through a single interface. It is also portable, which means your application will work with any application server.
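
As a rough illustration of the thread group idea, here is a minimal sketch. The class name AppThreads and its methods are made up for this example, and how the "application stop event" is captured (a servlet context listener is one possibility) is left as an assumption. One deliberate difference from the text: ThreadGroup.stop() has been deprecated for a long time as unsafe, so this sketch calls interrupt() on the group instead, which only works if the tasks cooperate by ending when they are interrupted.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: the class and method names are illustrative only.
class AppThreads {
    // Every thread the application starts is created inside this one group.
    private final ThreadGroup group = new ThreadGroup("my-app-threads");

    Thread start(Runnable task, String name) {
        Thread t = new Thread(group, task, name);
        t.start();
        return t;
    }

    // Call this from the application's stop event. It interrupts every thread
    // in the group and waits up to timeoutMillis for them to finish.
    // Returns true if no thread in the group is still running.
    boolean shutdown(long timeoutMillis) {
        group.interrupt(); // cooperative: tasks must honor interruption
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (group.activeCount() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        return group.activeCount() == 0;
    }

    public static void main(String[] args) {
        AppThreads threads = new AppThreads();
        threads.start(() -> {
            try {
                while (true) {
                    Thread.sleep(1000); // simulated background work
                }
            } catch (InterruptedException e) {
                // Interrupted by shutdown(): fall through and let the thread end.
            }
        }, "worker-1");
        System.out.println("all threads stopped: " + threads.shutdown(2000));
    }
}
```

If shutdown returns false, some thread ignored the interrupt, and that is exactly the kind of leftover thread that can get in the way of the next deployment.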

High availability through parallel cells

High availability sites have unique requirements that are not easy to solve. However, one solution does revolve around using multiple cells in production. This allows for easily taking a cell out of rotation to make changes. Then, if the changes didn't work once the cell goes live, it is just as easy to take the cell out of rotation again. I've used this strategy at many, many 24 x 7 sites.

Here is the link to my latest article on multiple cells.

The only problem that arises is if things like database schemas change in the backend which also means application code changes. Those kinds of changes require more work and possibly parallel schemas until the transition is complete.

Monday, October 19, 2009

Running out of disk space

An event occurred this morning that reminds me that performance problems are sometimes environmentally driven. One application went down with various errors related to the file system. After some investigation it was discovered that /tmp had run out of space. Yet no one knew it had happened. It turns out that no monitors are in place to watch for resource exhaustion like low disk space.

One could ask why /tmp filled up if daily (automated) housekeeping tasks were in place (they were not). But that is a pointless argument if someone moved a whole bunch of files into /tmp. They could have moved files in there to install them, intending to delete them when the install was done. Or they could have been log files being transferred to another environment via ftp. In any case, an automated monitoring tool should have alerted the operators that disk space was close to exhaustion and that they needed to take action to remedy the situation.

If automated monitoring is not in place then the production applications that make use of /tmp space will fail causing an outage. And without alerts about low disk space it can take several hours for the operations and troubleshooting teams to figure out that the application failed because it could not write out a temporary file. A several hour outage that could have been completely avoided had the right resource monitors been in place.

Other resource monitors should be looking at CPU utilization, page file activity, etc.
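
To make the disk space check concrete, here is a small sketch of the kind of test a monitor would run on a schedule. It is not a substitute for a real monitoring tool. The class name DiskSpaceCheck and the 10% threshold are assumptions for the example; the calls themselves (File.getTotalSpace and File.getUsableSpace) are standard Java.

```java
import java.io.File;

// Hypothetical example class; the threshold below is an assumed value.
class DiskSpaceCheck {
    // Percentage of the partition that is still usable, from 0.0 to 100.0.
    static double percentFree(File path) {
        long total = path.getTotalSpace();
        if (total == 0L) {
            return 0.0; // path does not exist or its size is unknown
        }
        return 100.0 * path.getUsableSpace() / total;
    }

    // True when free space has dropped below the alert threshold.
    static boolean isLow(double percentFree, double thresholdPercent) {
        return percentFree < thresholdPercent;
    }

    public static void main(String[] args) {
        // java.io.tmpdir normally points at /tmp on UNIX-like systems.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        double free = percentFree(tmp);
        if (isLow(free, 10.0)) { // 10% is an assumed threshold, tune per site
            System.err.printf("ALERT: %s has only %.1f%% free%n", tmp, free);
        } else {
            System.out.printf("%s has %.1f%% free%n", tmp, free);
        }
    }
}
```

The important part is not the code but that something runs it regularly and pages a human before the file system is full.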

Wednesday, September 30, 2009

IPv6

Last year I blogged about handling IPv6 lookups that caused threads to wait, thus slowing them down. Not surprisingly I've come across this a few times since then. What is interesting is the effect it can have on throughput and response time. Testing showed that once -Djava.net.preferIPv4Stack=true was enabled the test had 8% improvements in both throughput and response time. This is truly interesting data. That is a big difference and a good reason to periodically take thread dumps even if everything seems to be running nominally. One never knows when they may uncover a gem of a performance boost like this one.

Thursday, September 24, 2009

PMI = all is not a good setting for production

I am reminded occasionally when debugging production issues that setting the PMI level in WebSphere Application Server to "all" is not a good thing to do. At the "all" setting one can see an application exhibit negative behavior. I am also learning that no two applications will necessarily exhibit the same problem. Some applications crash and burn under the weight of PMI=all, incapable of providing a response to any request. Other applications seem to continue to function nominally but the CPU for those processes seems to be considerably higher.

Finally, regardless of the PMI settings configured in the production environment it is imperative that all members of a cluster are set to the same values. Having some processes set to one level and other processes set to another results in higher CPU utilization from some JVMs and not others.

Wednesday, September 23, 2009

1 minute garbage collection cycles

Does your application use RMI? Are you seeing garbage collection cycles with exclusiveaccess every minute (60,000 ms)? You may need to apply the following two parameters to the JVM command line. Both values are in milliseconds, so 360000000 stretches the interval to 100 hours.

-Dsun.rmi.dgc.client.gcInterval=360000000 -Dsun.rmi.dgc.server.gcInterval=360000000

Tuesday, September 22, 2009

Essential log attributes to log - duration of a request in access.log

It is absolutely imperative that the access.log file from your Web server record the total response time as measured by the Web server. This is available in all flavours of Web servers. It provides definitive response time numbers as they leave the Web server that can be used as a second data point to the application monitor's servlet response time. The log entry also allows sysadmins a chance to put their log monitors on it to alert on long response times.
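
As an example of what a log monitor could do with that field, here is a minimal sketch. It assumes an Apache-style access.log where the duration directive %D (microseconds in Apache httpd) has been appended as the last field of the log format. The class name, the sample line and the 3 second threshold are all made up for illustration; other Web servers use different directives and units.

```java
// Hypothetical example: assumes the request duration in microseconds is the
// last whitespace-separated field on each access.log line.
class AccessLogDuration {
    // Returns the request duration in milliseconds, or -1 if the line does
    // not end with a number.
    static long durationMillis(String logLine) {
        String trimmed = logLine.trim();
        String lastField = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        try {
            return Long.parseLong(lastField) / 1000L; // microseconds to milliseconds
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // True when the request took longer than the alert threshold.
    static boolean isSlow(String logLine, long thresholdMillis) {
        return durationMillis(logLine) > thresholdMillis;
    }

    public static void main(String[] args) {
        // Made-up sample line; the final field is 4250000 microseconds.
        String line = "10.0.0.1 - - [22/Sep/2009:10:15:32 -0500] "
                + "\"GET /app/order HTTP/1.1\" 200 5120 4250000";
        System.out.println("duration ms: " + durationMillis(line));
        System.out.println("slow: " + isSlow(line, 3000L));
    }
}
```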

Default values - they're not for everyone

I am frequently asked about the significance, or lack thereof, of default values for one or more configuration items. People need to remember that default values are simply starting points so the environment can be brought up. For example, an operating system has the default value for TCP Keep-Alives set to 2 hours. According to RFC 1122 this is an acceptable default value. However, if you look at the acceptable range of values it starts at 10 seconds and can be set as high as 10 days. So, obviously, the default value is not going to work for everyone. Some sites may need to set it low, to around 10-15 seconds. Other sites might need a 2 or 3 day setting.

Additionally, a comment was made that the value of TCP Keep-Alives cannot be set lower than 2 hours. I think it was a misreading of the RFC specification that reads "This interval MUST be configurable and MUST default to no less than two hours." Read that as: the DEFAULT must not be less than two hours. This does not imply that the value cannot be lower than two hours. The lesson learned there is to read the specifications to the letter. Unfortunately I think the emphasis in the RFC on the word "MUST" does distract the reader from the word that follows, "default", and can subtly mislead the reader. Erroneous information is consequently passed on as a rule. Unfortunately, if no one backtracks, reads the RFC and verifies the rule, then rampant disinformation is spread and becomes written in stone.

Trust but verify every setting in your environment. Every setting should be tested as thoroughly as possible. Documentation should then record what was tested and what values were selected over others and why. This documentation will be valuable to the group of people maintaining the application 5 years from now and I can guarantee it will not be you the reader. It will be whoever is watching the store after you have moved on to new and exciting career opportunities.

Wednesday, September 16, 2009

Enable verbose GC

http://www-01.ibm.com/support/docview.wss?rs=180&uid=swg21114927

I really recommend running in production with verbose GC enabled. The use of the term verbose is unfortunate as it is not as verbose as people think it is. And the data collected is invaluable.

If you have any doubts, enable verbose GC in test and see if you can measure any impact from enabling it.

socketread0 timeout on JDBC connections

Since I run into this problem every once in a while I'm putting up a note here to help others when they run into it. JDBC calls can sometimes get stuck on socket read calls to the database if some rather nasty network problems exist. The only way to determine if network problems exist is to use tcpdump (AIX, Linux) or snoop (Solaris) to capture the packets. One can then use Wireshark (or its predecessor Ethereal) to read the capture files. If you see issues like "unreassembled packets", "lost segments", "duplicate ACK" or checksum errors then most likely the network is exhibiting some aberrant behaviour affecting the server. If random threads hang on socketRead0 calls that never seem to get a response then the only way to deal with this is through timeouts.

On DB2 use this parameter:

http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.apdv.java.doc/doc/r0052038.html

blockingReadConnectionTimeout
The amount of time in seconds before a connection socket read times out. This property applies only to IBM Data Server Driver for JDBC and SQLJ type 4 connectivity, and affects all requests that are sent to the data source after a connection is successfully established. The default is 0. A value of 0 means that there is no timeout.

For Oracle use:

oracle.jdbc.ReadTimeout

http://forums.oracle.com/forums/thread.jspa?messageID=2326985
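
Here is a sketch of how these two properties might be passed as JDBC connection properties. The class and method names are invented for the example and the timeout values are placeholders. The DB2 value is in seconds, per the description quoted above. The Oracle value is assumed here to be in milliseconds, so verify that against the documentation for your driver version. In an application server these would normally be set as custom properties on the data source rather than in code.

```java
import java.util.Properties;

// Hypothetical example: builds connection properties carrying a socket read timeout.
class JdbcReadTimeouts {
    // DB2 (IBM Data Server Driver for JDBC and SQLJ, type 4): value in seconds.
    static Properties db2Properties(String user, String password, int timeoutSeconds) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("blockingReadConnectionTimeout", Integer.toString(timeoutSeconds));
        return props;
    }

    // Oracle thin driver: value assumed to be in milliseconds (check your driver docs).
    static Properties oracleProperties(String user, String password, int timeoutMillis) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("oracle.jdbc.ReadTimeout", Integer.toString(timeoutMillis));
        return props;
    }

    public static void main(String[] args) {
        Properties db2 = db2Properties("appuser", "secret", 30);
        Properties oracle = oracleProperties("appuser", "secret", 30000);
        // With the vendor driver on the classpath, the properties would be handed to
        // java.sql.DriverManager.getConnection(url, props). No connection is opened here.
        System.out.println("DB2 read timeout (s): "
                + db2.getProperty("blockingReadConnectionTimeout"));
        System.out.println("Oracle read timeout (ms): "
                + oracle.getProperty("oracle.jdbc.ReadTimeout"));
    }
}
```

Pick a timeout comfortably longer than your slowest legitimate query, otherwise healthy requests will start failing.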

Tuesday, August 18, 2009

Logs, log identification, log monitoring

Logs are the ultimate source of information on the health of an application. Logging data helps us understand if something has gone wrong and what might have gone wrong. The problem is that there are two sides to logging.

Developers provide the first level, which is the log statements themselves. I have always found a major/minor log code (i.e. major = error/warn/info/trace; minor = module/method/logic) to be the most straightforward way to provide log information. Assign each log message being recorded to a specific category.

On the operations side of the house, they can monitor logs and, based on known major log codes, decide if they need to take action. A major code that indicates an error can be immediately dispatched to a field technician, while other levels can be monitored, or collected if in troubleshooting mode.

However, if an application development team does not provide any well defined codes for log messages and provides only free-form text then the operations/runtime side has difficulty determining if a specific log entry is a problem or not. Unfortunately, most application development teams choose free-form text logging as their solution, which makes runtime operations that much more difficult and, in many cases, confusing.
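
To show both sides of this, here is a minimal sketch of a major/minor log code. The "MAJOR|minor|message" layout, the class name and the sample minor codes are my own invention for the example, not a standard. The first method is what developers would call; the other two are what a log monitor on the operations side could do with the result.

```java
// Hypothetical example of a major/minor log code scheme.
class LogCodes {
    enum Major { ERROR, WARN, INFO, TRACE }

    // Developer side: emit "MAJOR|minor|message", where minor names the
    // module/method/logic that produced the entry.
    static String format(Major major, String minor, String message) {
        return major + "|" + minor + "|" + message;
    }

    // Operations side: extract the major code, or null for free-form text.
    static Major majorOf(String logLine) {
        int sep = logLine.indexOf('|');
        if (sep < 0) {
            return null;
        }
        try {
            return Major.valueOf(logLine.substring(0, sep));
        } catch (IllegalArgumentException e) {
            return null; // text before the separator is not a known major code
        }
    }

    // Only entries with a major code of ERROR are dispatched immediately.
    static boolean needsDispatch(String logLine) {
        return majorOf(logLine) == Major.ERROR;
    }

    public static void main(String[] args) {
        String coded = format(Major.ERROR, "OrderService.submit", "database unreachable");
        String freeForm = "something odd happened while submitting the order";
        System.out.println(coded + " -> dispatch: " + needsDispatch(coded));
        System.out.println(freeForm + " -> dispatch: " + needsDispatch(freeForm));
    }
}
```

Notice that the free-form entry cannot be classified at all, which is the operations team's problem in a nutshell.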

Tuesday, July 28, 2009

Keeping notes (AKA your operational run book)

This morning I was reminded about the importance of keeping notes about the problems I encounter and the solution that provided the fix. A co-worker of mine researching a problem came across my blog post through a Google search. It seems their environment is suffering from the same symptoms. Had I not kept a note about that issue they would most likely have gone through the same level of effort I originally did to research and find the solution. Hopefully when they apply the fix it resolves their problem (I think it will) without their having to spend several hours/days to get to resolution.

This brought to mind the importance of a run book. If you are not familiar with the term, a run book is basically an operations manual. It spells out what needs to be done for various tasks. For example, if we need to deploy a new application into production the run book has a full set of instructions (a recipe if you will) on what to do. Likewise, when the troubleshooter debugs a problem they record in the run book what the problem was, the steps they took to determine root cause, the fixes they tried and which one(s) finally worked. This way, should the problem reoccur and a different shift of people see the same problem, they can refer to the run book and go through the same steps.

The run book does not have to address only technical details. It can also provide operational response instructions like how to run a war room, who needs to be involved, when various organizations/teams are engaged, what triggers an engagement, etc. Every operational detail can be recorded in the run book.

Does your production environment have a run book?

Monday, July 20, 2009

Finding information (like tuning guides)

I frequently get queries about finding information on one topic or another (generally around performance or application monitoring). A co-worker in Australia was looking for some general WebSphere Web services tuning guides. IBM.com has a wealth of information on a variety of topics so I thought I would put together this post on how to search for information.

This particular google search is one I used to find information for Web services tuning.



You'll note the key in the search string is the beginning "site:ibm.com" which restricts the search to only those items found at ibm.com.

Latest article

I forgot to post when my dW article came out. This is a new series I'm going to write on defensive architectures. Part 2 is being reviewed by my colleagues right now.

Thursday, July 9, 2009

Problem Determination - High CPU strategies on Windows OS

I've been quiet the past couple of months but that is because I have an article coming out on developerWorks soon. I'll post a link to that when it is ready.

A colleague called me today to talk about a high CPU scenario and steps to take to try and resolve it. My first recommendation was to get thread dumps of the JVM when it hits the high CPU scenario. Then I found out the JVM is running on a Windows OS. Since it is not possible to execute a kill -3 on Windows, one has to use the script methodology, and with no CPU available it is tough to get a thread dump.

I suggested he collect thread dumps as the CPU starts to climb. This works if (a) they can predict when the CPU will start to climb and (b) it climbs slowly enough to be able to collect javacores along the trajectory. It sounds like the CPU can spike in a matter of seconds even at steady load volumes. Ugh.

My final thought is to cut the thread pool max in half and rerun the test. Perhaps there are simply too many threads executing and nothing will prevent it from going to 100% CPU. Though I have seen code that can drive even an 8-way to 100% with just a few threads. I'm waiting to hear back. But I think the latter strategy of reducing the thread pool max will be the best course of action for them. I expect an email in the morning.
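
For the thread dump side of this problem, one platform-neutral option is to have the JVM report on its own threads through the standard java.lang.management API. This is only a sketch under assumptions: the class name is made up, it would have to run inside the target JVM (for example from a watchdog thread that fires as CPU starts to climb), and because it needs CPU itself it is no cure for a JVM that is already pegged. It is not the javacore the script methodology produces, just similar information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical example: an in-process report of thread CPU time and stacks.
class BusiestThreads {
    // Builds a text report of every live thread with its CPU time and stack trace.
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuTimes = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            // getThreadCpuTime returns nanoseconds, or -1 when it is unavailable.
            long cpuNanos = cpuTimes ? mx.getThreadCpuTime(info.getThreadId()) : -1L;
            long cpuMillis = cpuNanos < 0 ? -1L : cpuNanos / 1_000_000L;
            out.append(String.format("\"%s\" state=%s cpuMs=%d%n",
                    info.getThreadName(), info.getThreadState(), cpuMillis));
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append(System.lineSeparator());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Taking two or three of these reports a few seconds apart and comparing the cpuMs values shows which threads are actually burning the CPU.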

Thursday, March 26, 2009

Predicting Capacity

I have had a number of calls around "We have an application, not built yet, but we want to know if we have enough capacity and/or will the application scale/perform in our environment."

My analogy to this is the following. Someone shows me the blueprint to a Formula 1 (or NASCAR) race car. They ask me "will this car win the race?"

While the car being proposed may have all the right components (wheels, engine, gearbox, etc) there are other unknown variables that determine if the car can win a race. Without extensive testing there's no way to know if all the components are assembled correctly nor if they are configured to work together in an optimal manner (performance tuned).

Related to this is the ability of the car to both get off the starting line and finish the race (a prerequisite for winning). Again, there are a number of variables that impact this. First is some catastrophic failure in one of the components. Then there is the skill of the car's driver. Then there are the other drivers on the same racetrack who might crash into our car.

There is simply no way to predict if the car can win much less finish a race.

The same attitude has to be taken when planning your production IT environment. You can't predict behaviour. You can, on the other hand, conduct stringent system integration and performance testing in order to see if your application will (or will not) perform as expected in production.

This is why everyone should test as early in the development cycle as possible. I was working with one application that had been in production for several months with constant, recurring failures. There were fundamental application design problems in how the application was developed. It took us four more months to fix those design problems and get the final, new version of the application into production. Don't put off testing. Do it right. Do it early. Test early. Test often. That is the key to success in production.

Wednesday, February 18, 2009

Too many testers?

Wow. I never thought I'd see a statement so outrageous. Someone, in this case Microsoft, has too many testers? Can this be true? Is this why Windows has fix after fix every week? Because their testing is so good, and they have so many testers, that they can get rid of the excess?

I can't believe that is even remotely true. Most organizations suffer from not having enough testers. This is why applications deployed in production suffer. Something had to be cut out of the test plan because they either couldn't test something or didn't have the time/people.

A technology enterprise should not cut technical folks out of their organization. 

I, Cringely - Cringely on technology
"There are too many testers"


Thursday, January 29, 2009

When Performance Counts

Macys.com holiday sales soar 26%

A site that performs well helps your bottom line. Note in the article that sales during the critical month of December 2008 increased 39.1% over December 2007. Obviously several factors play into this kind of success. The site had to remain available so users could access it. The site had to perform in order for users to stay connected and not go somewhere else.

Tuesday, January 20, 2009

Golden Rule #1 - It is what you don't test that breaks in production

I'm putting together information for some folks around performance. So I'm coming up with a set of golden rules. My first one is:

It is what you don't test that breaks in production.

This isn't specific to WebSphere products either. This goes pretty much across the board.

I work with quite a few people on performance issues. There are two types of problems that could have easily been avoided had the testing been conducted properly.

(a) 80/20 rule

Some people live by an 80/20 rule. We'll test the 20% of the application that 80% of the users use. Um, what about the other 80% that goes untested? What if that brings down the site even if only a single user hits it? I'd rather keep the site up and test everything. Wouldn't you?

(b) Boundary value problems

Everything takes input. Not every application validates input. This leads to problems with applications running unbounded database queries because the filter wasn't filled out correctly. Or someone fills in all 250 rows on a page and crashes the site because the testers missed that point. Every use case has boundaries. Test the zero case, the in between case and then infinity. To infinity and beyond my friends! That is where we can take our sites if we do the testing right!
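
Here is a small sketch of what testing those boundaries could look like for the examples above. The class name and the two validation rules are assumptions made up for illustration: a row count capped at the 250 rows mentioned above, and a filter that must not be blank so the query can never run unbounded.

```java
// Hypothetical example of input validation with explicit boundaries.
class RowLimit {
    static final int MAX_ROWS = 250; // page size taken from the example above

    // Validates a user-supplied row count before it reaches the query.
    static int validate(int requestedRows) {
        if (requestedRows < 0) {
            throw new IllegalArgumentException("row count must not be negative");
        }
        return Math.min(requestedRows, MAX_ROWS); // clamp oversized requests
    }

    // Rejects a missing or blank filter so the query never runs unbounded.
    static String requireFilter(String filter) {
        if (filter == null || filter.trim().isEmpty()) {
            throw new IllegalArgumentException("a filter is required");
        }
        return filter.trim();
    }

    public static void main(String[] args) {
        // The zero case, an in between case, the limit itself, and "infinity".
        int[] cases = {0, 100, MAX_ROWS, Integer.MAX_VALUE};
        for (int requested : cases) {
            System.out.println(requested + " -> " + validate(requested));
        }
    }
}
```

A test plan that exercises each of those four cases, plus the negative and blank inputs, covers the boundaries that the 80/20 crowd tends to skip.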

Wednesday, January 14, 2009

Winter Hibernation

I know. I've been lax on updating this site. Partly because of the online holiday shopping season (I'm always busy that time of the year). Partly because of winter (we got socked in with 5" of snow this morning).

But I will come out of my shell soon and start commenting on new things to cover for 2009. We have some interesting Process Server testing strategies to talk about too which I'm absolutely excited to share with you.

Today my focus is on getting my speaker proposals in for the IMPACT 2009 conference.