Wednesday, August 27, 2008

Disaster Recovery is not easy (just ask the FAA)

I commented on the flight plan problem yesterday, and it is interesting to see that the FAA actually had a Disaster Recovery (DR) plan and a running DR site in place. But it didn't work.

U.S. Airports Back to Normal After Computer Glitch - NYTimes.com
The other flight-plan facility in Salt Lake City had to handle the entire country when the Atlanta system failed but the backup system quickly overloaded, Brown said.
The interesting thing to point out here is that while the FAA had a DR plan in place, it seems no one looked at the capacity of the backup site. Was this bad planning? Bad execution? I wonder, did they ever test their DR environment? It seems kind of pointless to have spent the time and effort to build out the DR environment only to have it fail when it was needed. A waste of taxpayers' dollars IMHO.

Unfortunately, a DR site failing to accomplish what it was intended to do is more often the case than not. First off, building a DR environment is not easy. I know some pretty smart people who have gotten this wrong. Secondly, after the DR environment is built it must be tested, and tested as if a real disaster had occurred. The best analogy I can think of is testing your backups. Does anyone ever restore a server to see if the backups are good? I was once working with an organization whose servers crashed due to a hardware failure. They went to recover the servers and only then discovered the backups were corrupt. It took them a few days to rebuild the servers, so I got to go home.

On a tangent, seeing that this is a blog on WebSphere Application Server performance, the one mistake I have seen people make with DR from a WebSphere Application Server perspective is to have a cell cross data center boundaries (I'm sure I've blogged about this before, but here it goes again). The reason this is a mistake is that networks are not reliable. TCP/IP is not guaranteed, reliable delivery. Any hiccup (dropped packets, static electricity, or planetary alignment in the solar system) that causes even a slight degradation or lag in the network between the two data centers can wreak complete havoc within the cell. And guess how hard it is to troubleshoot that problem? Yeah, tough, real tough. Thus, even when a disaster is not occurring, strange "problems" can occur with the applications running in that cell that just cannot be easily explained. And the more distance you put between the data centers, the more likely it is that strange problems will occur.

Likewise, when a disaster occurs and half the cell disappears, this alone can cause other problems. For one, the application servers left running in the other data center will be looking for their lost siblings, spending time, CPU cycles, RAM, and network bandwidth on the search. This too affects the applications running in this configuration.

Therefore, the moral of the story is to never let the cell boundaries leave the data center. In fact, there are a number of reasons one should be running multiple cells within the same data center to better manage planned and unplanned maintenance, particularly in high volume, business critical, high availability environments.

Oh, that, and hire a really good DR expert if you're planning on implementing DR. There is nothing like having the DR plan fail. In the FAA's case there are no repercussions (i.e., fines imposed that cost them millions of dollars a minute). Granted, this probably did cost the airlines and the unfortunate passengers money, but nothing the FAA will have to reimburse. For a lot of private enterprises there could be severe repercussions, not just in terms of penalties/fines but in customer loyalty, customer trust, and how your enterprise is viewed as a reliable business partner going forward.

Network performance related issues: isolate the network

Over the years I've worked on a few problems revolving around the network itself. What I'd like to do first is throw down a gauntlet around performance testing environments: the #1 factor is to have an isolated network. This means that the only traffic going across the wires/routers/switches is traffic generated by the performance test.

For example, one time we had spent the day troubleshooting functional problems in the application. We finally got a good build and set up to start our baseline performance test later that night. We kicked off the test around 20h00, and about 2 hours into the test our response times were tanking into the several-seconds range. I could not find any problems on the application server or the Web server tier. A little digging around and I found out the load generators were placed in a location remote from the test environment. At around 22h00 network backups kicked off, as it was their policy to back up their servers at night. It's a good idea to run backups at night, but because our load generators were sharing the same network as the backup traffic, our response times went down the drain. We had to break off our test, and we all left the office around 23h00 when it was clear that it would take several hours to relocate the load generators onto the same switch as the isolated performance test environment.

There is nothing more disappointing than gearing up for a test in the evening only to have it be a waste of time (both mine and that of the good folks I was working with), much less keeping everyone away from their families. The reason the load generators were remotely located was that managers did not give the local team the time they needed to move them. As it was, we had to move them and take the hit to the testing schedule to do it.

Moral of the story: make sure every piece of the performance test environment is isolated from the rest of the organizational network. Otherwise the testing will be inconclusive.

Tuesday, August 26, 2008

Root cause analysis

It is vitally important when problems occur that the root cause is identified. If it isn't, the problem will reappear.

Though my guess is with the cutbacks in future flights we won't see this problem for a while.

FAA computer problems cause flight delays - CNN.com
The problem appeared similar to a June 8, 2007, computer glitch that caused severe flight delays and some cancellations along the East Coast.

Do you have the capacity for holiday shopping season 2008?

Ironic is probably not the right term. I believe there was once an incident in the air over Canada involving a similar situation, where the plane ran out of fuel because the person tanking up the plane was using one unit of measure and the person ordering the fuel used another.

That probably isn't the case for Amtrak. But this story brought to mind a question... do you have enough fuel (capacity) for the upcoming holiday shopping season? It is the end of August. By my rough estimate we probably have about another 2 months before the 2008 online holiday shopping season begins. If you haven't already run the numbers on your capacity I would highly recommend at least reviewing them right now. Be sure you have the horsepower to let your end users have a pleasant shopping experience.

Amtrak train runs out of fuel, stranded 2 hours - USATODAY.com
It was the little engine that couldn't — because it was thirsty for fuel.
Otherwise you could find your server environment under some pretty severe strain.

Test out the different GC algorithms

The latest versions of the IBM JVM provide a number of different Garbage Collection (GC) algorithms. Since no one algorithm is always the best to use, it is imperative that the performance test plan allow for testing that cycles through each of the GC algorithms, adjusts the parameters based on the verbose GC output, and sees if the results help improve the application's performance.

For example, I have noticed on a number of engagements around Process Server that using the gencon (generational garbage collection) policy has a positive effect on the server's performance. Obviously, the nursery and tenured spaces need to be tuned for each individual application based on the verbose GC output. But because Process Server always seems to run better with gencon, it is one of the first things I like to test out.
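To make that concrete, here is the sort of generic JVM argument combination I mean. This is only a sketch assuming a recent IBM JVM; the heap and nursery sizes are made-up placeholders, and the real values have to come from your own verbose GC analysis, not from this post.

-Xgcpolicy:gencon -Xms1024m -Xmx1024m -Xmn256m -verbose:gc -Xverbosegclog:gc.log

Then run the same test with the other policies (for example -Xgcpolicy:optthruput or -Xgcpolicy:optavgpause) and let the verbose GC output, not folklore, pick the winner.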

On the flip side, there are some tools out there that dynamically create object classes. My understanding is these are used for transformations, but what I don't get is why they need to be dynamically created. Most object transformations are from one static object model to another. Once the transform is generated, continually regenerating a duplicate of it makes no sense. Anyhow, I digress... the point I'm trying to make is that dynamically generating lots of object classes (oh, say about 10,000 per second!) can create serious havoc with the GC algorithms. In this particular case of 10k/sec it actually ends up looking like a native heap leak when using gencon. Thus if you're doing something like this (and I hope you're not, but that is a different posting) then you may have to look at one of the opt GC algorithms instead.


Monday, August 25, 2008

Operational stability

One aspect of performance problems involves runtime operations and stability from a runtime perspective, separate from application code or content changes. Below are some of the things to consider when building out a 24/7 site that needs to maintain availability even when things take a turn for the worse. I assume you have already built out multiple cells for the deployment of your high availability environment. I prefer cells over clusters because you have more operational/runtime flexibility with cells than you do with clusters, particularly when you are trying to apply maintenance (i.e., shut off the load balancing to the cell that is going to be updated).

1. Ability to make repeatable changes to the configuration. This involves scripting and testing the scripts (preferably not in production) to be sure they work as intended.

2. Identify the active configuration. It is important to know which configuration is active so that the correct cells are taken out of rotation.

3. Make the scripts aware of the active configuration. One really doesn't want to have scripts making changes to the active configuration by mistake.

4. Back out. The specifics depend on your requirements, but you want to be able to flip back to a previously working configuration as quickly as possible to minimize downtime.

Wednesday, August 20, 2008

fn:toLowerCase

Ran into a strange situation where starting up the app server resulted in an exception. The error manifests itself as the container thinking there is no function mapped to the name fn:toLowerCase. This started happening after fixpack 19 was applied over fixpack 11. The long and short of it is that the solution was to clear out the tmp files so the JSPs were recompiled after applying fixpack 19.

00000042 ServletWrappe E   SRVE0068E: Could not invoke the service() method on servlet /global/tiles/geturl.jsp. Exception thrown : javax.servlet.ServletException: No function is mapped to the name "fn:toLowerCase"
        at org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:650)
        at com.ibm._jsp._geturl._jspService(_geturl.java:192)
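In case it helps, a note on what "clear out the tmp files" typically means on WebSphere Application Server; this is my usual approach, not necessarily the exact steps taken in this incident, and the path below is only illustrative. With the server stopped, remove the compiled JSP artifacts under the profile's temp directory for the affected server so they get regenerated against the new fixpack level on the next request:

profile_root/temp/node_name/server_name/application_name

Check your own profile location first, since these paths vary by install.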

Monday, August 18, 2008

Collect traces when having problems with the JDBC connection pool

The technote referenced below provides instructions on how to collect JDBC traces when encountering certain JDBC connection pool problems.

0000006b FreePool E J2CA0045E: Connection not available while invoking method createOrWaitForConnection for resource jdbc/abcd.

If you see the above error message in your logs, we really need to collect JDBC traces. The traces will tell us why we are seeing this error. Is it long-running transactions? Not a high enough maximum? Connections not properly closed? Who knows... the trace knows. Collect the trace and analyze the problem, or open a PMR. Don't just blindly increase the connection pool maximum without knowing why it needs to be increased.
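As an aside, one of the most common culprits behind J2CA0045E that I see is application code that only closes the connection on the happy path, which slowly starves the pool. Below is a minimal sketch of the pattern I mean; the jdbc/abcd JNDI name comes from the error message above, while the SQL and class name are purely illustrative.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public class ConnectionUsageSketch {
    public int readValue() throws NamingException, SQLException {
        DataSource ds = (DataSource) new InitialContext().lookup("jdbc/abcd");
        Connection conn = null;
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            conn = ds.getConnection();
            stmt = conn.prepareStatement("SELECT 1 FROM SYSIBM.SYSDUMMY1");
            rs = stmt.executeQuery();
            return rs.next() ? rs.getInt(1) : -1;
        } finally {
            // Close in reverse order; closing the connection returns it to the free pool.
            if (rs != null) { try { rs.close(); } catch (SQLException ignored) { } }
            if (stmt != null) { try { stmt.close(); } catch (SQLException ignored) { } }
            if (conn != null) { conn.close(); }
        }
    }
}

If the close only happens inside the try block, any exception along the way leaks a connection, and under load the pool hits its maximum and starts throwing the error above.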

IBM - Using Connection information in WebSphere trace files to troubleshoot J2CA0045E and J2CA0020E or connection wait time-out problems.
J2CA0045E and J2CA0020E errors can be caused by many problems. They are showing a time-out condition where a resource or a managed connection is not available to fulfill a connection request.

In this technote we will use connection information in WebSphere® trace files to troubleshoot J2CA0045E and J2CA0020E or connection wait time-out problems.

Saturday, August 16, 2008

3 days of not selling burgers

Wow, a 3-day outage at Netflix. I know this is a blog about performance (specifically around WebSphere Application Server), but I have a fascination with production outages because I normally work those kinds of problems. They always revolve around human error, either from fumble-fingering something or from not executing (like not conducting performance testing).

But a three day outage is excessive. That must have been a real interesting problem because I've never seen an outage that long (at least not after I have arrived to fix it). And it'll cost Netflix an estimated $6 million.

As the article states, imagine if McDonald's couldn't sell burgers for 3 days. BTW, the McD's in Dwight, IL off I-55 lost their cash registers one day a couple of weekends ago when I stopped in to get some food on my long drive to Chicago. It was interesting to see how dependent McD's is on that register system, because it drives their whole operation, and to see the human equivalent in action: yelling orders into the back area and using calculators and paper/pen to record how many of each item was sold. Needless to say, the credit card readers were down too, so if you didn't have cash you were going hungry that day.

Lessons From Netflixs Fail Week - Bits - Technology - New York Times Blog
Netflix, the DVD-by-mail service, largely ceased shipping DVDs to its 8.4 million subscribers for three days this week. The company vaguely blames a technology glitch.

Wednesday, August 13, 2008

One reason I like to take thread dumps during performance testing

I've written before about thread dumps and the value of taking them during a performance test. The other day we finished our baseline testing, so we started a duplicate test simply for taking thread dumps. Even though I wasn't seeing any anomalies in the baseline, this is just something I do to make sure I dot all my i's and cross my t's. Plus, if there are any anomalies they will show up in the thread dumps.

Lo and behold, in the thread dumps (remember, we take at least 3 thread dumps spread at least 2 minutes apart) I found a number of threads sitting on

at java/net/Inet6AddressImpl.lookupAllHostAddr(Native Method)

which seemed odd to me. I live by the rules of mathematics and its definition of randomness. A random thread dump taken at any random point in time should show the threads doing different, random things in each dump. If one thread dump in the series shows a couple of threads doing the same thing, then that is odd. If more than one thread dump in the series shows more than one thread doing the same thing, then we have a bottleneck! Bottlenecks can limit an application's ability to use CPU and keep the throughput down. If you can't fix the bottleneck then you'll need more hardware to scale up, which means spending more money. If you can afford that then stop here and call your finance guy.
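To make the "same thing in multiple dumps" check less tedious, here is a minimal sketch of the kind of scan I mean. It simply counts how often each stack frame line shows up across a set of thread dump files passed on the command line; the threshold is arbitrary and the parsing is deliberately naive (it assumes frames that start with "at " like the one quoted above; IBM javacore lines carry a 4XESTACKTRACE prefix that would need to be stripped first).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Count occurrences of "at ..." stack frames across several thread dumps.
// Frames that keep showing up dump after dump are bottleneck candidates.
public class FrameCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String file : args) {
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String frame = line.trim();
                    if (frame.startsWith("at ")) {
                        Integer n = counts.get(frame);
                        counts.put(frame, n == null ? 1 : n + 1);
                    }
                }
            } finally {
                in.close();
            }
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= 3) { // arbitrary threshold: seen in most/all dumps
                System.out.println(entry.getValue() + "  " + entry.getKey());
            }
        }
    }
}

Anything that floats to the top of that output, like the lookupAllHostAddr frame above, is worth a closer look.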

I searched the PMR database and found that there is indeed an interesting side effect to IPv6, and it was affecting the throughput of the application I was testing! Fortunately the PMR referenced a technote on the subject, and I'm hoping we can eliminate this issue. The good thing that will come out of this is that we will see a throughput improvement in the application once we apply the proper configuration. The improved throughput will mean an immediate cost savings in the additional hardware we would have had to purchase to make up for the differential. A win-win for everyone (well, except for the IBM hardware sales folks, but c'est la vie!).

In this particular multi-tiered environment (Process Server talking to WebSphere Application Server talking to CICS) I still have to go back and collect thread dumps on Process Server once I'm satisfied I am not seeing any other issues in the WAS tier. Who knows what anomalies that will uncover (hopefully none, so this testing can wrap up soon).

IBM - HostName lookup causes a JVM hang or slow response
If the DNS is not setup to handle IPv6 queries properly, the application must wait for the IPv6 query to time out.
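For what it's worth, the usual mitigation I reach for in this situation, which is an assumption on my part and not necessarily the exact fix in the technote or PMR, is to tell the JVM to prefer the IPv4 stack so the IPv6 lookup (and its time-out) never happens, via a generic JVM argument:

-Djava.net.preferIPv4Stack=true

Obviously, test it first, and skip it entirely if your environment actually needs IPv6.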

Test early - Test often

There is nothing I can stress more (and I'm sure I've done it before, but I'll do it again) than to start performance testing with the first build of any application. If there is one thing that 20-30 years of computer science has taught us, particularly in the past decade of the .com boom, it is that performance testing cannot be crammed into just the last few weeks of an application's lifecycle.

Performance testing is where the application's warts and ugliness are exposed. It takes time to troubleshoot and solve each problem as they are encountered. A lot of the problems will entail code changes that then have to be re-tested functionally. I've been on some performance engagements that have lasted months due to the sheer number of application code defects that had to be corrected.

Managers have to understand that software will always have bugs and will definitely have performance issues. This is a fact of life. No single person has the brain power to take Java code (or any other language of choice), compile it, and load test it in their head. So do the right thing. Take each application build (or weekly release candidate if you do daily builds) and start performance testing from the start. This way, as each week progresses, you'll be able to determine whether the application is meeting its expected performance and non-functional requirement objectives. You'll also be able to see whether the application is improving or degrading each week.

Finally, with continuous performance testing there won't be any surprises in the last few weeks leading up to the "go live" date in production.

Monday, August 11, 2008

JVM tuning - too little heap

I'm working with an application this week that has been given a max heap of 256M. This is causing GC to occur every 250-400ms, with each GC consuming about 35-40ms. In addition, the tenured space is down to 7% free. Think we need a higher maximum JVM heap?
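To put some rough numbers on that (simple arithmetic, assuming the 250-400ms interval is measured from the start of one collection to the start of the next):

35ms / 400ms ≈ 9% of wall-clock time in GC (best case)
40ms / 250ms = 16% of wall-clock time in GC (worst case)

Spending roughly a tenth or more of the time in garbage collection, with the tenured space nearly full, is a pretty clear sign the heap is undersized for this workload.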

Thursday, August 7, 2008

More outages making the news

I noticed yesterday that American Airlines went down. There was discussion about it at www.flyertalk.com, and I then found this article about a Google outage. This just goes to show folks how hard it really is to provide reliable Internet services to the masses. While one can only speculate about how either of these outages occurred, it is clear that even the big guys have a hard time doing a hard thing.

Google Gmail, Google Apps Outage in the Cloud
The search company spends billions of dollars on servers that can support the services its millions of Gmail and Apps users require;