Saturday, November 8, 2008

Less is more - shared resources and MAX sizings

Less is more. I don't know who to credit for that saying, but when it comes to shared resource pools (e.g. thread and JDBC connection pools) it is the motto to live by. If it takes more than 40 threads to drive the CPU to 100%, the performance expert needs to find the bottleneck in the application. Think the application needs more than 10 connections to a database? Think again. Too often I come across situations where the application is pegging the CPU and/or the database is so overwhelmed in production that we can't bring it down to fail over. Next time the team wants to change the parameters of a shared resource pool, encourage them to make the pool smaller. They'll look at you as if you're crazy and think you'll be the next one to lose your job ... but you won't when they find your solution was actually the correct way to go.
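
In WebSphere these pools are sized through configuration rather than code, but the idea can be sketched in plain Java. The class name SmallPool and the sizes used here are assumptions made up for illustration, not recommendations: a fixed, small number of threads with a bounded queue, so that excess work slows the caller instead of piling up unseen.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: builds a deliberately small, bounded worker pool.
class SmallPool {
    // maxThreads and queueCapacity are illustrative values supplied by the caller.
    static ThreadPoolExecutor create(int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,                      // fixed size: core == max
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),     // bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy());  // when full, the submitter does the work
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = create(10, 50);
        for (int i = 0; i < 200; i++) {
            pool.execute(() -> Math.sqrt(42.0)); // stand-in for real work
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("largest pool size seen: " + pool.getLargestPoolSize());
    }
}
```

If a pool like this cannot keep the CPU busy, that is a hint to go looking for the bottleneck rather than to raise the maximum.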

Speaking of which, Alex has roadmaps he didn't work on this past week (well, I did have eye surgery and a hard disk crash both within days of each other).

Thursday, November 6, 2008

Why Alex needs a backup

Because Alex's hard disk failed. Alex actually does have a backup of his data but not of all the applications, so it is going to be quite a bit of effort to get a C drive with what I need on it. That said, I'll start with a fresh Windows XP image and may have a lot better performance now!

Tuesday, November 4, 2008

Disaster Recovery - why you need a plan and execute on it

Problems happen.  Fiber cuts can occur out in the wild.  Hardware fails.  Routers go down.  

The last thing you want is to have no backup.  Put the backup in a separate data center facility on separate ISP inputs.  Have multiple electrical feeds into each data center.  

Otherwise you will have a site outage and there will be nothing that can be done to fix it because you are waiting on someone else who you have no control over to fix the problem.  

Tuesday, October 28, 2008

Save those (test) results and logs

Organizations typically have well defined archive policies for the production logs and data.  However, a lot of test environments are ignored.  There is nothing more frustrating than trying to look at test results from a couple of weeks ago only to find the logs are gone.  Then any sort of analysis (e.g. pulling the verbose GC data from native_stderr) cannot be conducted.  That means the test was inconclusive.

If your organization does not have archival procedures and processes for the test environments (at least the performance one) make sure that it gets on someone's agenda.

Friday, October 24, 2008

Coding for resiliency : long running (or infinite) loops : stop the madness

....
while (exception instanceof MyException) {
    // the cast changes nothing, so the condition stays true and the loop never exits
    exception = (MyException) exception;
}
....

This was code (not verbatim) that was found in a production application.  Please, always have a valid exit condition for your loops.  And then have a second way to exit the loop in case the primary condition never becomes false.
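
Here is a minimal sketch of a loop with both exits. It is not the production code in question; the RootCause class and the MAX_DEPTH limit are assumptions made up for illustration. The loop walks an exception's cause chain and stops either when there is nothing left to unwrap or after a fixed number of steps.

```java
// Hypothetical helper: finds the root cause of an exception without risking an endless loop.
class RootCause {
    static final int MAX_DEPTH = 100; // arbitrary safety limit (an assumption)

    static Throwable find(Throwable t) {
        Throwable current = t;
        int depth = 0;
        // Primary exit: there is no further cause to unwrap.
        while (current.getCause() != null) {
            // Secondary exit: give up after MAX_DEPTH steps, e.g. on a cyclic cause chain.
            if (++depth > MAX_DEPTH) {
                break;
            }
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        Exception wrapped = new RuntimeException("outer", new IllegalStateException("inner"));
        System.out.println(find(wrapped).getMessage()); // prints "inner"
    }
}
```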


Coding for Resiliency : thread locals and cleaning them up in a finally block

I haven't updated this site because I had a brief vacation that was cut short by a blocked tear duct.  I'm still suffering with this and the medication seems to have alleviated but not rectified the problem.  Unfortunately that means going to an eye doctor on Monday if it hasn't cleared.

A colleague of mine is working on a problem where they plan to use a thread local within a servlet filter.  When I read this it reminded me of another problem: uncleaned thread local variables.  The key here is to be sure to clear out the thread local in a finally block before exiting the servlet filter.  This is because it is very likely that a thread in WebSphere Application Server will be reused by subsequent requests (especially in high volume environments).  If the thread local is not properly cleaned up when an exception occurs then the next request on the same thread will be looking at a thread local in an invalid state.  So use those finally blocks for what they were meant for!
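
A minimal sketch of the pattern, assuming a made-up UserContextFilter class. A real filter would implement javax.servlet.Filter; here a plain Runnable stands in for the filter chain so the example runs on its own, and a single-thread executor forces the thread reuse that an application server's thread pool would give you.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-in for a servlet filter; the names here are hypothetical.
class UserContextFilter {
    // Per-request state kept in a thread local.
    static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    // 'chain' plays the role of FilterChain.doFilter(...).
    static void doFilter(String user, Runnable chain) {
        CURRENT_USER.set(user);
        try {
            chain.run();
        } finally {
            CURRENT_USER.remove(); // runs on success and on exception
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService oneThread = Executors.newSingleThreadExecutor(); // forces thread reuse
        oneThread.execute(() -> {
            try {
                doFilter("alice", () -> { throw new IllegalStateException("request failed"); });
            } catch (IllegalStateException expected) {
                // the container would log this
            }
        });
        // The second "request" runs on the same thread and must not see "alice".
        oneThread.execute(() ->
                System.out.println("next request sees: " + CURRENT_USER.get()));
        oneThread.shutdown();
        oneThread.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Take the finally block out and the second request prints "alice" instead of null.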

Thursday, October 9, 2008

HTTP Session failover and performance

One of my colleagues wrote a good article that covers HTTP session failover.  There are performance implications to using HTTP sessions and persisting them.  This spurs me to start a series of articles to discuss some of the pros/cons with these mechanisms. 

Before session persistence existed there was no failover capability for sessions.  This is, though, probably the most performant deployment.  Sure, users have to log back in but the application server doesn't have to spend time on serializing objects and then pushing them out to a database.  You save both on memory (for the serialized objects) and on the CPU that would be spent garbage collecting them.

If your business requirements can stand it, and requiring users to log back in after a failure is acceptable, then this is a good deployment strategy.  I know a number of intradepartmental applications that can tolerate such activity.

The WebSphere Contrarian: Back to basics: Session failover
My preferred alternative is to rely not on session distribution, but instead to rely simply on HTTP server plug-in affinity to “pin” a user to an application server, although this does mean that stopping an application server JVM will result in the loss of the HttpSession object. The benefit of doing so is that there is no need to distribute the session objects to provide for HttpSession object failover when an application server fails or is stopped. The obvious down side is that a user will lose any application state and will need to log back in and recreate it, and this may or may not be acceptable for your application or business requirements. I'll mention that I’ve worked with a number of customers that in fact agree with this view and make this their standard practice.

Wednesday, October 8, 2008

pClusters on JDK 1.4.2

When enabling pClusters on the 1.4.2 JDK be sure to also set the GC policy to subpool (-Xgcpolicy:subpool), otherwise it won't work as expected.

Thursday, September 25, 2008

Redundancy and unreliable software

Redundancy is often a great thing to have and has saved many a production environment. However, redundancy only works if the underlying hardware and software work. Making an unreliable environment redundant simply means that the redundant system is unreliable too. Failing over from one failing environment to another that will fail is not going to help availability.

Therefore, the best thing to do is to make sure that the system being copied is reliable in and of itself.

Thursday, September 18, 2008

Deploying with Resiliency - shared libraries testing

I've been talking about Coding for Resiliency but figured I should do a little focusing on Deploying with Resiliency today.

Anytime one pushes out a new release candidate it should be tested in the same configuration as it will be in the production environment.  This means that any shared libraries must be deployed out with the applications that are using them when multiple applications share the same app server.  

The best that can happen is that the production environment will remain stable as long as what was tested is what is in production.  However, the worst that can happen (and it will) is the site will become unstable.  That is not the end of it though.  Whatever deployment was put in place has to be backed out (here is where multiple cells really help you out and if you don't know what I'm talking about go to this great article by my colleague Peter) and the previous deployment has to be re-established in production.  The decision to revert also has to be made.  This is, depending on the people running production, a difficult decision to make.  It is even more difficult if a back-out cell is not available.

Therefore, depending on when it is discovered that the production site is unstable and the decision to back out is finalized it can be several hours before production can be stabilized.  

This is one reason why it is important to test prior to deployment in production and that "testing in production" is never a good idea.  

Friday, September 12, 2008

Code for resiliency - validate input

Scenario 1
I was speaking with one of my colleagues about problems they were seeing with an application. It seems that if the user typed alphabetic characters where a number was expected the application choked, died and rolled over.

Solution
I was amazed. "Don't they validate their input?" I asked. Surprisingly (or not, depending on your viewpoint), they didn't.

Always, always validate user input. If expecting a name make sure it doesn't contain characters you don't normally see in names (e.g. Joe123). If expecting a zip code make sure it is formatted properly. But never just take the input from the user and pass it on to the next layer.

In addition, validating input will help prevent XSS attacks.
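
A rough sketch of what I mean. The InputValidator class, its patterns and the quantity limit are all assumptions made up for illustration, and a real application would need rules that fit its own data.

```java
import java.util.regex.Pattern;

// Hypothetical validators; the patterns are illustrative, not a complete rule set.
class InputValidator {
    // Five digits, optionally followed by a dash and four more digits.
    private static final Pattern US_ZIP = Pattern.compile("\\d{5}(-\\d{4})?");
    // Starts with a letter, then letters, spaces, apostrophes, periods or hyphens.
    private static final Pattern NAME = Pattern.compile("[\\p{L}][\\p{L} '.-]{0,99}");

    static boolean isValidZip(String s) {
        return s != null && US_ZIP.matcher(s).matches();
    }

    static boolean isValidName(String s) {
        return s != null && NAME.matcher(s).matches();
    }

    // Returns the parsed quantity, or -1 if the input is not a whole number in [1, max].
    static int parseQuantity(String s, int max) {
        if (s == null) {
            return -1;
        }
        try {
            int n = Integer.parseInt(s.trim());
            return (n >= 1 && n <= max) ? n : -1;
        } catch (NumberFormatException e) {
            return -1; // "abc" lands here instead of blowing up a lower layer
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidName("Joe123"));     // false
        System.out.println(isValidZip("60601-1234"));  // true
        System.out.println(parseQuantity("abc", 100)); // -1
    }
}
```

The point is that the rejection happens at the edge, before the value is handed to the next layer.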

Scenario 2
You're running a B2x Web service and have complex XML documents. But modern day parsers are vulnerable to malformed XML that can contain circular references.

Solution
Acquire DataPower and have it front your Web services. DataPower has incredibly good XML validation and is able to prevent DoS attacks from malformed XML.

Thursday, September 11, 2008

This one keeps on going and going

I blogged about this problem a year and a half ago.  It continues to live on.  If you are writing non-batch applications then set this setting to unshareable.  Yes, it is unfortunate the default is shareable but we need to deal with it.  This descriptor seems to move around so look for it either in the JDBC or thread section of the descriptors. 

Alexandre Polozoff on WebSphere Performance: Day One SHAREABLE JDBC connection setting
change the descriptor to UNSHAREABLE

Code for resiliency - circuit breakers in loops

Scenario
One common problem I run into on different critsits is runaway applications. You know the type. The application is sitting there in a tight loop allocating 1 or 2 MB of data on each iteration, chewing up CPU and giving the garbage collector fits. Sometimes the code is buggy. Other times the data has changed and has pointers to larger data sets than it should. A few times a new use case was discovered.

Runaway loops in an online, web application are a bad thing. If enough users go down the same use case the application can have several threads all running away. The CPU will be pegged at or close to 100% and the verbose GC data shows tons of memory being allocated at very fast rates because the garbage collector is busy collecting it!

As more users go down the same use case more and more threads get hung up on different application servers in the farm. Pretty soon, you're recycling application servers as it is the only administrative solution for this problem. The problem with recycling is you are also dumping users who have good transactions in progress. Because the code has run away there are no other options. Hopefully those users will retry their transactions when they fail and not go shopping somewhere else.

Solution
Pick some arbitrary count limit as a circuit breaker. Once the count limit is reached the loop exits. In addition, when the circuit breaker is activated the application should log that this occurred and dump some of the relevant data around that point. This way when a problem occurs someone can be alerted that the circuit breaker was activated and examine the problem.

I would implement the loop exit by throwing an exception (this is an exceptional case right?) from an if statement at the beginning of the loop. Let the exception handler dump the data to the log. Make sure something is monitoring the log for the exception and that someone is alerted to examine the problem.

Resiliency in code is not difficult. But you do have to anticipate that the worst will happen. Every loop has to have a way to abort if it has executed an unreasonable number of times. Otherwise production can be very unstable.
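
The solution above can be sketched roughly as follows. The class names, the limit of 1000 and the use of java.util.logging are assumptions made up for illustration; the idea is an if statement at the top of the loop that throws once the count limit is hit, and a handler that logs what happened.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical exception type signalling that a loop ran past its sanity limit.
class CircuitBreakerException extends RuntimeException {
    CircuitBreakerException(String message) { super(message); }
}

class CatalogScanner {
    private static final Logger LOG = Logger.getLogger(CatalogScanner.class.getName());
    static final int MAX_ITERATIONS = 1000; // arbitrary limit, set above the expected data size

    // Counts items, aborting if the iterator yields more than MAX_ITERATIONS elements.
    static int countItems(Iterator<String> items) {
        int count = 0;
        while (items.hasNext()) {
            if (count >= MAX_ITERATIONS) { // breaker check at the top of the loop
                throw new CircuitBreakerException("loop exceeded " + MAX_ITERATIONS
                        + " iterations; next item=" + items.next());
            }
            items.next();
            count++;
        }
        return count;
    }

    // Returns false if the request had to be aborted; the log entry is what monitoring alerts on.
    static boolean handleRequest(Iterator<String> items) {
        try {
            System.out.println("items: " + countItems(items));
            return true;
        } catch (CircuitBreakerException e) {
            LOG.log(Level.SEVERE, "circuit breaker activated", e);
            return false;
        }
    }

    public static void main(String[] args) {
        handleRequest(Collections.nCopies(800, "item").iterator());  // under the limit
        handleRequest(Collections.nCopies(5000, "item").iterator()); // trips the breaker
    }
}
```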

Unreasonable Arguments Opposed
While I tend to think I'm fairly even handed, a number of people have told me this is a stupid idea. Well, let me tell you why it isn't. If the circuit breaker is hit then you know you are already in an invalid state. Regardless of the reason, continuing to process the request is pointless because something is not right. If we only have 800 items in a product catalog then looping 1,000 times means something is wrong. Therefore there has to be a way to abort the request before we degrade production. As for that user's request, there is no way they will be able to complete it. However, since the circuit breaker was activated, data was logged and someone was alerted, not only do we know that something happened but someone is now actively working the problem to figure out why the breaker was activated.

What about transaction integrity people ask? Well, if you handle exceptions properly (and that'll be another CfR article) then you won't have any problems with transaction boundaries and everything will be rolled back appropriately.

What about open locks? Again, if you have coded your exception handling properly then all locks should be released.

What about open JDBC connections? Again, if you have coded your exception handling properly then all JDBC connections should be released in the finally block, no?
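
A minimal sketch of the lock and connection answers, assuming made-up names. FakeConnection stands in for a pooled JDBC connection so the example runs without a database, and the thrown exception stands in for an aborted request.

```java
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for a pooled JDBC connection; a real java.sql.Connection is also AutoCloseable.
class FakeConnection implements AutoCloseable {
    boolean closed = false;
    @Override public void close() { closed = true; }
}

class CleanupOnAbort {
    static final ReentrantLock LOCK = new ReentrantLock();

    // Simulates a request that is aborted part-way through (e.g. by a circuit breaker).
    static void doWork(FakeConnection conn) {
        LOCK.lock();
        try {
            try {
                throw new IllegalStateException("request aborted");
            } finally {
                conn.close(); // connection goes back to the pool
            }
        } finally {
            LOCK.unlock(); // lock is released even though an exception is propagating
        }
    }

    public static void main(String[] args) {
        FakeConnection conn = new FakeConnection();
        try {
            doWork(conn);
        } catch (IllegalStateException e) {
            // prints "closed=true locked=false"
            System.out.println("closed=" + conn.closed + " locked=" + LOCK.isLocked());
        }
    }
}
```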

Monday, September 8, 2008

5 9s is not easy, it can be done but you have to know what you're doing

Continuing the coverage of production outages that make the news it seems that the London Stock Exchange had a serious outage today. Most likely due to the high volume resulting from the US Government takeover of Freddie & Fannie.

We at IBM often put together high volume environments that have high availability requirements. In order to do this, and do it well, one has to make sure they have built in resiliency, enough capacity and then disaster recovery for business continuity. I've worked with a number of household name companies, world wide, on providing just such capabilities. It is disastrous when an e-commerce retailer is unable to sell product during the holiday shopping season. Things can get particularly bad for financial institutions when money is on the line.

I can't say what they did or didn't do but it certainly seems like people want answers. Reassurances are going to be hard to come by until they do a lot more ground work.

London Stock Exchange crippled by system outage | Reuters
The exchange would not say whether volume was the issue and declined to give details on what had caused the problem. But angry customers were demanding an explanation.

"We want answers as to how this happened in the first place and reassurances that it will not happened again," said Angus Rigby, chief executive of brokerage TD Waterhouse.

Friday, September 5, 2008

debug logging has no place in production

You know, I'm often confronted with this problem and I have yet to really understand why it exists. Lots of people are running applications in production with logging set to the debug level. From a performance perspective this is intolerably horrible! You're constantly hitting disk, garbage collecting spurious strings and serializing threads around the disk access. It makes no sense to me that I keep getting javacores with stack traces in logger.debug (or worse, System.out.println!). Speaking of println, one should never, ever use println for logging. There is no way to control it like a logger. Since JDK 1.4 there has been a logger in the JDK (java.util.logging). Use it! And set your log level to WARN or ERROR (WARNING or SEVERE in java.util.logging) but nothing more granular than that unless you're debugging a problem.
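
A minimal sketch using the JDK logger. The OrderService class is made up for illustration, and in an application server the level would normally be set through the server's logging configuration rather than in code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical service class used only to show the logging calls.
class OrderService {
    private static final Logger LOG = Logger.getLogger(OrderService.class.getName());

    static void configureForProduction() {
        LOG.setLevel(Level.WARNING); // FINE and INFO messages are now dropped
    }

    static boolean debugEnabled() {
        return LOG.isLoggable(Level.FINE);
    }

    static String describe(int orderId) {
        // Guard so the message string is only built when debug-level logging is on.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing order " + orderId);
        }
        if (orderId < 0) {
            LOG.warning("rejected order with negative id " + orderId);
            return "rejected";
        }
        return "accepted";
    }

    public static void main(String[] args) {
        configureForProduction();
        System.out.println(describe(42)); // no debug output, no string built
        System.out.println(describe(-1)); // the warning still gets logged
    }
}
```

Unlike println, the level can be turned down to FINE for an afternoon of debugging and back up again without touching the code.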

Thursday, September 4, 2008

tprof is your friend

I've collected data on various 100% CPU problems and am always amazed at how useful tprof data is, yet collecting it is very unintrusive.

IBM - MustGather: 100% CPU usage on AIX
Collecting data for 100% CPU usage problems with IBM® WebSphere® Application Server on the AIX® operating system. Gathering this information before calling IBM support will help familiarize you with the troubleshooting process and save you time.

Wednesday, August 27, 2008

Disaster Recovery is not easy (just ask the FAA)

I commented on the flight plan problem yesterday and it is interesting to see that the FAA actually had a Disaster Recovery (DR) plan and running DR site in place. But it didn't work.

U.S. Airports Back to Normal After Computer Glitch - NYTimes.com
The other flight-plan facility in Salt Lake City had to handle the entire country when the Atlanta system failed but the backup system quickly overloaded, Brown said.
The interesting thing to point out here is that while the FAA had a DR plan in place it seems no one looked at the capacity of the backup site. Was this bad planning? Bad execution? I wonder, did they ever test their DR environment? It seems kind of pointless to have spent the time and effort to build out the DR environment only to have it fail when it was needed. A waste of taxpayers' dollars IMHO.

Unfortunately a DR site failing to accomplish what it was intended to do is more often the case than not. First off, building a DR environment is not easy. I know some pretty smart people that have gotten this wrong. Secondly, after the DR environment is built it must be tested, and tested as if a real disaster has occurred. The best analogy I can think of is testing your backups. Does anyone ever restore a server to see if the backups are good? I once was working with an organization that had their servers crash due to a hardware failure. They went to recover the servers and only then discovered the backups were corrupt. It took them a few days to rebuild the servers so I got to go home.

On a tangent, seeing that this is a blog on WebSphere Application Server performance, the one mistake I have seen people make with DR from a WebSphere Application Server perspective is to have a cell cross data center boundaries (I'm sure I've blogged about this before but here it goes again). The reason this is a mistake is that networks are not reliable. TCP/IP does not guarantee timely delivery. Any hiccup, which can include dropped packets, static electricity or planetary alignment in the solar system, that causes even a slight degradation or lag in the networks between the two data centers can wreak complete havoc within the cell. And guess how hard it is to troubleshoot that problem? Yeah, tough, real tough. Thus, even when a disaster is not occurring strange "problems" can occur with the applications running in that cell that just cannot be easily explained. And the more distance you put between the data centers the more likely it is that strange problems will occur.

Likewise, when a disaster occurs and half the cell disappears this alone can cause other problems. For one, the application servers left running in the other data center will be looking for their lost siblings and spending time, CPU cycles, RAM and network bandwidth searching. This too affects the applications running in this configuration.

Therefore, the moral of the story is to never have the cell boundaries leave the data center. In fact, there are a number of reasons one should be running multiple cells within the same data center to better manage planned and unplanned maintenance, particularly in high volume, business critical, high availability environments.

Oh, that, and hire a real good DR expert if you're planning on implementing DR. Nothing like having the DR plan fail. In the FAA's case there are no repercussions (i.e. fines imposed that cost them millions of dollars a minute). Granted, this probably did cost the airlines and the unfortunate passengers money but nothing the FAA will have to reimburse. For a lot of private enterprises there could be severe repercussions not just in terms of penalties/fines but customer loyalty, customer trust and how your enterprise is viewed as a reliable business partner going forward.

Network performance related issues: isolate the network

Over the years I've worked on a few problems revolving around the network itself. What I'd like to do is first throw down a gauntlet around performance testing environments: the #1 factor is to have an isolated network. This means that the only traffic going across the wire/routers/switches is traffic related to the performance test.

For example, one time we had spent the day troubleshooting functional problems in the application. We finally got a good build and set up to start our baseline performance test later that night. We kicked off the test around 20h00 and about 2 hours into the test our response times were tanking into the several seconds range. I could not find any problems on the application server or the Web server tier. A little digging around and I found out the load generators were placed in a location remote from the test environment. At around 22h00 network backups kicked off as it was their policy to backup their servers at night. Good idea to run backups at night but because our load generators were sharing the same network their response times went down the drain. We had to break off our test and we all left the office around 23h00 when it was clear that it would take several hours to relocate the load generators onto the same switch as the isolated performance test environment.

There is nothing more disappointing than gearing up for a test in the evening only to have it be a waste of time (both mine and that of the good folks I was working with), not to mention keeping everyone away from their families. The reason the load generators were remotely located was because managers did not give the local team the time they needed to move them. As it was, we had to move them and had to take the hit to the testing schedule to do it.

Moral of the story, make sure every piece of the performance test environment is isolated from the rest of the organizational network. Otherwise testing will be inconclusive.

Tuesday, August 26, 2008

Root cause analysis

It is vitally important when problems occur that the root cause is identified. If it isn't, the problem will reappear.

Though my guess is with the cutbacks in future flights we won't see this problem for a while.

FAA computer problems cause flight delays - CNN.com
The problem appeared similar to a June 8, 2007, computer glitch that caused severe flight delays and some cancellations along the East Coast.

Do you have the capacity for holiday shopping season 2008?

Ironic is probably not the right term. I believe there was a problem in the air once over Canada over a similar situation where the plane was running out of fuel because the guy tanking up the plane was using one unit of measure and the person ordering the fuel used another.

That probably isn't the case for Amtrak. But this story brought to mind a question... do you have enough fuel (capacity) for the upcoming holiday shopping season? It is the end of August. By my rough estimate we probably have about another 2 months before the 2008 online holiday shopping season begins. If you haven't already run the numbers on your capacity I would highly recommend at least reviewing them right now. Be sure you have the horsepower to let your end users have a pleasant shopping experience.

Amtrak train runs out of fuel, stranded 2 hours - USATODAY.com
It was the little engine that couldn't — because it was thirsty for fuel.
Otherwise you could find yourself having some pretty severe strain in the server environment.

Test out the different GC algorithms

The latest versions of the IBM JVM provide a number of different Garbage Collection (GC) algorithms. Since no one algorithm is always the best to use it is imperative that the performance test plan allows for testing that cycles through each of the GC algorithms, adjusts the parameters based on the verboseGC output and sees if the results help improve the application's performance.

For example, I have noticed on a number of engagements around Process Server that using the gencon (generational) GC policy has a positive effect on the server's performance. Obviously tuning the nursery and tenured spaces is necessary for each individual application, based on the verbose GC output. But because Process Server always seems to run better with gencon, it is one of the first things I like to test out.

On the flip side, there are some tools out there that dynamically create object classes. My understanding is these are used for transformations but what I don't get is why they need to be dynamically created. Most object transformations are from one static object model to another. Once the transform is generated, continually generating another duplicate object makes no sense. Anyhow, I digress... the point I'm trying to make is that dynamically generating lots of object classes (oh, say about 10,000 per second!) can create serious havoc with the GC algorithms. In this particular case of 10k/sec it actually ends up looking like a native heap leak when using gencon. Thus if you're doing something like this (and I hope you're not but that is a different posting) then you may have to look at one of the opt GC policies (optthruput or optavgpause) instead.
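
When cycling through GC policies it helps to record, alongside each run's results, what the collectors actually did. The sketch below is one way to do that with the standard java.lang.management API; the GcReport class is made up for illustration, the collector names it prints differ from JVM to JVM, and it supplements the verbose GC output rather than replacing it. The policy itself is still chosen at JVM startup (e.g. -Xgcpolicy:gencon on the IBM JVM).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for stamping test results with GC activity.
class GcReport {
    // One line per collector: name, number of collections, accumulated collection time.
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Call at the end of a test run and save the output with the results.
        for (String line : snapshot()) {
            System.out.println(line);
        }
    }
}
```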


Monday, August 25, 2008

Operational stability

One class of performance problems comes from runtime operations: the stability of the runtime itself, separate from application code or content changes. Here are some of the things to consider when building out a 24/7 site so that it can maintain availability even when things take a turn for the worse. I assume you already have built out multiple cells for the deployment of your high availability environment. I prefer cells over clusters because you have more operational/runtime flexibility with cells than you do with clusters, particularly when you are trying to apply maintenance (e.g. shutting off the load balancing to the cell that is going to be updated).

1. Ability to make repeatable changes to the configuration. This involves scripting and testing the scripts (preferably not in production) to be sure they work as intended.

2. Identify the active configuration. It is important to know which configuration is active so that the correct cells are taken out of rotation.

3. Make the scripts aware of the active configuration. One really doesn't want to have scripts making changes to the active configuration by mistake.

4. Back out. Depending on your requirements, be able to flip back to a previously working configuration as quickly as possible to minimize downtime.
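
Point 3 in the list above can be sketched as a simple guard. Real WebSphere changes would be scripted with the administrative tooling; the CellMaintenance class and the cell names here are entirely made up to show the idea of a script that knows which cell is active and refuses to touch it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a multi-cell deployment where one cell is taking traffic.
class CellMaintenance {
    private final String activeCell;
    private final List<String> applied = new ArrayList<>();

    CellMaintenance(String activeCell) {
        this.activeCell = activeCell;
    }

    // Applies a change only to a cell that is out of rotation; returns false otherwise.
    boolean apply(String targetCell, String change) {
        if (targetCell.equals(activeCell)) {
            return false; // refuse: this cell is serving production traffic
        }
        applied.add(targetCell + ":" + change);
        return true;
    }

    List<String> applied() {
        return applied;
    }

    public static void main(String[] args) {
        CellMaintenance maintenance = new CellMaintenance("cellA");
        System.out.println(maintenance.apply("cellA", "fixpack")); // false
        System.out.println(maintenance.apply("cellB", "fixpack")); // true
    }
}
```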

Wednesday, August 20, 2008

fn:toLowerCase

Ran into a strange situation where starting up the app server resulted in an exception. The error manifests itself as a complaint that there is no function mapped to the name fn:toLowerCase. This started happening after fixpack 19 was applied over fixpack 11. Long and short of it is the solution was to clear out the tmp files so the JSPs were recompiled after applying fixpack 19.

00000042 ServletWrappe E   SRVE0068E: Could not invoke the service() method on servlet /global/tiles/geturl.jsp. Exception thrown : javax.servlet.ServletException: No function is mapped to the name "fn:toLowerCase"
        at org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:650)
        at com.ibm._jsp._geturl._jspService(_geturl.java:192)

Monday, August 18, 2008

Collect traces when having problems with the JDBC connection pool

The referenced technote below provides the instructions on how to collect JDBC traces when encountering certain JDBC connection pool problems.

0000006b FreePool E J2CA0045E: Connection not available while invoking method createOrWaitForConnection for resource jdbc/abcd.

If you see the above error message in your logs we really need to collect JDBC traces. The traces will tell us why we are seeing this error. Is it long running transactions? Not a high enough max? Connections not properly closed? Who knows... the trace knows. Collect the trace and analyze the problem or open a PMR. Don't just blindly increase the connection pool maximum without knowing why it needs to be increased.

IBM - Using Connection information in WebSphere trace files to troubleshoot J2CA0045E and J2CA0020E or connection wait time-out problems.
J2CA0045E and J2CA0020E errors can be caused by many problems. They are showing a time-out condition where a resource or a managed connection is not available to fulfill a connection request.

In this technote we will use connection information in WebSphere® trace files to troubleshoot J2CA0045E and J2CA0020E or connection wait time-out problems.

Saturday, August 16, 2008

3 days of not selling burgers

Wow, a 3 day outage at Netflix. I know this is a blog about performance (specifically around WebSphere Application Server) but I have a fascination with production outages because I normally work those kinds of problems. They always revolve around human error either by fumble fingering something or not executing (like not conducting performance testing).

But a three day outage is excessive. That must have been a real interesting problem because I've never seen an outage that long (at least not after I have arrived to fix it). And it'll cost Netflix an estimated $6 million.

As the article states, imagine if McDonald's couldn't sell burgers for 3 days. BTW, the McD's in Dwight, IL off I-55 lost their cash registers one day a couple of weekends ago when I stopped in to get some food on my long drive to Chicago. It was interesting to see how dependent McD's is on that register system, because it drives their whole operation, and what the human equivalent looks like: yelling orders into the back area and using calculators and paper/pen to record how many of each item was sold. Needless to say the credit card readers were down too so if you didn't have cash you were going hungry that day.

Lessons From Netflix's Fail Week - Bits - Technology - New York Times Blog
Netflix, the DVD-by-mail service, largely ceased shipping DVDs to its 8.4 million subscribers for three days this week. The company vaguely blames a technology glitch.

Wednesday, August 13, 2008

One reason I like to take thread dumps during performance testing

I've written before about thread dumps and the value of taking them during a performance test. The other day we finished our baseline testing so we started a duplicate test simply for taking thread dumps. Even though I wasn't seeing any anomalies in the baseline this is just something I do to make sure I dot all my i's and cross my t's. Plus, if there are any anomalies they will show up in the thread dump.

Lo and behold in the thread dumps (remember we take at least 3 thread dumps spread at least 2 minutes apart) I found a number of threads sitting on

at java/net/Inet6AddressImpl.lookupAllHostAddr(Native Method)

which seemed odd to me. I live by the rules of mathematics and its definition of randomness. A random thread dump at any random point in time should result in the threads doing random different things in each thread dump. If one thread dump in the series shows a couple of threads doing the same thing then that is odd. If more than one thread dump in the series shows more than one thread doing the same thing then we have a bottleneck! Bottlenecks can limit an application's ability to use CPU and keep the throughput down. If you can't fix the bottleneck then you'll need more hardware to scale up which means spending more money. If you can afford that then stop here and call your finance guy.
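
The rule of thumb above can be sketched as a small comparison across dumps. This assumes each dump has already been reduced to the top stack frame of every thread; the DumpComparer class is made up for illustration, and apart from the Inet6AddressImpl frame the sample data in main is invented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: flags frames that keep showing up across a series of thread dumps.
class DumpComparer {
    // Each dump is the list of top stack frames, one entry per thread.
    // Returns frames shared by at least minThreads threads in at least minDumps dumps.
    static Set<String> suspects(List<List<String>> dumps, int minThreads, int minDumps) {
        Map<String, Integer> dumpsWhereRepeated = new HashMap<>();
        for (List<String> dump : dumps) {
            Map<String, Integer> threadsPerFrame = new HashMap<>();
            for (String frame : dump) {
                threadsPerFrame.merge(frame, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : threadsPerFrame.entrySet()) {
                if (e.getValue() >= minThreads) {
                    dumpsWhereRepeated.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : dumpsWhereRepeated.entrySet()) {
            if (e.getValue() >= minDumps) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String dns = "java/net/Inet6AddressImpl.lookupAllHostAddr";
        List<List<String>> dumps = Arrays.asList(
                Arrays.asList(dns, dns, "com/example/OrderDao.load", "com/example/Cart.total"),
                Arrays.asList(dns, dns, dns, "com/example/Cart.total"),
                Arrays.asList("com/example/OrderDao.load", "com/example/Cart.render", dns, dns));
        // More than one thread, in more than one dump, doing the same thing.
        System.out.println(suspects(dumps, 2, 2));
    }
}
```

In a real dump, idle pool threads parked waiting for work will also repeat and need to be filtered out before a comparison like this means anything.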

I searched the PMR database and found that indeed there is an interesting side effect to IPv6 and it was affecting the throughput of the application I was testing! Fortunately the PMR referenced a technote on the subject and I'm hoping we can eliminate this issue. The good thing that will come out of this is we will see a throughput improvement in the application once we apply the proper configuration. The improved throughput will mean an immediate cost savings in additional hardware we would have had to purchase to make up for the differential. A win win for everyone (well, except for the IBM hardware sales folks but c'est la vie!).

In this particular multi-tiered environment (Process Server talking to WebSphere Application Server talking to CICS) I still have to go back and collect thread dumps on Process Server once I'm satisfied I am not seeing any other issues in the WAS tier. Who knows what anomalies that will uncover (hopefully none so this testing can wrap up soon).

IBM - HostName lookup causes a JVM hang or slow response
If the DNS is not setup to handle IPv6 queries properly, the application must wait for the IPv6 query to time out.

Test early - Test often

There is nothing I can stress more (and I'm sure I've done it before but I'll do it again) than to start performance testing with the first build of any application. If there is one thing that 20-30 years of computer science has taught us, particularly in the past decade of the .com boom, it is that performance testing cannot be just the last few weeks of an application lifecycle.

Performance testing is where the application's warts and ugliness are exposed. It takes time to troubleshoot and solve each problem as it is encountered. A lot of the problems will entail code changes that then have to be re-tested functionally. I've been on some performance engagements that have lasted months due to the sheer number of application code defects that had to be corrected.

Managers have to understand that software will always have bugs and definitely will have performance issues. This is a fact of life. No single person has the brain power to take Java code (or any other language of choice), compile it and load test it in their head. So do the right thing. Take each application build (or weekly release candidate if you do daily builds) and start performance testing from the start. This way as each week progresses you'll be able to determine if the application is meeting its expected performance and non-functional requirements objectives. You'll also be able to see if the application is improving or degrading each week.

Finally, with continuous performance testing there won't be any surprises in the last few weeks leading up to the "go live" date in production.

Monday, August 11, 2008

JVM tuning - too little heap

I'm working with an application this week that has been given a max heap of 256M. This is causing GC to occur every 250-400ms, and each GC consumes about 35-40ms. In addition, the tenured space is down to 7% free. Think we need a higher maximum JVM heap?
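To put numbers on that, here is a back-of-the-envelope sketch. It assumes the 250-400ms figure is measured from the start of one GC to the start of the next (the verbose GC log would confirm that), and the class and method names are only for illustration.

```java
// Hypothetical helper: estimates how much wall-clock time is lost to GC pauses.
class GcOverhead {

    /** Percentage of time spent paused, given the pause length and the start-to-start GC interval. */
    public static double overheadPercent(double pauseMs, double intervalMs) {
        return 100.0 * pauseMs / intervalMs;
    }

    public static void main(String[] args) {
        // Shortest pause with the longest interval, then longest pause with the shortest interval.
        System.out.printf("best case:  %.1f%%%n", overheadPercent(35, 400));
        System.out.printf("worst case: %.1f%%%n", overheadPercent(40, 250));
    }
}
```

Under that assumption the JVM is spending somewhere between roughly 9% and 16% of its time in GC, which is far too much for a heap this small.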

Thursday, August 7, 2008

More outages making the news

I noticed yesterday that American Airlines went down. There was discussion about it at www.flyertalk.com and I then found this article about a Google outage. This just goes to show folks how hard it really is to provide reliable Internet services to the masses. While one can only speculate how either of these outages occurred it is clear that even the big guys have a hard time doing a hard thing.

Google Gmail, Google Apps Outage in the Cloud
The search company spends billions of dollars on servers that can support the services its millions of Gmail and Apps users require;

Thursday, July 31, 2008

Messaging engine and its database

Some optimizations are required on database tables. My good friend Tom Alcott discusses the lack of necessity to optimize the messaging engine database tables and for good reason.

The WebSphere Contrarian: Are you sure you want to reorg that messaging engine database?
The standard practice for database administration is to periodically check on the database and table organization to insure optimal performance -- but do these standard practices apply to a database used for JMS persistent message storage with IBM® WebSphere® Application Server?

New book on DataPower

If you are not aware of the performance boost DataPower can provide your environment and the XML threat protections it provides you really need to get up to speed on it. Some of my colleagues have taken the time to put this book together.

Amazon.com: IBM WebSphere DataPower SOA Appliance Handbook: Bill Hines, John Rasmussen, Jaime Ryan, Simon Kapadia, Jim Brennan: Books
IBM WebSphere DataPower SOA Appliance Handbook (Hardcover)

Wednesday, July 23, 2008

"free -m" on the Linux command line

One thing you can never do is over commit available RAM in the machine. If the application server fails to start and SystemOut.log contains a message like this:

JVMJ9VM015W Initialization error for library j9gc23(2): Failed to instantiate heap. 1G requested
Could not create the Java virtual machine.

Then it is highly likely that you tried to start an app server with not enough free physical RAM. Check the "free -m" command on Linux. Otherwise refer to your OS specific manuals to see how to determine how much free RAM you actually have.
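If you want to script the check, here is a minimal sketch that does the same sum the "-/+ buffers/cache" line of free does: MemFree plus Buffers plus Cached from /proc/meminfo. The class and method names are hypothetical, and the comparison is deliberately naive. A JVM needs native memory on top of its heap, so treat a "fits" answer as necessary, not sufficient.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical pre-flight check before starting an app server on Linux.
class FreeRamCheck {

    /** Sums MemFree, Buffers and Cached (reported in kB) and returns the total in megabytes. */
    public static long reclaimableMb(List<String> meminfoLines) {
        long kb = 0;
        for (String line : meminfoLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2
                    && (parts[0].equals("MemFree:") || parts[0].equals("Buffers:") || parts[0].equals("Cached:"))) {
                kb += Long.parseLong(parts[1]);
            }
        }
        return kb / 1024;
    }

    /** Naive test: the requested heap must be smaller than what is currently free. */
    public static boolean heapFits(long requestedHeapMb, long freeMb) {
        return requestedHeapMb < freeMb;
    }

    public static void main(String[] args) throws IOException {
        long freeMb = reclaimableMb(Files.readAllLines(Paths.get("/proc/meminfo")));
        long requestedMb = 1024; // the 1G from the error message above
        System.out.println("free MB: " + freeMb + ", 1G heap fits: " + heapFits(requestedMb, freeMb));
    }
}
```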

Tuesday, July 22, 2008

Reliability and Availability

Common topics in the performance space are reliability and availability. This article goes on to describe some of the problems that can occur in such an environment. This is the challenge of building reliable systems from unreliable components. From a hardware perspective this can be done if one has enough money for all the redundancy that is necessary. If one tries to do this on the cheap they will fail.

There is an interesting point in the article that Google is trying to solve this problem with software. While there are products like WebSphere XD that provide software level solutions for some problems they can't solve the problem as easily as hardware can. For example, say the database is slow. Give the database faster disk, more RAM or a RAM backed SAN and you can eliminate that problem. There is relatively little you can do from a software perspective to fix that. Another example is a server going down. Sure, we could route data to another server using a software component but then why not just do it at the hardware level? Okay, so there are a couple of places where software is useful, like maintaining J2EE affinity, but the same can be done by a hardware load balancer. It just depends on where you put the smarts.

The problem with using software to fix this is that it introduces another, more complex, layer of hardware/software, whereas redundant hardware makes things a lot simpler.

A few people will argue that software is cheaper. I don't agree with that argument. I think hardware is cheaper. Especially when it makes troubleshooting that much easier (and quicker) than home built software. If it takes 6-12 months to debug software in production then that is a lot of money (and bad press earned) down the drain.

Amazon S3: For now at least, sometimes you have to reboot the cloud | News - Business Tech - CNET News
Afterward, Om Malik called cloud computing frail: "The S3 outage points to a bigger (and a larger) issue: the cloud has many points of failure--routers crashing, cable getting accidentally cut, load balancers getting misconfigured, or simply bad code. And he's right, to a degree, but there are three things that shouldn't be overlooked before writing cloud computing off as a failure.

Friday, July 18, 2008

Negative testing

The following article is a good example of why companies writing software need to hire subject matter experts when it comes to testing their applications, particularly for what is commonly referred to in performance speak as "negative testing." This is where we, subject matter experts on performance testing, purposely cause a negative event to occur. For example, I routinely disable the Network Interface Card (NIC) [also known as your ethernet card] while running load/stress tests just to see how the application environment handles the event. If the application breaks then it fails the test and a defect is written up against the application and back to development it goes. It is easy enough in Unix environments to disable a NIC but if worse comes to worst I'll pull the ethernet cable out of the jack. Crude but it works just as well.

Irish Examiner | Airport radar meltdown due to 'faulty' component
The malfunctioning network card, a component that allows computers to communicate with each other, was also blamed for previous glitches in the Dublin system.
It is unfortunate that the people who put that airport radar system together didn't conduct negative testing because a problem like the one that occurred could have been completely avoided.

Likewise, while they are adding more monitoring I'm dubious that will help them. Given that they haven't tested for negative events, what other negative events that they haven't thought of could occur? For example, some of the others I routinely test for are lost packets in the network, total network failure, network lag, 100%+ CPU, low memory, too many airplanes in the radar, duplicate radar images, and the list goes on and on.

All they need is for a different negative event to occur and they could (and probably will) suffer another outage. What they need to do is get a subject matter expert to teach them how to test their code.

BTW, notice the sentence about "delays were still being experienced at peak times"? Seems someone hasn't done stress testing either...

Wednesday, July 16, 2008

IBM Support Assistant

There is an update to the IBM Support Assistant. If you are running WebSphere Application Server and you do not have this tool then click on the next link and download it.

IBM Software Support - Overview
The IBM Support Assistant is a complimentary software serviceability workbench that helps you resolve questions and issues with IBM software.

Friday, July 11, 2008

createOrWaitForConnection and why we need a finally block

In previous versions of WebSphere Application Server this method actually had a name I liked better: createOrWaitForVictimConnection, where the victim connection was one that was not closed within the same thread that opened it and was eventually reaped by the app server. Either way, if you see this method time out in your log file then that means somewhere, somehow someone has not closed a connection to a pooled resource properly. Get the following 3 words into someone's vocabulary... try, catch, finally. I can't emphasize enough how important it is to close the connection in the finally block. If you don't, then any exception that occurs can leave un-closed connections hanging around. If you are in a high volume environment you'll find this to be a serious bottleneck! Follow this pseudo code:


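// ds is the javax.sql.DataSource looked up for the pooled resource
Connection con = null;
try {
    con = ds.getConnection();
    // do some work
    methodThatUsesConnection(con);
} catch (Exception e) {
    // maybe log an error here if you like?
} finally {
    // always runs, so the connection goes back to the pool even after an exception
    if (con != null) {
        try {
            con.close();
        } catch (SQLException e) {
            // log the failed close; there is nothing else to do here
        }
    }
}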

Off topic alphaworks listing

I try not to go off topic as this is a performance blog but some folks might find the following technology preview useful.

alphaWorks Services | IBM Pass It Along | Overview
A peer-to-peer knowledge exchange network that builds communities of experts and learners around "nuggets" of knowledge.

Wednesday, July 2, 2008

Do you run WebSphere? Then you need this diagnostic tool!



IBM: IBM Support Assistant

I have used this tool (and its predecessors) so frequently I don't go anywhere without it/them. Not only does it produce handy little graphs like the one above showing Java GC but it also provides some darned good analysis on recommended changes to the JVM command line parameters (especially if you're running on WAS v6.0.x or earlier which do not run on Java 1.5) to improve your memory utilization. Now, of course, you can only get this kind of feedback from the tool if you followed one of my earlier recommendations to turn on verbose GC. You have turned on verbose GC by now, right? If you haven't then you have to go and do that right now.

So, go to the link for the IBM Support Assistant and download this tool.

ITCAM instrumenting your own method capture

One of the nice things that application monitoring tools provide is the ability to specifically measure information about your own method calls. This page describes how to do so with the IBM ITCAM tooling.

Help -
A custom request is an application class and method that you designate as an edge or nested request. When the method runs, a start and end request trace record is written to the Level 1 or Level 2 tracing.

WebSphere Process Server Performance

The first week of June I got to work in person with a colleague of mine, Richard Metzger, from the IBM labs in Böblingen on a process server engagement. Richard has started his own performance blog for process server!

WebSphere Process Server Performance
Thoughts and opinions around performance of and capacity planning for IBM WebSphere Process Server and other products (like e.g. DB2), as they are used in the context of business process management and business process automation solutions.

Monday, June 23, 2008

kill -3 does not produce a javacore

This is a sporadic problem I run into: kill -3 does not generate a javacore and the process has to be terminated. The first thing to check is the service release of the JVM, making sure the JVM is at the correct level for the corresponding level of WebSphere Application Server. In some cases I have installed later fixes than those that have been tested with WebSphere Application Server.

Another suggestion a colleague of mine had was to generate an AIX core of the process (make sure the ulimits for file and core are properly set to unlimited but you already knew that because you followed the installation instructions for WebSphere Application Server). I don't remember which kill signal needs to be sent but I'm sure a Google search will reveal that answer.

Tuesday, June 17, 2008

Get thread dumps during supplemental load tests

I recently found a bug in an application that the developers were not aware of. The code had a synchronized block they thought would be low cost. Lo and behold, in our load testing we found that after a certain number of users were active their response times started to go up exponentially. I took some thread dumps and found the synchronized block of code.

For anyone interested in performance:

1. Load test
2. Get thread dumps during bad response times.

If neither is done there will be problems in production. If you do not know how to analyze javacores open a PMR with IBM and IBM Support can help identify the problem.

Wednesday, June 11, 2008

WebSphere Process Server - database configuration

It is crucial that a WPS gold topology have the databases properly configured. If they are not (i.e. all pointing to the same database instance) there will be contention issues that will not resolve themselves. One of my esteemed colleagues wrote this great article.

Building clustered topologies in WebSphere Process Server V6.1
This leads you to the database settings screen, arguably the most complex of all the steps.

Wednesday, June 4, 2008

Why cross cell data centers are not a best practice for disaster recovery

This topic has been coming up time and time again. It is time that people read about the trade-offs when trying to conduct DR with a single cell across multiple data centers. Yes, this might work. But more often than not the various interconnects between the two data centers and the interaction can lead to very negative consequences that disable the intended DR effort.

Do the right thing. Isolate the two data centers with separate cells. You'll find this works not only much better but has a very high success rate if done correctly (i.e. you use scripting to build your environments therefore having repeatable processes across DCs).

Comment lines: Tom Alcott: Everything you always wanted to know about WebSphere Application Server but were afraid to ask -- Part 3
While the notion of a single cell across data centers is bad from a risk aversion perspective, running a cluster across two data centers not only requires you to forget about minimizing risk, as noted above (since a cluster cannot span cells), but further increases risk along multiple dimensions.

Tuesday, May 27, 2008

Selling Application Monitoring

Take the word "security" and replace it with "application monitoring" in this article and you have my problem. Application Monitoring is really important.

Schneier on Security: How to Sell Security
How to Sell Security

It's a truism in sales that it's easier to sell someone something he wants than something he wants to avoid. People are reluctant to buy insurance, or home security devices, or computer security anything. It's not they don't ever buy these things, but it's an uphill struggle.

Thursday, May 22, 2008

Recent performance related articles

Every once in a while I take pen to paper (well, more correctly fingers to keyboard) and key out another article. Here are a couple I've written this year.

Comment lines: Alexandre Polozoff: Cultivating a performance specialist
Comment lines: Alexandre Polozoff: How well does traditional performance testing apply to SOA solutions?

Disaster Recovery & AIX GLVM

IBM eServer - Using the Geographic LVM in AIX 5L
"GLVM can help protect your business from a disaster by mirroring your mission-critical data to a remote disaster recovery site. If a disaster, such as a fire or flood, were to destroy the data at your production site, you would already have an up-to-date copy of the data at your disaster recovery site."

Absolutely critical technology if you are at all interested in DR (Disaster Recovery). I know at least two customers who have used this with synchronous updates (i.e. the local disk update is synchronized with the remote disk update) and have seen little overhead even over a distance of about 150 miles between data centers. This is an ideal technology for people looking to set up DR for their WebSphere Application Server, Portal, Process Server, MQ, DB2 and the list goes on and on. I'm absolutely excited about this technology and the potential impact this can have for our high availability site customers.

If you haven't read up on GLVM I highly recommend taking the time.

JVM memory and high CPU

In the "Butterfly Effect" a butterfly flapping its wings on one side of the world can create a typhoon on the other.

In the world of Java: memory usage can cause high CPU. In the summer of 2007 I spent 3 weeks working with a customer in the UK and we spent a few days measuring the memory usage of the application. We worked with the developers on reducing the memory footprint. In many cases these were simple code changes not requiring architectural or design changes to the app.

Reducing an application's memory footprint reduces the amount of garbage collection the JVM needs to execute. GC will use up CPU so obviously executing fewer GC cycles reduces the CPU load.

The UK application went from about 80+% CPU down to 25-30% for the exact same load test and better response times.

Verbose GC is your friend here too. Another application I'm looking at right now is suffering from high CPU and we can see in verbose GC that during these high CPU events the JVM is actively GCing because it is running low on memory. Obviously the memory settings need to be changed here, but I wonder whether we would have to if we spent some time profiling the app and reducing its footprint. I guess it depends on whether they will take the time to work on this effort.

Wednesday, May 21, 2008

JDBC driver versions

In helping one of my colleagues this week I've come across another common problem that should be audited by everyone running a J2EE app server: the JDBC driver version.

Typical scenarios:
1. Your application has been working and all of a sudden starts to have strange SQL errors it never had before.
2. Your application works but with intermittent SQL errors (unrelated to SQL statement bugs).
3. You see a lot of StaleConnectionExceptions.

Check the version of the database server and then the version of the JDBC driver (in WebSphere Application Server the JDBC driver version is printed in SystemOut.log during the app server startup sequence). Typically the DBA just updates the server without telling anyone. This means that a bunch of clients are backlevel on the JDBC driver. I don't know what it is that some database vendors do in their fixpacks but they often seem to break the protocol used by the previous client drivers. So... audit your JDBC drivers periodically and if you can get your DBA to notify you of when updates are going on to the database servers you could even save yourself some grief in production.
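For the periodic audit, the standard JDBC metadata calls report both sides of the connection, so you don't have to dig through SystemOut.log. This is only a sketch: the class name JdbcVersionAudit and its methods are made up for illustration, and how you obtain the DataSource (typically a JNDI lookup in the app server) is left to you.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical audit helper: reports the JDBC driver version next to the database server version.
class JdbcVersionAudit {

    /** Builds a one-line summary from the metadata the driver and the server report. */
    public static String describe(DatabaseMetaData md) throws SQLException {
        return "driver " + md.getDriverName() + " " + md.getDriverVersion()
                + " -> server " + md.getDatabaseProductName() + " " + md.getDatabaseProductVersion();
    }

    /** Borrows a pooled connection, reads its metadata and always returns the connection to the pool. */
    public static String audit(DataSource ds) throws SQLException {
        try (Connection con = ds.getConnection()) {
            return describe(con.getMetaData());
        }
    }
}
```

Log the result of audit(ds) at startup or from a scheduled job, and compare the two versions whenever the DBA announces (or forgets to announce) a fixpack.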