Double Trouble

I was sitting in my living room, reading a little, at 1:00am, when suddenly I heard this annoying water-drop sound, repeating itself quite disturbingly. So I wandered around, and it turned out I had a leak in the tap that supplies water to my washing machine. I looked behind it and noticed a small lake had formed.

Ten days later (I kid you not!) the guy from the washing machine company came to fix the problem. I told him there was a leak from the tap. He started disassembling the washing machine and explained to me in detail how I should always keep this compartment clean and that compartment clean (where you put the washing liquid), and I’m like: yeah, yeah, just solve the goddamn leak so we can wash our clothes, man. By this time, the family had started to wear very unfashionable clothes, and new dirty clothes got dumped in a pile. You could say I suffered from this malfunction. He worked on it for about 30 minutes, cleaned this and cleaned that, told me that was the reason I had a leak from the tap, turned the machine on, and everything seemed to be OK. No leak from the tap. However, once the machine started to drain, meaning water got pumped out to the plumbing hole, water started to fill the room. The plumbing was blocked.

Water had never filled the room like that before this maintenance guy came and started playing with the machine. Still, it really did not seem to have anything to do with him, and it was definitely a problem. So maybe this was the reason there was a pond behind my washing machine in the first place, and not the leak from the tap. I did not know the answer, but I worked on clearing the plumbing blockage myself. Eventually the washing machine returned to work, no water was pouring onto the kitchen floor, and there was no immediate leak from the tap as before. Problem solved.

It is amazing how IT applications act the same as washing machines. I don’t know why, it’s beyond me, but in the really hard performance problems we had to deal with in our DoctorIT service, there was a clear phenomenon the customer suffered from (poor response times, a performance bottleneck, etc.), and there was a root cause we found that beautifully explained that phenomenon. The customer swore there had been no other performance problem before, and since we had isolated a well-defined root cause, we were certain we had found and solved the performance problem. However, that did not turn out to be the case. Can you guess why?

Because the really hard-to-detect problems, the tricky ones that keep giving you a hard time detecting and isolating them, come in PAIRS. They don’t like to play alone. It’s not fun for them. They need a friend. The moment you put your hands on one of them, the other hits you from behind. You tend to forget they are a pair, and all your calculations and assumptions focus on isolating a single one, but that is the worst mistake you can make.

First lesson in solving hard Heisenbugs: assume, from the beginning, that there is more than one performance bottleneck.

If you are lucky, there is only a single performance bottleneck, and your work will be fast and smooth. However, if you assume there is more than one of these application performance problems, you won’t be surprised. This way you can better manage your resources for solving the actual bottlenecks, and your customer’s expectations will be set accordingly.

So let’s get a little bit technical and describe a real case study of detecting and isolating a performance bottleneck in an enterprise application. We were called in after the customer had been suffering from poor application performance and response times for over a month. The application would freeze or hang for a few minutes, and then return to work smoothly. The hangs happened once a week on average, with no obvious pattern, and no code change that could hint at where the performance problem might be. We recorded the transaction traffic and resource usage of the application (millions of HTTP requests and SQL statements per day, for a period of a week), until the application hung again. Resource consumption analysis revealed the chain behind the poor performance: memory over-consumption in the application server drove other resources to high utilization; the lack of memory caused a high paging rate, which caused a deep read/write disk queue, which was the immediate cause of the performance bottleneck and poor response times.
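As an aside, that kind of causal chain shows up when you line the resource metrics up on a shared timeline. The sketch below is only an illustration of the idea, in Python; the metric names, thresholds, and sample values are hypothetical, not from the actual case. It just flags the intervals where low free memory, heavy paging, and a deep disk queue all coincide.

```python
# Hypothetical per-minute samples from the application server.
# Fields: free memory (MB), pages swapped per second, disk queue length.
samples = [
    {"minute": 0, "free_mem_mb": 2100, "pages_per_sec": 40,   "disk_queue": 1},
    {"minute": 1, "free_mem_mb": 300,  "pages_per_sec": 1200, "disk_queue": 14},
    {"minute": 2, "free_mem_mb": 150,  "pages_per_sec": 2500, "disk_queue": 22},
    {"minute": 3, "free_mem_mb": 1900, "pages_per_sec": 60,   "disk_queue": 2},
]

# Assumed thresholds -- you would tune these to the server in question.
LOW_MEM_MB = 500
HIGH_PAGING = 1000
DEEP_QUEUE = 10

def suspicious_minutes(samples):
    """Minutes where the whole chain (low memory -> heavy paging ->
    deep disk queue) shows up at once, i.e. the likely hang windows."""
    return [
        s["minute"]
        for s in samples
        if s["free_mem_mb"] < LOW_MEM_MB
        and s["pages_per_sec"] > HIGH_PAGING
        and s["disk_queue"] > DEEP_QUEUE
    ]

print(suspicious_minutes(samples))  # [1, 2]
```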

Our theory was that a single instance of an HTTP request triggered an event that over-consumed memory in the application server. We also assumed this event started minutes before the actual response-time degradation occurred. We searched for a long-running transaction that started 5-10 minutes before the poor response times began and lasted at least until 10-20 seconds before the application hung. We indeed found one! We rushed back to the customer site and managed to recreate the transaction and cause the performance problem all over again. With specific parameters, this transaction would cause enormous memory consumption. The problem was quickly detected, isolated, and fixed. The customer was happy, we were happy, and our bank account was almost happy.
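To make that search concrete, here is a minimal sketch of the kind of time-window filter we ran over the recorded traffic. The record format, names, and timestamps are hypothetical; the point is only the logic of keeping transactions that started 5-10 minutes before the degradation and were still running until just before the hang.

```python
from datetime import datetime, timedelta

# Hypothetical records: (transaction_id, start_time, end_time).
# In the real case these came from the recorded HTTP/SQL traffic.
transactions = [
    ("TX-1001", datetime(2009, 3, 2, 14, 12, 5),  datetime(2009, 3, 2, 14, 12, 9)),
    ("TX-1002", datetime(2009, 3, 2, 14, 13, 40), datetime(2009, 3, 2, 14, 21, 55)),
    ("TX-1003", datetime(2009, 3, 2, 14, 20, 1),  datetime(2009, 3, 2, 14, 20, 3)),
]

# Observed times (also hypothetical): when response times degraded and when the app hung.
degradation_start = datetime(2009, 3, 2, 14, 19, 0)
hang_time = datetime(2009, 3, 2, 14, 22, 10)

def suspects(transactions, degradation_start, hang_time):
    """Transactions that started 5-10 minutes before the degradation
    and were still running until roughly 20 seconds before the hang."""
    earliest_start = degradation_start - timedelta(minutes=10)
    latest_start = degradation_start - timedelta(minutes=5)
    min_end = hang_time - timedelta(seconds=20)
    return [
        tx_id
        for tx_id, start, end in transactions
        if earliest_start <= start <= latest_start and end >= min_end
    ]

print(suspects(transactions, degradation_start, hang_time))  # ['TX-1002']
```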

Two weeks later, the customer called again and notified us that the problem had happened again. How could that be? We knew for sure it was fixed! Well, to make a long story short, it turned out there was another transaction which, under specific parameters, generated hundreds of SQL queries that overloaded the database, and which had also first appeared in that same month.

An entirely different root cause, the same phenomenon. We should have known: Double Trouble.

By the way, my washing machine still leaks from time to time. God knows why. Until next time.

Lanir