There was no choice: we had to move. There is nothing more annoying than moving to new offices, but we are growing. It took a while, but we finally found our new space, and we had to do a little refurbishing before the actual move. The previous tenants, for some reason, decided to take their entire data center infrastructure with them. They literally cut all the internal communication cables, in a brutal and strange manner, so they could take their racks with them. It is as if someone told the movers to ‘take the racks’!
The first thing we wanted to verify was that the new wiring to our communication racks was working. We took the wireline technician to the communication room, where there was a cable hanging from the ceiling with a phone end-point, so we automatically assumed it was one of the phone lines in the office. The next step was to find the hub that was wired to this phone line, so the technician could define the line as ours and enable it for internet. So we were looking around our offices, which were then filled with construction workers and paint brushes, and we couldn’t find the hub. The technician remembered that he had already done some wiring work in the building, and he thought the hub was on the other side of it, in one of the offices. We kept looking in different places, with no success, and eventually decided we needed to enter one of the abandoned offices. But the building maintenance guy had already left. No keys. End of first session.
The next day I got the key to the office, and we immediately found the hub, and the technician got to work. We then returned to our space to see if the phone line was enabled. It wasn’t. At this point I waited with him while he kept hopping back and forth to the hub we had found (which was on the other side of the building, a five-minute walk each way, and it was hot), until he told me that it just didn’t work and we would have to buy new infrastructure for the phone lines. So I shouted a bit, and talked to his manager, and eventually someone in the phone company told them there might be another hub in another office, the one next door to us. We were thrilled, but of course it was already late. Again, no keys. End of second session.
The next day, we went to our neighbors, and indeed found another hub. The technician was exhilarated and started working again. That same evening I got a call that there was still no active phone line. Again, we would have to lay new phone wires, though this time only from the neighbors’ hub rather than from the other side of the building. End of third session.
At this point, I was starting to feel that we were missing something. I went to our new space late at night, after all the handymen had left, and entered the communications room. I took the phone line that we had assumed WAS the wired phone line and started to trace its path. I pulled and pulled, and the phone line came out in my hand. I mean the other end of the phone line came out in my hand. IT WASN’T CONNECTED TO ANYTHING. NO WONDER THE TECHNICIAN COULDN’T GET THE PHONE LINE TO WORK!
I kept looking around the ceiling and eventually found a different cable that looked like a phone line and seemed to be coming from our next-door neighbors. Of course, this one turned out to be the REAL phone line.
It immediately reminded me of the first thing I do when I get called to a customer site during a malfunction of some kind. I ask, “What’s your current goose chase?” Which means: what is the current theory you are trying to validate, even though there is no particular reason to pursue it? The only reason you ARE pursuing it is that you have to do something, otherwise your managers and colleagues will start to think that you don’t have a clue, which is exactly the situation.
It usually happens with malfunctions in multi-tier applications. These events involve so many different technical experts that goose chases are bound to happen. For example, I once got called to a customer whose application was freezing 10 times a day. It happened in his critical online secured financial application, which of course uses SSL, and it didn’t happen in his online content website, which was not SSL encrypted. Each application had a different app server, so there were a zillion possible reasons for the problem, but nevertheless the system guys came up with their own theory: there are problems where SSL is used, and there aren’t problems where SSL is not used. So it must be the SSL! Why not upgrade the SSL accelerator? Yeah! That’s a good idea! Let’s do that!
And believe it or not, that’s what they did. By the time I arrived, a couple of workdays had already been invested in upgrading this SSL accelerator, which OF COURSE had NOTHING to do with the actual root cause.
My advice: define the phenomenon as accurately as you can, and then collect as much useful data as you can: transaction response times broken down by tier, resource usage within each tier, all available logs from each tier, and the throughput of each tier. It all has to be collected at the granularity of the problem; the average behavior over the last 5 minutes probably won’t cut it. Once all the data is available and aligned on a timeline, there are many questions that can be answered.
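To make that concrete, here is a minimal sketch of what aligning per-tier data on a common timeline might look like. Everything here is hypothetical: the sample records, the tier names, and the 10-second bucket size all stand in for whatever your real probes collect, at whatever granularity your problem demands.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical samples: (timestamp, tier, that tier's own response time in ms),
# collected during the malfunction window from each tier's logs or probes.
records = [
    ("2024-05-01 10:00:03", "web", 120),
    ("2024-05-01 10:00:07", "app", 80),
    ("2024-05-01 10:00:09", "db", 15),
    ("2024-05-01 10:00:14", "web", 130),
    ("2024-05-01 10:00:16", "app", 95),
    ("2024-05-01 10:00:18", "db", 2300),  # the db tier stalls in this window
    ("2024-05-01 10:00:24", "web", 125),
    ("2024-05-01 10:00:27", "db", 18),
]

BUCKET_SECONDS = 10  # granularity of the problem, not a 5-minute average

def bucket(ts: str) -> int:
    """Floor a timestamp to the start of its bucket (seconds since epoch)."""
    epoch = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").timestamp()
    return int(epoch // BUCKET_SECONDS) * BUCKET_SECONDS

# Align every tier on one timeline: bucket start -> tier -> list of samples.
timeline = defaultdict(lambda: defaultdict(list))
for ts, tier, ms in records:
    timeline[bucket(ts)][tier].append(ms)

# Walk the timeline bucket by bucket and flag the slowest tier in each one.
for b in sorted(timeline):
    averages = {t: sum(v) / len(v) for t, v in timeline[b].items()}
    slowest = max(averages, key=averages.get)
    print(datetime.fromtimestamp(b).strftime("%H:%M:%S"),
          {t: round(m) for t, m in sorted(averages.items())},
          "<- slowest:", slowest)
```

With the tiers lined up bucket by bucket, a freeze that looks like an application problem can be traced to the tier that actually stalled in that window, instead of feeding the next goose chase.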
It’s always astonishing to me that this basic first step is usually not implemented. If you don’t have the data, invest time in creating/deploying some useful probes that will collect the data you need, and don’t forget to do so during the malfunction!
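If no such data exists yet, even a crude probe helps. As one hypothetical sketch, here is a decorator that stamps every call with its tier and duration, so that each tier's contribution shows up as a timestamped sample; the tier name and the sleeping stand-in function are made up for illustration.

```python
import functools
import time

samples = []  # (epoch_seconds, tier, duration_ms), one record per call

def probed(tier):
    """Hypothetical probe: record a timestamped response-time sample,
    tagged with its tier, for every call to the wrapped function."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                samples.append((start, tier, (time.time() - start) * 1000))
        return inner
    return wrap

@probed("db")
def run_query():
    time.sleep(0.01)  # stand-in for the real work of this tier

run_query()
print(samples[0][1], round(samples[0][2]))  # tier name and duration in ms
```

The point is not the mechanism but the habit: get the probes in place while the malfunction is happening, because data collected afterwards describes a healthy system.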
Beware of Goose Chases!