Eliminate outages: the engineering challenge

Outages this summer at Google, the New York Times, Microsoft, the NASDAQ stock exchange, and numerous smaller-scale organizations brought attention to information technology (IT) failures. Is there a technical fix for all this downtime? Is application performance management (APM) a ‘silver bullet’ that slays such problems?

Yes and no. APM certainly doesn’t directly address the security vulnerabilities most visible in, for instance, the battles between the Times and the “Syrian Electronic Army”. Most outages or breaches are inadvertent, though, and APM is a great contributor to “continuous operations”.

Bringing APM into clear focus

APM is both more and less than vendors often advertise and customers understand. We can make always-up and (nearly) error-free applications, but only by adopting NASA-like procedures and policies that are prohibitively expensive and leave no room for the responsiveness and flexibility organizations say they need from software.

Everything else is a compromise. A multitude of techniques and concepts vie for decision-makers’ attention. For me, APM hits a “sweet spot”: it automates and scales well, its out-of-pocket cost is manageable, it relates directly enough to end-user experience (EUE) to promote productive conversations between technicians and managers, it has the potential to integrate adequately with other tools and assets, and it is sufficiently mature and objective that it leads to definite improvement. APM makes sense to the devops in the datacenter at the same time as it directly addresses bottom-line business benefits.

APM has limits. Installing APM today doesn’t eliminate outages tomorrow; in the acute case, all it accomplishes is to ensure that the organization’s devops know about problems as soon as end users do. However, one aspect of APM that is too often under-sold is the extent to which it helps improve reliability over time. The historical monitoring that APM enables gives alert organizations the chance to recognize and diagnose problems early, when they’re small. The effect is to reduce turn-around on many outages from a five-hour disaster down to the five-minute or five-second level where they’re barely noticed.
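
To make “recognize problems early” concrete, here is a minimal sketch in Python of one way historical response-time data might be used. It is an illustration only, not how any particular APM product works: the function name find_early_warnings, the 20-sample window, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_early_warnings(samples, window=20, threshold=3.0):
    """Return the indexes of samples that sit more than `threshold`
    standard deviations above the mean of the preceding `window` samples.

    `samples` is a list of response times (any consistent unit).
    The window size and threshold are illustrative assumptions.
    """
    warnings = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        # Skip a perfectly flat history, where the spread is zero
        # and a z-score cannot be computed.
        if spread > 0 and (samples[i] - baseline) / spread > threshold:
            warnings.append(i)
    return warnings


# Hypothetical response times in milliseconds: steady, then one slow request.
response_times = [100, 102] * 10 + [101, 180]
print(find_early_warnings(response_times))
```

A comparison against recent history, rather than against a fixed limit, is what lets a small degradation stand out before it grows into an outage; real deployments would tune the window and threshold to their own traffic.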

APM also provides a basis for the “performance modeling” that growing organizations need. On the other hand, some commentators seem to believe that APM’s strategic potential means it eliminates the need for lower-level network monitoring. This is not true; it’s generally important to be able to integrate APM and network performance management (NPM).

Just as it’s hard to prove a negative, elimination of outages is a big, maybe impossible, challenge. Wise use of APM, though, can be part of a comprehensive plan to reduce outages to an insignificant level.