Monitor even what you trust: a non-linear imperative

“Trust, but verify”: whether you credit Ronald Reagan for the slogan or the traditional Russian proverb he acquired by way of Suzanne Massie (in Russian, it rhymes: Доверяй, но проверяй), it applies to information technology (IT) even more deeply than you might realize. This is particularly true for application performance management (APM) and security.

Think for a moment about APM. In broad terms, it emerged five or so years ago, then, in the last couple of years, lost priority within many enterprises. One reason for its decline among relatively sophisticated, Lean-oriented managers is that they prefer to “build in quality”, or, in this specific case, performance. As powerful as that idea is, it’s a mistake to conclude that high-quality construction makes application monitoring superfluous. In fact, it’s two different mistakes.

First, sustainable quality requires measurement. The impulse is understandable: a sufficiently enlightened development process will, the thinking goes, yield high-quality applications. That intuition fails to recognize, though, that APM needs to be part of development from the moment a project begins. IT Ops has published several articles this summer on TestOps arguing that testing, monitoring, and measurement need to play an even larger role in Agile and Lean methodologies. Mobile developers have lately taken the lead in preaching the importance of measuring performance from an application’s birth.
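As a minimal, hypothetical sketch of what “measuring from birth” can look like in practice, a lightweight timing wrapper can emit latency figures from the very first unit tests onward. The function names, the 50 ms budget, and the logger name below are illustrative assumptions, not prescriptions from any particular APM product.

```python
import functools
import logging
import time

log = logging.getLogger("perf")  # illustrative logger name

def timed(budget_ms=None):
    """Time a call and log the result; warn if it exceeds an optional budget."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                log.info("%s took %.2f ms", fn.__name__, elapsed_ms)
                if budget_ms is not None and elapsed_ms > budget_ms:
                    log.warning("%s exceeded its %.0f ms budget", fn.__name__, budget_ms)
        return wrapper
    return decorator

@timed(budget_ms=50)              # hypothetical 50 ms budget for this call
def lookup_customer(customer_id):
    time.sleep(0.01)              # stand-in for real work
    return {"id": customer_id}
```

The point is less the mechanism than the habit: the numbers exist from the first commit, so regressions show up as data rather than as surprises in production.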

A second reason for the necessity of APM is more subtle, and it harks back in a different way to President Reagan’s slogan. Some managers sincerely believe that verification applies only where trust is lacking: for them, security is about keeping out bad guys, and sufficiently well-engineered production systems can be relied on to deliver consistent results. The facts simply don’t support that well-meaning instinct. It’s a commonplace of systems-administrator folklore that “most security breaches are ‘inside jobs’.” At the same time, the profoundly non-linear nature of application performance means it’s prohibitively expensive to engineer in performance without run-time monitoring.

The fallacy in play seems to be that, once an application has been shaken down and performs adequately, it will continue to do so, especially if it is well managed in the sense that resource tolerances are built in. However appealing it is to double CPU, storage, network speed, and so on to accommodate doubled traffic, there’s really no guarantee of a happy outcome. Network engineering in particular is notorious for paradoxical performance.
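One way to make that non-linearity concrete is the textbook M/M/1 queueing result, in which mean response time is the service time divided by (1 − utilization). The numbers below are purely illustrative and are not drawn from any system discussed above.

```python
def mm1_response_time(service_time_ms, utilization):
    """Mean response time of an M/M/1 queue; grows non-linearly as utilization -> 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

# Purely illustrative: a 10 ms service time under increasing load.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.2f} -> mean response {mm1_response_time(10, rho):7.1f} ms")
```

Moving from 90% to 95% utilization doubles the mean response time, and at 99% it is a hundred times the bare service time. Doubling the hardware changes the utilization you sit at; it does not change the shape of that curve.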

Notice that it’s often the most careful and conscientious analysts who are prone to thinking they can do without monitoring. What they’re missing are the systemic non-linearities of computer applications. Highway traffic illustrates part of this pattern: a busy route can be flowing along briskly, then, with no warning, no accident, lane closure, or other obvious explanation, a “jam” forms. This isn’t (primarily) because of untrustworthy behavior or unexpected loads; it’s a predictable consequence of the traffic system’s non-linear dynamics, which leave it subject to “shocks”.
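The same “jam” dynamic can be sketched in a few lines. The toy simulation below, with entirely made-up numbers, models a server that handles 100 requests per tick: a brief burst above capacity leaves a backlog that lingers long after the burst ends, even though average load looks comfortable.

```python
def simulate_backlog(capacity_per_tick, arrivals):
    """Toy discrete-time queue: each tick, serve up to capacity, carry the rest over."""
    backlog = 0
    history = []
    for arrived in arrivals:
        backlog = max(0, backlog + arrived - capacity_per_tick)
        history.append(backlog)
    return history

# Made-up load: steady 95 requests per tick, with a brief burst of 130.
load = [95] * 10 + [130] * 5 + [95] * 20
print(simulate_backlog(capacity_per_tick=100, arrivals=load))
```

Because the queue drains at only five requests per tick once arrivals fall back to 95, the backlog from a five-tick burst needs roughly thirty ticks to clear, and in the trace above it is still nonzero when the simulation ends. That persistence is exactly the kind of behavior only run-time monitoring will catch.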

We see much the same back in the datacenter. As good as it is to build in quality and design resilient, capable systems, they will inevitably host vulnerabilities. Vigilant monitoring doesn’t mean we don’t trust our systems; it means we know they’re important enough to merit careful oversight.

The consequences of non-linearity in computing systems, along with a thorough account of, for instance, the hazards of premature optimization, are weighty enough to deserve book-length treatment. You don’t have to work through all those details, though. The conclusion for now is clear: develop and deploy systems with enough quality that you can trust them, but never stop verifying them with real-time APM monitoring.
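To make that concrete, here is a minimal sketch of the kind of always-on verification the slogan implies: a loop that probes an endpoint, times the response, and complains when a threshold is crossed. The URL, budget, and interval are illustrative assumptions, and a real deployment would use a proper APM agent rather than a hand-rolled probe.

```python
import logging
import time
import urllib.request

log = logging.getLogger("apm-probe")

CHECK_URL = "https://example.internal/healthz"   # hypothetical endpoint
LATENCY_BUDGET_MS = 250                          # illustrative threshold
INTERVAL_SECONDS = 30

def probe_once(url=CHECK_URL, budget_ms=LATENCY_BUDGET_MS):
    """Time a single request; return latency in ms, or None on failure."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            resp.read()
    except OSError as exc:
        log.error("probe failed: %s", exc)
        return None
    latency_ms = (time.perf_counter() - start) * 1000.0
    if latency_ms > budget_ms:
        log.warning("latency %.0f ms exceeds %.0f ms budget", latency_ms, budget_ms)
    return latency_ms

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    while True:                      # trust the system, but keep verifying it
        probe_once()
        time.sleep(INTERVAL_SECONDS)
```

In practice these measurements would feed an APM platform rather than a log file, but the principle is the same: the verification never stops.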