IT Change Mismanagement: 3 Common Worst Practices to Avoid in IT Operations

My last two posts dealt with best practices in the realm of IT change management. I was thinking about my next topic while waiting for a delayed flight and reading “The Ultimate Book of Heroic Failures” by Stephen Pile, which led me to some examples of worst practices I’ve seen in my years of solution engineering. The following are hypothetical, generic examples to protect the innocent, but they represent problems that could occur in any IT environment; all of them stem from simple human error, which happens to us all.

Worst Practice 1: Hard-Coded Configuration

This scenario exemplifies what can happen when IT is forced to cut corners and rush to meet deadlines. Say you are a developer implementing changes to the database connection methods of a customer-facing application. Because you’re pressed for time, you put the database details straight into the code and make a mental note to go back later and do it properly. Time passes, the change goes into production, and performance issues appear. Hours can be spent examining every aspect of the production database before anyone realizes the configuration simply isn’t being used, and a small UAT database is sitting in a dark corner of the data center, sweating in its newfound production role.
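
To make the antipattern concrete, here is a minimal sketch in Java; the class name, hosts, credentials, and property keys are all hypothetical. The first method bakes the UAT connection details into the code, so production configuration is silently ignored; the second reads them from externalized configuration, so the environment you deploy to is the one you actually talk to.

```java
// Minimal sketch of the antipattern and its fix. Class, host, credential,
// and property names are hypothetical placeholders, not from any real system.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class OrderDao {

    // Worst practice: the UAT database details are baked into the code,
    // so the production configuration files are silently ignored.
    public Connection connectHardCoded() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://uat-db-01:5432/orders", "app_user", "changeme");
    }

    // Better: read the connection details from externalized configuration
    // (system properties here; a config file or environment variables would
    // also work), so promoting the build picks up the production database.
    public Connection connectFromConfig() throws SQLException {
        Properties props = System.getProperties();
        return DriverManager.getConnection(
                props.getProperty("db.url"),
                props.getProperty("db.user"),
                props.getProperty("db.password"));
    }
}
```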

Solution: If you’re ever caught in a situation like this, you may think you can find the troubled component quickly. Sometimes you will guess right and the problem will be fixed; other times you will be led on a wild goose chase. To find the problem reliably, you need some sort of transaction tracing capability. Then you can see exactly which components a troubled transaction travels through and identify the bottleneck.
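
As a rough illustration of what tracing surfaces (the class and node names are invented, and in practice an APM agent or a standard like OpenTelemetry records this for you rather than hand-rolled code), imagine each transaction logging every hop it takes and the time spent there; the unexpected UAT host and the slowest hop jump out immediately.

```java
// Hand-rolled illustration of a transaction trace: each hop records the
// component the transaction actually touched and how long it spent there.
import java.util.ArrayList;
import java.util.List;

public class TransactionTrace {
    private final String transactionId;
    private final List<String> hops = new ArrayList<>();

    public TransactionTrace(String transactionId) {
        this.transactionId = transactionId;
    }

    // Record each component (web server, app server, database host) visited.
    public void recordHop(String component, long millis) {
        hops.add(component + " (" + millis + " ms)");
    }

    // Print the path; the slowest hop and the unexpected host stand out.
    public void print() {
        System.out.println("Trace " + transactionId + ": " + String.join(" -> ", hops));
    }

    public static void main(String[] args) {
        TransactionTrace trace = new TransactionTrace("txn-42");
        trace.recordHop("web-01", 12);
        trace.recordHop("app-03", 35);
        trace.recordHop("uat-db-01", 4800); // the UAT box sweating in production
        trace.print();
    }
}
```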

Worst Practice 2: Load Balancer Misconfiguration

In this environment, you have one load balancer in front of two data centers, A and B, each with two web and app servers. You configure the load balancer for the new application to span all four servers and sleep happily, confident in your implementation of disaster recovery best practices.

Then there is a mistake. The configuration gets botched and transactions flow only to data center A. Normally this wouldn’t be a large problem: with your big, fancy budget you have enough capacity that running on half of the cluster doesn’t impact the service. But what happens if data center A goes down? All the effort you put in is wasted, your service is completely down, and an unimpressed set of business owners is waiting on the outage bridge. How would you fix this?
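
One quick, informal way to answer that question is to check where requests are actually landing. Here is a minimal sketch, assuming the application echoes its serving node in a response header; the X-Served-By header and the URL are hypothetical:

```java
// Sample the endpoint and tally which node actually answers. If data center B
// never appears, the load balancer configuration deserves a second look.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;
import java.util.Map;

public class TrafficDistributionCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://app.example.com/health")).GET().build();

        Map<String, Integer> hitsPerNode = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            String node = response.headers()
                    .firstValue("X-Served-By").orElse("unknown");
            hitsPerNode.merge(node, 1, Integer::sum);
        }
        // Expect roughly even counts across all four nodes in both data centers.
        hitsPerNode.forEach((node, hits) -> System.out.println(node + ": " + hits));
    }
}
```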

Solution: Once everything is running, you don’t normally review all of your log files periodically to check that they make sense; in a complex application that could take over a day and is a very subjective exercise. With topology mapping of transaction flows, you can see what is and isn’t being used. The picture shows that only data center A is handling traffic during normal operation, so the issue can be raised and resolved without any end-user impact.
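
For illustration, here is a rough sketch of the kind of aggregation a topology view performs, assuming traced hops carry a data-center prefix in their node names (the dcA/dcB naming convention and the sample data are invented): group the hops by data center and flag any site that handled no traffic.

```java
// Group traced hops by data center and flag any site with zero traffic.
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopologyCheck {
    public static void main(String[] args) {
        // Hops collected from traced transactions (normally supplied by the tracer).
        List<String> tracedNodes = List.of(
                "dcA-web-01", "dcA-app-01", "dcA-web-02", "dcA-app-02", "dcA-web-01");

        Map<String, Long> perDataCenter = new TreeMap<>(Map.of("dcA", 0L, "dcB", 0L));
        for (String node : tracedNodes) {
            String dc = node.substring(0, node.indexOf('-'));
            perDataCenter.merge(dc, 1L, Long::sum);
        }

        perDataCenter.forEach((dc, count) -> {
            String warning = (count == 0) ? "  <-- no traffic, check the load balancer" : "";
            System.out.println(dc + ": " + count + " traced hops" + warning);
        });
    }
}
```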

Worst Practice 3: Application Rollout

Your team is preparing for a new application to go live on Monday morning in an environment with four application servers. During the rollout on Sunday night there is a deployment hiccup: you enable debug logging on node 1, find a missing JAR, roll the fix out to all four nodes, and life is good. Now fast-forward to Monday, when the users come into work and the load ramps up. Operations starts the week with the sobering news that a quarter of your users are experiencing massive performance issues. You look at the affected users, their work sites, and their departments, and can’t find a correlation. What is going on here?

Solution: This common issue goes to show that innocent mistakes can be made in any environment. The debug logging is still on from the weekend deployment, so one node in the cluster now shows a performance degradation compared to the other three. Cross-node analysis enables rapid identification of the problematic node, where the leftover debug logging can be quickly spotted and switched off. In fact, the moment the issue is tied to a single node, that node can be removed from the cluster for offline analysis, providing an immediate resolution to the response-time issues.
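
Here is a simplified sketch of that cross-node comparison, with made-up node names and response times: compare each node’s average response time against the cluster average and flag the outlier for offline analysis.

```java
// Compare average response time per node against the cluster and flag outliers.
import java.util.Map;

public class CrossNodeCheck {
    public static void main(String[] args) {
        // Average response time in milliseconds per node, e.g. from an APM tool.
        Map<String, Double> avgResponseMs = Map.of(
                "app-node-1", 2350.0,  // debug logging still enabled
                "app-node-2", 180.0,
                "app-node-3", 175.0,
                "app-node-4", 190.0);

        double clusterAvg = avgResponseMs.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);

        avgResponseMs.forEach((node, avg) -> {
            // Flag any node running at more than twice the cluster average.
            if (avg > 2 * clusterAvg) {
                System.out.println(node + " is an outlier (" + avg + " ms vs cluster avg "
                        + String.format("%.0f", clusterAvg) + " ms); pull it for offline analysis");
            }
        });
    }
}
```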

These three examples can happen to any IT team that is rushed into making changes to its environment. More than ever, IT is overtasked and underappreciated, and complex, heterogeneous environments compound the problem, making day-to-day life for IT Operations difficult. This is where transaction-centric application monitoring can play a role. Otherwise, your example may serve as a cautionary tale of the worst practices IT should avoid.