Root Cause Analysis

One should adopt a systematic approach to root cause analysis, so that the same procedure can be applied again and again and improved with each iteration to find and correct weaknesses in processes and systems. Capturing the analysis in a template also lets you develop measurable metrics for service levels and mean time to repair (MTTR). The person running the conference bridge is responsible for tracking the time of each activity, which yields that statistic. After an outage, or a near outage, all parties involved should hold a conference room review of what went wrong. Writing up the root cause analysis falls to the team leader, and the write-ups are collected by the internal auditors, who are tasked with monitoring service levels for compliance with customer commitments.
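For example, if the bridge log captures when each incident was detected and when service was restored, mean time to repair falls out directly. The following is a minimal sketch in Python; the incident records and timestamp format are made up for illustration.

    # Minimal sketch: computing mean time to repair (MTTR) from the
    # timestamps tracked on the conference bridge. The records below
    # are made up for illustration.
    from datetime import datetime

    TIME_FORMAT = "%Y-%m-%d %H:%M"

    incidents = [
        # (detected, service restored)
        ("2020-03-11 06:00", "2020-03-12 00:00"),  # an 18-hour repair
        ("2020-04-02 14:10", "2020-04-02 16:40"),  # a 2.5-hour repair
    ]

    def hours_between(start, end):
        """Elapsed time between two timestamps, in hours."""
        delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
        return delta.total_seconds() / 3600

    durations = [hours_between(detected, restored) for detected, restored in incidents]
    print(f"MTTR: {sum(durations) / len(durations):.1f} hours over {len(durations)} incidents")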

IT shops that are constantly putting out fires will have to make the extra effort to find time for this, but doing so should reduce their fire-fighting mode of operating.

Here are some ideas for what to put into the template. If the template is too long or rigid, people will grow weary of it and not give it serious thought, so keep it simple. You could use this simple outline (a sketch of the same fields as a structured record follows the list):

  • Description of problem
  • Effect on business and users
  • Root cause
  • Follow-up actions
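As noted above, here is a minimal sketch of the same four fields as a structured record, so write-ups stay uniform and can be collected and reported on by the auditors. The class and field names are assumptions, not an established format.

    # Minimal sketch: the RCA template as a structured record.
    # Class and field names are assumptions mirroring the list above.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class RootCauseAnalysis:
        description_of_problem: str
        effect_on_business_and_users: str
        root_cause: str
        follow_up_actions: List[str] = field(default_factory=list)

    rca = RootCauseAnalysis(
        description_of_problem="Near-total provisioning outage caused by failed LDAP replication",
        effect_on_business_and_users="No customer impact, but high risk of a total outage",
        root_cause="Replication monitoring did not detect an unresponsive LDAP server",
        follow_up_actions=["Change monitoring to read the logs", "Track vendor bug"],
    )
    print(rca)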

And here is an actual example:

Description of problem: On March 11, the provisioning system experienced a near-total outage because LDAP replication was not working on two of the three servers. Efforts to rebuild the second replica from the one server that was still running were ultimately successful after 18 hours on the conference bridge, spanning two shifts. If the last remaining server had failed, the company would have experienced a total outage and downtime of perhaps a day. That would have stopped sales, because new customers could not have been provisioned. We currently have no way to fail over beyond that, since an outage of two of the three servers is deemed extremely unlikely.

Effect on business and users: There was no impact to the business, but this was a high-risk situation that could have caused a complete outage. There were no latency issues while running on one server, since a single server can handle the volume of searches even at peak times. Updates are not an issue, since they are queued; queuing only becomes a problem when the queue fills up. User accounts do not need to be created instantly on all machines, since the provisioning system does not anticipate the user logging back in once their device has been provisioned and the sale is complete.

Root causes: We do not know why the first LDAP server crashed. The weakness in our system was that monitoring of LDAP replication was not working: it failed to spot that one LDAP server had become unresponsive. There is a script that runs an LDAP search against cn=replica, o=company.com. The script is supposed to not follow referrals, that is, not fail over to a working machine. Instead, the LDAP search was reported as working because it had silently failed over to a working server. This is a bug in this version of LDAP, and Oracle is working to fix it.
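To illustrate what the check is meant to do, here is a minimal sketch of a per-server replication check that disables referral chasing and queries each replica directly, using the python-ldap library rather than the ldapsearch command from the incident. The host names and timeout are assumptions.

    # Minimal sketch of a per-server replication check using python-ldap.
    # Host names and timeout are assumptions for illustration.
    import ldap

    REPLICAS = [
        "ldap://ldap1.company.com",
        "ldap://ldap2.company.com",
        "ldap://ldap3.company.com",
    ]
    BASE_DN = "cn=replica, o=company.com"

    def replica_is_responsive(uri):
        """Return True only if this specific server answers the search itself."""
        conn = ldap.initialize(uri)
        # Do not chase referrals; we want this exact server to answer or fail.
        conn.set_option(ldap.OPT_REFERRALS, 0)
        conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 5)
        try:
            conn.search_s(BASE_DN, ldap.SCOPE_BASE)
            return True
        except ldap.LDAPError:
            return False

    if __name__ == "__main__":
        for uri in REPLICAS:
            status = "OK" if replica_is_responsive(uri) else "UNRESPONSIVE"
            print(f"{uri}: {status}")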

LDAP update messages had accumulated in the ActiveMQ queue, and monitoring failed to notice the backlog. The queue for the LDAP server that had crashed outgrew its buffer size, which in turn caused the second LDAP server to overflow its own message queue buffer. This buffer size is not a physical limit but a service-level limit. There was no way the second LDAP server could have caught up on replication within 24 hours, so a decision was made to shut down the second server, restore it from the morning backup, and then let replication bring it current again, which should have taken two hours.
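A backlog check on the update queue could have caught this earlier. Here is a minimal sketch that reads the queue depth from ActiveMQ's Jolokia HTTP endpoint, assuming that endpoint is enabled; the broker name, queue name, port, credentials, and alert threshold are all assumptions.

    # Minimal sketch of an ActiveMQ queue-depth check via the Jolokia
    # HTTP endpoint. Broker name, queue name, port, credentials, and
    # threshold are assumptions for illustration.
    import requests

    JOLOKIA_READ = "http://activemq.company.com:8161/api/jolokia/read"
    MBEAN = ("org.apache.activemq:type=Broker,brokerName=localhost,"
             "destinationType=Queue,destinationName=ldap.updates")
    THRESHOLD = 50000  # alert when the backlog exceeds this many messages

    def queue_depth():
        """Return the current number of messages waiting in the queue."""
        resp = requests.get(f"{JOLOKIA_READ}/{MBEAN}/QueueSize",
                            auth=("admin", "admin"), timeout=10)
        resp.raise_for_status()
        return resp.json()["value"]

    if __name__ == "__main__":
        depth = queue_depth()
        if depth > THRESHOLD:
            print(f"ALERT: LDAP update backlog is {depth} messages")
        else:
            print(f"OK: queue depth is {depth}")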

Because the server simply froze, there were no messages in /var/adm/messages or the LDAP logs that revealed what had failed. The server had already been patched to the latest revision level, as had the OS, so it remains unknown why it crashed.

Lesson learned: We have deemed the decision to shut down the second server faulty; the service level should have been ignored in this case, because honoring it exposed the business to significant risk. Monitoring of the LDAP servers needs to be done at the log level, since a search run against a machine fails over to other servers, and because of a bug in the ldapsearch command we cannot prevent that. It is not possible to increase the queue size beyond its current limit, since letting replication lag beyond the four-hour service level would affect users who log in within four hours of provisioning their mobile device, which is not very likely. In that case, they might not find their account if the search for it was executed against a server on which replication had not yet created it.

Follow-up actions: Oracle is working to recreate the issue in its support center. LDAP support is changing the monitoring script to look at the logs and not rely on the LDAP search.
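To show the direction of that change, here is a minimal sketch of a log-level check that flags a replica as stale when its replication log has not recorded a successful update recently. The log path, the success marker, the timestamp format, and the staleness threshold are all assumptions; the real script would match the directory server's actual log format.

    # Minimal sketch of a log-level replication check. Log path, success
    # marker, timestamp format, and threshold are assumptions.
    import re
    import sys
    from datetime import datetime, timedelta

    LOG_PATH = "/var/opt/ldap/logs/replication.log"   # hypothetical path
    SUCCESS_MARKER = "replication update succeeded"   # hypothetical log line
    MAX_AGE = timedelta(minutes=30)
    TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

    def last_successful_update(path):
        """Return the timestamp of the most recent successful replication update."""
        latest = None
        with open(path) as log:
            for line in log:
                if SUCCESS_MARKER not in line:
                    continue
                match = TIMESTAMP.match(line)
                if match:
                    latest = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        return latest

    if __name__ == "__main__":
        last = last_successful_update(LOG_PATH)
        if last is None or datetime.now() - last > MAX_AGE:
            print("ALERT: replication looks stale on this replica")
            sys.exit(1)
        print(f"OK: last successful replication update at {last}")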