In the ER, IT Is Like Heat, Power, and Light

Some systems are mission critical. But when systems go down at a hospital, they are not just mission critical; they are critical to life.

Dr. John Halamka is a former ER doctor who turned to technology. He is now CIO of CareGroup, whose flagship teaching hospital is the Beth Israel Deaconess Medical Center in Boston. Dr. Halamka is an IT veteran who lived through a history-making four-day network outage in 2002. CIO magazine called it “one of the worst health-care IT crises in history….”

No patients were injured, but for four long days doctors could not order medicine or receive lab results. Decision-support systems went off-line. No one could read email.

The doctor blames that catastrophe on an out-of-date system, built without redundancy at a time (11 years is a long time in IT) when doctors and staff were less dependent on computers. Reflecting on the incident, he said, “[today] there is a sense that IT is like heat, power, and light–always they are assumed to be high performing.”

After gaining control of the situation, the doctor turned around IT operations and infrastructure at the hospital to such an extent that InformationWeek named CareGroup number one in its annual ranking of innovative IT groups two years running.

Two months ago, the system went down again. A virtual storage array winked off. Staff lost access to email, lab results, and images, and could not enter orders for procedures and medicines. There were two choices: either roll back to an earlier point in time, or rebuild the data from incremental backups and re-index the database. The second choice would take time, but no data would be lost. The outage lasted only a few hours. The hospital followed its downtime procedures, and engineers worked to minimize the impact on systems that were still working, albeit slowly.
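To make the trade-off between those two choices concrete, here is a minimal sketch in Python. It is not the hospital's actual recovery procedure. The data model (a backup as a dictionary of record IDs, with None marking a deletion) and the names restore and build_index are assumptions made up for illustration.

```python
from typing import Optional

# Hypothetical data model: a backup maps a record id to its value.
# In an incremental backup, None marks a record deleted since the previous backup.
Backup = dict[str, Optional[str]]


def restore(full: Backup, incrementals: list[Backup]) -> dict[str, str]:
    """Replay incrementals, oldest first, on top of the last full backup."""
    state = {k: v for k, v in full.items() if v is not None}
    for inc in incrementals:
        for key, value in inc.items():
            if value is None:
                state.pop(key, None)  # record was deleted
            else:
                state[key] = value    # record was added or changed
    return state


def build_index(state: dict[str, str]) -> dict[str, list[str]]:
    """Rebuild a simple lookup index: value -> sorted record ids."""
    index: dict[str, list[str]] = {}
    for key in sorted(state):
        index.setdefault(state[key], []).append(key)
    return index


# Illustrative data only.
full = {"order-1": "pending", "order-2": "pending"}
incs = [{"order-1": "filled"}, {"order-3": "pending", "order-2": None}]

rolled_back = restore(full, [])   # choice 1: later changes are lost
rebuilt = restore(full, incs)     # choice 2: slower, but nothing is lost
index = build_index(rebuilt)
```

In this toy example, rolling back would forget that order-1 was filled and that order-3 exists at all. Replaying the incremental backups takes more steps and requires rebuilding the index afterward, but it recovers both.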

The incident was contained because of improvements that grew, in part, out of lessons learned in 2002. The doctor says the changes he has made since then include:

  • Change the network topology to segment the network so that a surge in traffic in one area does not propagate to the rest.
  • Establish a relationship with the vendor (Cisco) to monitor the hospital’s network and highlight weaknesses.
  • Add out-of-band tools to monitor the network.
  • Document and rehearse a downtime test plan, including what to do about communications. (At the hospital, that included carrying lab results to doctors and nurses on foot.)
  • Do root-cause analysis.
  • Document everything. (Only one employee knew the system well enough to understand all of it.)
  • Use configuration management software.
  • Replace the management team (ouch).
  • Make sure the board of directors funds regular infrastructure upgrades.

Let’s hope no hospital experiences a four-day downtime again. At CareGroup, that seems unlikely, as Dr. Halamka is staying ahead of the problem.