Big Data too often Bad Data

Big Data is a bubble. That doesn’t mean it’s wrong–just fragile and oversold. A few concrete examples illustrate the problem.

Teach him … epistemology

There’s plenty to like about Big Data, and that has already been well publicized. Big Data’s advance through the hype cycle is also apparent and needs no further comment. Far too little appreciated, though, is how easily Big Data successes can be vacuous. I write this as an occasional Big Data consultant and sometime enthusiast: there are plenty of cases where a Big Data project or task strongly signals a profitable business action, yet the project itself is a loss.

This distinction takes a bit of background to grasp. Consider for a moment “Clinical Query Puts Hospital At Cutting Edge”:

So with the help of Clinical Query, a clinician or researcher might search the records to find out how many patients with breast cancer also take ACE inhibitors, a class of drug used to treat high blood pressure. If the results reveal a strong correlation between the drug and the malignancy, the hospital could do a deeper analysis and set up a formal research project to investigate the link. The ultimate goal is to discover a new medical intervention that would improve the survival of the entire population of breast cancer patients.

That’s not all. Beth Israel Deaconess Medical Center (BIDMC) Chief Information Officer John Halamka has a mature understanding of the value of data, and Clinical Query accordingly also automates preparation of documents for institutional review of an associated clinical trial, recruitment of pertinent subject populations, and so on: “It’s a huge time-saver”. Clinical Query carefully anonymizes data to protect privacy. Halamka’s team has made a major and ongoing effort to refine the ontology of medical language to improve the coding of health records. “We were the first Web-based health record, the first computerized provider order entry system, the first to use iPads”, Halamka boasts. This is Big Data at its best.

What’s the effect, though, when mediocre organizations–that is, most of them–undertake Big Data? How many businesses will recognize that the correct follow-up to identification of “a strong correlation” is “a formal research project”, not the immediate rollout of a new treatment protocol or, in the more likely business case, a new product sheet?

Even medical applications, nominally constrained by high ethical and scientific standards, are fraught with difficulties. Clinical Query’s contribution to the care of Native Americans living on Arizona reservations, or Haitian peasants, or Bengali villagers, is hard to estimate, because BIDMC’s patient population differs so much from those groups. Statistical inference is a notoriously difficult and error-prone undertaking. In what we plausibly expect to be the best possible case, peer-reviewed medical research, “… Most Published Research Findings Are False”, and “… Evidence-Based Medicine has failed.”
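One reason a “strong correlation” found by searching records calls for a formal follow-up: the more comparisons a query tool runs, the likelier it is that at least one looks significant by chance alone. The sketch below is only an illustration of that arithmetic. It assumes independent tests and a conventional 0.05 significance threshold; neither assumption, nor the function name, comes from Clinical Query or from the papers quoted above.

```python
def chance_of_false_alarm(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_comparisons` independent tests
    crosses the threshold `alpha` when no real effect exists in any of them.

    Assumption (illustrative only): the tests are independent and each has
    false-positive rate `alpha`.
    """
    if num_comparisons < 0:
        raise ValueError("num_comparisons must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # P(no false alarm in any test) = (1 - alpha) ** n, so take the complement.
    return 1.0 - (1.0 - alpha) ** num_comparisons


if __name__ == "__main__":
    # Hypothetical numbers of drug/diagnosis pairs examined in one session.
    for n in (1, 10, 100):
        print(f"{n:>3} comparisons -> {chance_of_false_alarm(n):.1%} chance of a spurious hit")
```

Under these assumptions, ten exploratory comparisons already carry roughly a 40% chance of at least one spurious “finding”, which is why the quoted workflow treats a correlation as a prompt for research rather than as a result.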

Medicine has at least one more well-documented lesson for business applications of Big Data: even in the best circumstances, where diagnosis is relatively certain and treatment possible, the cost of medical care itself might exceed its benefit! In business, Big Data might compute a result that indisputably leads to a 1.5% boost in gross margins–but at what cost? It’s not just that Big Data practitioners have salaries, and vendors have price schedules, that jointly testify to a bubble; more than that, Big Data-dictated initiatives can easily monopolize the attention of the top managers and domain experts who otherwise would be in a position to execute more fundamental improvements.
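The “at what cost?” question comes down to simple arithmetic. Here is a rough sketch: the 1.5% figure is the one used above, read here as 1.5 percentage points of revenue (an assumption), while the revenue, project cost, diverted-attention cost, and every name in the code are hypothetical placeholders.

```python
def net_benefit(
    annual_revenue: float,
    margin_gain: float,
    project_cost: float,
    diverted_attention_cost: float = 0.0,
) -> float:
    """Yearly gain from a margin improvement, minus what it cost to get it.

    margin_gain is a fraction of revenue (0.015 for 1.5 percentage points).
    diverted_attention_cost is a hypothetical estimate of the value of the
    management and expert time pulled away from other improvements.
    """
    gain = annual_revenue * margin_gain
    return gain - project_cost - diverted_attention_cost


if __name__ == "__main__":
    # All figures below are invented for illustration.
    result = net_benefit(
        annual_revenue=20_000_000,
        margin_gain=0.015,
        project_cost=250_000,
        diverted_attention_cost=100_000,
    )
    print(f"Net benefit: {result:,.0f}")
```

With these made-up inputs the margin gain is about 300,000 against 350,000 in costs, so an indisputably “successful” analysis still leaves the project roughly 50,000 in the red.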

Making the best of Big Data

Big Data is not all bad–far from it. It’s instructive, though, to learn from the weaknesses that medical research already highlights. IT Ops will return in future months to the broader topic of how to make the most of Big Data computing. In the meantime, keep in mind that part of the reason to emphasize application performance management (APM) projects is the clarity of their payoff. Delays in bringing up end-user displays have clear, quantifiable consequences, and APM, including Big Data-oriented methods, is a great way to prevent and mitigate those losses.