Growing Doubts About Big Data

By Gary Langer April 8, 2014

There's quite a kerfuffle going on in the world of big data, with a range of prominent articles in the past month suggesting it's not the analytical holy grail it's been made out to be. Taken together, these pieces suggest the start of a serious rethink of what big data can and can't actually do.

Perhaps most prominent is a piece in the journal Science on March 14. It builds on an article in Nature last year reporting that Google Flu Trends (GFT), after a promising start, flopped in 2013, drastically overestimating peak flu levels. Science now reports that GFT overestimated flu prevalence in 100 of 108 weeks from August 2011 on, in some cases with estimates that were double the CDC's prevalence data.

As well as picking apart GFT's problems (inconsistent data source, possibly inconsistent measurement terms), the authors blame "big data hubris," which they define as "the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis." Fundamentally, they add: "The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis."

While "enormous scientific possibilities" remain, the authors say, "quantity of data does not mean that one can ignore foundational issues of measurement." They also point (as we have in the past) to the vulnerability of some big data sources (including Twitter and Facebook) to intentional manipulation.

A prominent research statistician and author, Kaiser Fung, followed with a pretty sharply worded blog post, not only calling the GFT flu estimates an "epic fail" but saying the failure is emblematic of a broader problem in the big data world: "Data validity is being consistently overstated."

As to GFT itself, he added, "Google owes us an explanation as to whether it published doctored data without disclosure, or if its highly-touted predictive model is so inaccurate that the search terms found to be the most predictive a few years ago are no longer predictive. If companies want to participate in science, they need to behave like scientists."

Fung and the Science piece were both quoted in an op-ed in this Sunday's New York Times, in which a pair of New York University professors take their turn. They point out, for instance, that large datasets can produce large numbers of correlations that are merely spurious, that "many tools that are based on big data can be easily gamed" and that analytical tools can create an "echo-chamber effect," for example when Google Translate pieces together translation patterns on the basis of articles that have been produced using… Google Translate.
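The point about spurious correlations can be seen in a small simulation. The sketch below is not taken from the op-ed; it is an illustration under stated assumptions: every variable is pure random noise, each has only 20 observations, and a pair counts as "correlated" when the absolute Pearson correlation exceeds 0.5. The function names (`pearson`, `count_spurious`) and the thresholds are hypothetical choices made for this example.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count pairs of unrelated noise variables whose |r| exceeds threshold.

    Assumption: all variables are independent Gaussian noise, so any
    correlation found between them is spurious by construction.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1 for x, y in combinations(data, 2) if abs(pearson(x, y)) > threshold
    )


if __name__ == "__main__":
    for n_vars in (10, 50, 200):
        n_pairs = n_vars * (n_vars - 1) // 2
        hits = count_spurious(n_vars)
        print(f"{n_vars} variables, {n_pairs} pairs, {hits} pairs with |r| > 0.5")
```

The number of pairs grows roughly with the square of the number of variables, so even a small chance of a fluke per pair adds up to many apparent relationships when the dataset is wide, none of which reflects anything real.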

There's more: A piece in the Financial Times on March 28, "Big data: are we making a big mistake?" presents another pointed look at the shortcomings in big-data analysis, suggesting that reliance on correlations in the absence of a theory of their cause is "inevitably fragile." The size and inherent messiness of big data, the piece adds, can conceal misleading bias within. It includes this comment from David Spiegelhalter, a professor at Cambridge University: "There are a lot of small data problems that occur in big data. They don't disappear because you've got lots of stuff. They get worse."

Finally, there's a paper prepared for a conference of the Association for the Advancement of Artificial Intelligence by Zeynep Tufekci, an assistant professor at the University of North Carolina, Chapel Hill, entitled, "Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls." She lays out a range of difficulties in attempting to draw meaning from social media data in terms of sampling and analytical challenges alike, including many discussed in our own briefing paper on social media, first released in August 2012.

None of these pieces suggests that the concept of big data is dead. Rather they represent a pullback from the heady notion that very large datasets can somehow allow researchers to set aside the niceties of sampling, theory and attention to measurement error. More sober days may follow.