Blinded By Big Data: It's The Models, Stupid

In the Gold Rush to accumulate and put to use Big Data, we may be making it harder to glean insights from that data. Yet the prospect of data solving all of our problems is so tempting that even the brightest minds of our generation seem confused.

Take, for example, Irving Wladawsky-Berger, a strategic advisor to Citigroup and former IBM executive. Wladawsky-Berger is exceptionally bright, someone whose insights into open source helped me a great deal while he was still at IBM. But writing in The Wall Street Journal ("Spotting Black Swans with Data Science"), he "[gets] the Black Swan idea backwards," as Nassim Taleb, professor at New York University’s Polytechnic Institute and author of The Black Swan, points out.

Completely. Backwards.

Predicting Black Swan Events

Black Swan events are major events that take us by surprise, but afterwards yield clear explanations as to why they happened. Examples include the 9/11 attacks, the rise of the Internet and World War I.

But they can also apply to business, and so there is a temptation to apply Big Data to spot such Black Swans before they happen. As Wladawsky-Berger writes:

This [Big Data] ability to work across data sets and silos could help us get early clues to hard-to-predict, high-impact black swan events, so we can dig deeper into these clues and assess their validity.  When experts investigate catastrophic black swan events, be they airline crashes, financial crises, or terrorist attacks, they often find that we failed to anticipate them even when the needed information was present because the data was spread across different organizations and was never properly brought together.

Unfortunately for Wladawsky-Berger's analysis, Black Swans, by their very definition, cannot be predicted by analyzing the data. Yes, Black Swan events always look eminently predictable in hindsight, yet no one ever predicts them. 

More Data, More Problems

Equally unfortunate, the more data we throw at the problem, the harder it becomes to predict such events, as Taleb highlighted on Twitter.

The bigger the data set, the harder it becomes to sift through the noise to find the signal, because we're more prone to fixate on incorrect correlations between disparate data sets. As Taleb goes on to note, "The world has today between 50K and 100K variables, hence >1 billion correlations. So the spurious will be used." 
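Taleb's arithmetic is easy to check: n variables form n(n-1)/2 distinct pairs, so 50,000 variables already yield more than a billion possible correlations. The Python sketch below computes that count and then runs a small simulation of the "spurious" part. The simulation is only an illustration under stated assumptions: the variables are independent Gaussian noise, each has 20 observations, and a pair counts as "correlated" when its absolute Pearson coefficient exceeds 0.5. The function names and thresholds are invented for this example.

```python
import itertools
import math
import random


def pair_count(n_variables):
    """Number of distinct variable pairs: n * (n - 1) / 2."""
    return n_variables * (n_variables - 1) // 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_variables, n_observations=20, threshold=0.5, seed=0):
    """Count pairs of independent noise variables that look correlated.

    Every variable is pure Gaussian noise (an assumption of this sketch),
    so any pair that clears the threshold is spurious by construction.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )


if __name__ == "__main__":
    print(pair_count(50_000))   # 1249975000
    print(pair_count(100_000))  # 4999950000
    for n in (10, 30, 60):
        print(n, pair_count(n), count_spurious(n))
```

Because the number of pairs grows with the square of the number of variables, the number of chance "findings" grows far faster than the data set itself, even though none of them means anything.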

Or as Taleb writes in Antifragile:

In business and economic decision-making, data causes severe side effects - data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well-discussed property of data: it is toxic in large quantities - even in moderate quantities.

This problem gets worse the more often we look at the data:

The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal); hence the higher the noise to signal ratio.
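One way to see why frequent observation raises the noise-to-signal ratio is a toy model, not something taken from Taleb's text: suppose a quantity drifts steadily (the signal) while also fluctuating randomly (the noise). Over an interval of length dt the drift scales with dt, but the random fluctuation scales with the square root of dt, so shrinking the interval shrinks the signal faster than the noise. The Python sketch below assumes that model, and the 10% drift and 10% volatility figures are hypothetical.

```python
import math


def noise_to_signal(drift_per_year, volatility_per_year, observations_per_year):
    """Noise-to-signal ratio of one observation under a drift-plus-noise model.

    Assumes the signal grows linearly with the interval length while the
    noise grows with its square root (a random-walk assumption).
    """
    dt = 1.0 / observations_per_year
    signal = abs(drift_per_year) * dt
    noise = volatility_per_year * math.sqrt(dt)
    return noise / signal


if __name__ == "__main__":
    # Hypothetical figures: 10% annual drift, 10% annual volatility.
    for label, freq in (("yearly", 1), ("quarterly", 4), ("daily", 365)):
        ratio = noise_to_signal(0.10, 0.10, freq)
        print(f"{label:>9}: noise/signal = {ratio:.1f}")
```

Under these assumptions a yearly look gives roughly equal parts noise and signal, a quarterly look doubles the ratio, and a daily look multiplies it by the square root of 365, or about 19.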

So what to do?

Better Models, Not More Data?

Simon Wardley, an innovation researcher at CSC, suggests we should look beyond bigger data to better models, holding that "Historically, it's been about relative balance and flow between unmodelled to modelled. The value is not the data but the models."

Yet as software engineer Simon Wart reminds us, before the Wall Street meltdown "even low-level IT peons knew the models were a joke [but] our tribal mindset blinded us to the consequences."

Which may well be the problem: we are human. All too human.

Whether in our models, our collection of certain kinds of data, or our interpretation of that data, we bring personal biases to the analysis, as Microsoft Research's Kate Crawford argues in Harvard Business Review. We cannot avoid this bias, and the attempt to look for correlation rather than causation in our data solves nothing.

In fact, it arguably makes the problem worse, because it gives us too much confidence in our data.

A Little Data Never Hurt Anyone

The trick, then, is to approach our data with caution. It's not that data can't help us anticipate the future. It can. Just ask the City of Chicago, which has a very successful predictive analytics platform used to anticipate crime and health trends, among other things.

But there is a reason most enterprises still use Big Data technologies like Hadoop to solve old problems like ETL, rather than analytics. We're still early in Big Data, and enterprises rightly suspect that Big Data isn't some magic pixie dust that immediately yields insights into how much to charge, where to market, etc. Big Data can help, but it's not The Answer.

And it's certainly not the answer to predicting Black Swan events. To do that, you don't need data. You need hindsight.
