Just as the traditional sequential (and slow) “waterfall” approach to software development has begun to fall into disrepute, a similar approach to Big Data analytics has surfaced. For too many organizations, devising ever more powerful models for predicting behavior is becoming an end in itself, obscuring the process by which we learn from our data.
Indeed, much of the focus on Big Data lies in accumulating more and more data to improve our ability to predict what consumers will buy, which customers will churn, and so on. And so we pour inordinate effort into perfecting our prediction models rather than learning from the many ways they fail.
Unfortunately, our data infrastructure too often gets in the way of our ability to embrace failure, which is why the cloud is so important to Big Data.
The Process Of Prediction
As Michael Schrage, a research fellow at MIT Sloan School’s Center for Digital Business, stresses:
[The] most enduring impact of predictive analytics … comes less from quantitatively improving the quality of prediction than from dramatically changing how organizations think about problems and opportunities.
In other words, if we’re paying attention, the process can help us “better understand the real business challenges [our] predictive analytics address.”
But to do this well, we need to be willing to fail. Again. And again. As Schrage notes:
Ironically, the greatest value from predictive analytics typically comes more from their unexpected failures than their anticipated success. In other words, the real influence and insight come from learning exactly how and why your predictions failed. Why? Because it means the assumptions, the data, the model and/or the analyses were wrong in some meaningfully measurable way.
Failure, then, is the key to learning from Big Data. Hadoop vendor Cloudera rightly challenges us to “ask bigger questions,” but a key part of asking them is iterating, through trial and error, toward the right questions.
Institutionalizing Failure In The Cloud
While a cloud environment won’t kill a company’s fixation on Big Models for Big Data, it sets the appropriate tone for experimentation. Big Data is all about asking the right questions; hence the importance of domain knowledge.
This is why I keep coming back to Gartner analyst Svetlana Sicular’s contention that “Learning Hadoop is easier than learning the company’s business,” which means that the first place to look for Big Data expertise is in-house, not the land of magical data-science fairies.
Even so, no matter how smart you or your data-science team is, your initial questions are almost certainly going to be wrong. In fact, you’ll probably fail to collect the right data and to ask pertinent questions—over and over again.
As such, it’s critical to use a flexible, open data infrastructure that allows you to continually tweak your approach until it bears real fruit.
In a conversation I had with Matt Wood (@mza), general manager of data science at Amazon Web Services, he described just how hard it is to approach data correctly when our hardware and software infrastructure gets in the way:
Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on. You need an environment that is flexible and allows you to quickly respond to changing big data requirements. Your resource mix is continually evolving—if you buy infrastructure it’s almost immediately irrelevant to your business because it’s frozen in time. It’s solving a problem you may not have or care about any more.
Cloud, in other words, is all about creating a culture that can iterate without fear of failure.
All Your Big Data Are Belong To The Cloud
This isn’t to suggest that the cloud eliminates failure. Quite the contrary. As Wood says, it’s all about making the cost of failure acceptable: “You’re going to fail a lot of the time, and so it’s critical to lower the cost of experimentation.”
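To make that concrete, here is a minimal sketch of what cheap, disposable experimentation can look like in practice: a transient Spark cluster that exists only for the life of a single experiment and tears itself down when the run finishes. This is an illustration using the AWS EMR API via boto3, not a workflow Wood or AWS prescribes; the cluster name, instance types, IAM roles and S3 script path are all hypothetical placeholders.

import boto3

# Launch a transient EMR cluster that runs one Spark job and then terminates.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="churn-experiment-42",          # hypothetical experiment name
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Workers", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # The cluster shuts down as soon as the step completes,
        # so you only pay for the duration of the experiment.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "run-experiment",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/experiments/churn_v1.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",   # assumed default EMR roles
    ServiceRole="EMR_DefaultRole",
)

print("Started disposable experiment cluster:", response["JobFlowId"])

If the experiment’s assumptions turn out to be wrong, nothing is lost but a few instance-hours; the next iteration is simply another run with a revised script.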
It’s also not to suggest that Big Data projects will only succeed in the cloud. As Shaun Connolly, vice president of Strategy at Hortonworks, a leading Hadoop vendor, told me:
I believe there will be multiple centers of data gravity, one of which is on-premises. But I am convinced Hadoop in the cloud plays a significant role in the broader architecture as the Hadoop market continues to mature.
In sum, Big Data doesn’t have to be in the cloud, and for many workloads it may make sense to store, process and analyze the data on-premises. But for building a culture of experimentation, the essence of Big Data discovery, cloud is critical.