Big Data May Be A Pretty Small Problem

The idea that a business needs data analysis to make better business decisions is not in dispute… but there is currently a strong debate about how big a data set a business actually needs and how much it needs to spend to get that data.

The lure of big data is a powerful one… your web site is flooded with tracking and logging data, after all, and if you only had the tools to store and analyze that data, you could learn the secrets of making your business successful, discover the Colonel's secret recipe and figure out the Question that goes with the Answer 42.

Well, maybe not that much detail, but with the level of hype around big data, one sometimes wonders.

One standard approach to analyzing this data is the installation and configuration of Hadoop servers that are grouped together in clusters of machines - either physical or virtual. Hadoop clusters use distributed storage that makes it relatively simple to store a lot of data quickly, with less pain than configuring a relational database. They also use Java-based MapReduce software to reach into that data and scoop out what you really want - golden nuggets of pure information.
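For readers who have not seen the pattern, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop code (Hadoop's MapReduce is Java-based and spreads these steps across a cluster); the log lines and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical web log lines in the form "<visitor> <page>".
LOG_LINES = [
    "alice /home",
    "bob /pricing",
    "alice /pricing",
    "carol /home",
    "bob /home",
]


def map_phase(line):
    """Map step: emit one (page, 1) pair for each log line."""
    _visitor, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between map and reduce.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run the three steps in order over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


page_hits = run_job(LOG_LINES)
print(page_hits)
```

On a data set this small the whole job runs in memory on one machine, which is roughly the situation the rest of this article argues many businesses are actually in.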

There are limits to MapReduce, naturally: it doesn't perform analysis in real time, but rather in occasionally time-consuming batches, and setting up MapReduce software to do exactly what you need has been compared to getting a root canal. This is why there is an entire ecosystem around Hadoop dedicated to working around those shortcomings, introducing real-time analysis, structured database tools, and software that can convert existing database queries written in Structured Query Language (SQL) to something MapReduce can handle.

But even though Hadoop is relatively inexpensive and easy to scale out onto many machines that run the Linux operating system, is this approach the equivalent of using a wrecking ball to knock down a dollhouse?

Too Much Data?

Some would argue that is indeed the case. A January 2013 paper from Microsoft Research, for instance, disputes the notion that most of the data analysis a business does even needs a Hadoop cluster, arguing that it could instead run on a single, more powerful scaled-up server.

According to the authors of "Nobody Ever Got Fired For Buying a Cluster," the data set sizes of many given businesses are not typically large enough to warrant scaled-out clusters of multiple computers.

You would expect that to be the case for small- to medium-sized businesses (SMBs), but it's also true for enterprises. Even the mega-companies for which big data tools were practically invented don't need those tools a large majority of the time.

For example, the authors found, an analysis of 174,000 jobs submitted to a production analytics cluster in Microsoft had a median job input data set size of less than 14 GB, and 80% of jobs had an input size of less than 1 TB.
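To make those two statistics concrete, here is a small Python sketch of how a median and an "under 1 TB" share could be computed from a list of job input sizes. The sizes are invented for illustration and are not the Microsoft figures; the `summarize` helper and the choice of 1 TB = 1,024 GB are likewise assumptions.

```python
from statistics import median

TB_IN_GB = 1024  # assumption: 1 TB counted as 1,024 GB

# Hypothetical job input sizes, in GB (not data from the paper).
job_input_sizes_gb = [0.2, 1.5, 3, 8, 12, 40, 250, 900, 2048, 6000]


def summarize(sizes_gb, threshold_gb=TB_IN_GB):
    """Return (median size in GB, fraction of jobs below the threshold)."""
    below = sum(1 for size in sizes_gb if size < threshold_gb)
    return median(sizes_gb), below / len(sizes_gb)


median_gb, share_under_1tb = summarize(job_input_sizes_gb)
print(f"median input: {median_gb} GB")
print(f"jobs under 1 TB: {share_under_1tb:.0%}")
```

A few very large jobs barely move the median, which is why a cluster can hold a handful of terabyte-scale jobs and still have a typical job measured in gigabytes.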

The paper cites another study from K. Elmeleegy that "analyzes the Hadoop jobs run on the production clusters at Yahoo. Unfortunately, the median input data set size is not given but, from the information in the paper we can estimate that the median job input size is less than 12.5 GB."

And Yahoo, by the way, is where much of the core functionality of Hadoop was developed, built on the distributed filesystem research conducted earlier at Google. If they aren't using Hadoop for mega jobs all of the time, how appropriate is Hadoop for a "normal" enterprise's data sets?

Facebook, the Borg-like consumer of all user data, surely needs the big data tools, right?

"Ananthanarayanan et al. show that Facebook jobs follow a power-law distribution with small jobs dominating; from their graphs it appears that at least 90% of the jobs have input sizes under 100 GB," the paper states. "Chen et al. present a detailed study of Hadoop workloads for Facebook as well as 5 Cloudera customers. Their graphs also show that a very small minority of jobs achieves terabyte scale or larger and the paper claims explicitly that 'most jobs have input, shuffle, and output sizes in the MB to GB range.'"

Most Data Is Small

The paper, which compares various configurations of Hadoop jobs on clustered computers, both physical and in the cloud, against a single scaled-up server, concludes that for a majority of data analysis work, the scaled-up server not only handled the workload well, it actually outperformed the clustered machines in many respects.

Now, as with any scientific paper, particularly one from a commercial vendor, some skepticism is in order. Here, the conclusions would seem to benefit Microsoft's sales model for pushing data analysis tools into the enterprise and even SMBs. Scaled-out Hadoop clusters on Linux, after all, are pretty cheap next to comparable Windows Server clusters, but even the least expensive Hadoop cluster can't hold a candle to the low price of a single scaled-up server.

Which may be the point of the paper, so take it as you will.

Still, there seems to be compelling evidence from sources other than Microsoft that the vast majority of data analysis jobs do not need much more than a strong server or even a personal computer to crunch the numbers and get those golden nuggets of information.

This is not to say that every data problem can be solved with an Excel spreadsheet and a laptop. The flexibility of non-relational (NoSQL) databases still makes them a very attractive solution for storing and analyzing data sets. And Hadoop is still a relatively inexpensive way to store a lot of data until such time as you need to massage it and discover the secrets of the universe or at least your third-quarter sales.

(See also Hadoop Adoption Accelerates, But Not For Data Analytics.)

Before beginning an exploration into the world of big data, businesses should be careful to separate hype from reality and make sure they don't overshoot their data needs with a solution that will be more costly to set up and operate in the long run.

Look at NoSQL databases as a way to hold and analyze data at lower cost than relational SQL databases. Or look at federated data services that can provide key information aggregated within your particular sector. And even look at the data you have: start playing around with it in a spreadsheet sometime and see what you come up with.

Hadoop is one way to work with data, but it is far from the only way.
