Batch Your Big Data Jobs—Or Stream Them?

Even among the über-sexy big data elite, Apache Spark is smoking. Promising dramatically better performance on in-memory (100x faster than Hadoop’s MapReduce!) and on-disk (10x faster!) storage, Spark seems to be leading the charge into a beautifully fast Big Data future.

According to some, Hadoop’s batch-oriented days—that is, where you have to pile all your data together, process it through Hadoop and then interpret the output—may be numbered. But while alternatives to batch processing certainly look promising, rumors of Hadoop’s death may be a wee bit exaggerated.  

Why Batch When You Can Stream?

Just as Hadoop started to hit mainstream consciousness, some people started touting The Next Big Thing. As Databricks engineer Patrick Wendell told me in an interview, we are at the “beginning of what will likely be a major expansion of streaming workloads over the next few years.” Such workloads would start yielding results while the analysis was still underway, rather than forcing you to wait for the entire job to finish.

Of course, “streaming analytics” is a lot easier to say than to implement, according to Wendell:

The big technical challenges with streaming are around operational complexity. Streaming programs are inherently more complex to maintain than offline batch processing engines, you have to be “always on,” have quick response time, and deal with bursty incoming data. Furthermore, it can be expensive from an engineering perspective to maintain two different stacks: one for batch processing and the other for streaming.

The answer, according to both Wendell’s Databricks and Zoomdata, another streaming analytics pioneer, is to consolidate big data technologies around streaming analytics. But the two companies approach the problem differently.

Streamlining Big Data

For Databricks, the company behind Apache Spark, the best approach is to “unify the streaming programming model with batch,” as Wendell explains. Doing so, as Databricks accomplished with Spark Streaming, “lets users take existing business logic and apply it in real time.” This means that “All of the effort they put into writing code to define metrics, do anomaly detection, etc., they can do it directly on their streaming data.”

The big payoff? “They only have to maintain one software stack.”

Just as important, as Cloudera’s Ted Malaska highlights, Spark Streaming allows you to “create data pipelines that process streamed data using the same API that you use for processing batch-loaded data.”
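To make the “one stack” idea concrete, here is a minimal sketch in plain Python. It is not Spark Streaming’s API; the function names, the latency readings, and the 500 ms threshold are all invented for illustration. The point is only that a rule written once can run over a finished batch or over data that is still arriving.

```python
from typing import Iterable, Iterator


def is_anomaly(latency_ms: float, threshold_ms: float = 500.0) -> bool:
    """Hypothetical business rule, written once (threshold is an assumption)."""
    return latency_ms > threshold_ms


def flag_anomalies(readings: Iterable[float]) -> Iterator[float]:
    """Yield anomalous readings one at a time, as each reading is consumed."""
    for reading in readings:
        if is_anomaly(reading):
            yield reading


def live_feed() -> Iterator[float]:
    """Stand-in for a streaming source; a real one would never end."""
    for reading in (80.0, 950.0, 130.0, 720.0):
        yield reading


# Batch: the data is already piled together in a list.
batch_result = list(flag_anomalies([120.0, 640.0, 90.0, 510.0]))

# Stream: the same function consumes readings as the source produces them.
stream_result = list(flag_anomalies(live_feed()))
```

Because `flag_anomalies` is lazy, a caller looping over it sees each flagged reading as soon as it passes through, without waiting for the whole job to finish.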

Not everyone agrees.

“Unnecessary Tradeoffs”

According to Zoomdata CEO and co-founder Justin Langseth (with whom I spoke recently about business intelligence and Big Data), batch-oriented systems like Hadoop are unnecessary in an increasingly real-time world:

There is no real need to batch up data given today’s modern architectures such as Kafka and Kinesis. Modern data stores such as MongoDB, Cassandra, HBase, and DynamoDB can accept and store data as a stream, and modern BI tools like the ones we make at Zoomdata are able to process and visualize these streams as well as historical data, in a very seamless way. Just like your home DVR can play live TV, rewind a few minutes or hours, or play movies from last century, the same is possible with data analysis tools like Zoomdata that treat time as a fluid.

As Langseth told me in our interview, proposed new architectures that incorporate the best of batch and real-time are a step backward:

Those who have proposed a “Lambda Architecture,” which effectively separates paths for real-time and batched data, are espousing an unnecessary trade-off, one that is optimized for legacy tooling that simply wasn’t engineered to handle streams of data, be they historical or real-time. At Zoomdata we believe that it is not necessary to separate-track real-time and historical data, as there is now end-to-end tooling that can handle both, from sourcing to transport to storage to analysis and visualization.

The key, as Langseth continues, is not to get mired in batch-oriented systems at all, even if you don’t currently care about real-time analysis of your data. Sticking with streaming data from the start “massively simplifies big data architectures [as] you don’t need to worry about batch windows, recovering from batch process failures, and so on,” he says.

In short, “even if you don’t need to analyze data from five seconds or even five minutes ago to make business decisions, it still may be simplest and easiest to handle the data as a stream nevertheless.”
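Langseth’s DVR analogy can be sketched with an append-only log read by offset. The following is a toy illustration in plain Python, not the API of Kafka, Kinesis, or Zoomdata; `EventLog`, `read_from`, and `running_total` are hypothetical names. Under that assumption, historical and fresh data travel the same path, and “rewinding” is just a matter of choosing an earlier offset.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, float]  # (timestamp, value)


class EventLog:
    """Hypothetical in-memory append-only log; events are never rewritten."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, timestamp: int, value: float) -> None:
        self._events.append((timestamp, value))

    def read_from(self, offset: int) -> Iterator[Event]:
        """Yield events starting at offset, including any appended mid-read."""
        while offset < len(self._events):
            yield self._events[offset]
            offset += 1


def running_total(events: Iterable[Event]) -> Iterator[float]:
    """One analysis function, used for old and new events alike."""
    total = 0.0
    for _, value in events:
        total += value
        yield total


log = EventLog()
for ts, value in [(1, 10.0), (2, 5.0), (3, 2.5)]:
    log.append(ts, value)

# Replay everything from the beginning ("movies from last century").
replayed = list(running_total(log.read_from(0)))

# A new event arrives; a reader positioned at the end sees only that event.
log.append(4, 4.0)
latest = list(running_total(log.read_from(3)))

# Rewind one event and run the same analysis over old and new data together.
rewound = list(running_total(log.read_from(2)))
```

There is no separate batch path here and no batch window to recover from; a failed reader simply resumes from its last offset. A production system would add persistence and partitioning, which this sketch leaves out.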

The Future Takes A Long Time

Even if Langseth is correct, and developers are better off dumping batch for stream-based systems, it’s going to take a long time to get there. As DataStax senior community manager Scott Hirleman told me, “Truly forward thinking companies are just starting to experiment now [with streaming analytics] so that says that even to reach Hadoop’s level of awareness will be a few years or more.”

Real-time analytics may be a thing, in other words, but it’s a thing that will take a long time to really hit.

And when it does, as Hadoop creator Doug Cutting stressed in an interview with me, “streaming [will simply] join[] the suite of processing options that folks have at their disposal.” In other words, it will be streaming and Hadoop, not streaming or Hadoop.

This may not be the beatific future Langseth envisions, but it may be the best we get.
