Big data is big news, but it’s still in its infancy. While most enterprises at least talk about launching Big Data projects, the reality is that very few do in any significant way. In fact, according to new survey data from Dimensional, while 91% of corporate data professionals have considered investment in Big Data, only 5% actually put any investment into a deployment, and only 11% even had a pilot in place.
Perhaps the biggest reason is that Big Data technologies are still way too hard to use—and sometimes insufficient for the kinds of data enterprises want to put to work.
But that’s changing. Last week I sat down with Justin Langseth, CEO of Zoomdata, to drill into the future of big data, and to better understand the intersection between batch-oriented technologies like Hadoop’s MapReduce and Spark, a real-time processing engine . While I published excerpts from that conversation in an earlier ReadWrite piece, it’s really worth reading Langseth’s observations in total.
See also: Batch Your Big Data Jobs—Or Stream Them?
Real Time Gets Real
ReadWrite: Hadoop has been all about batch processing, but the new world of streaming analytics is all about real time and involves a different stack of technologies.
Langseth: Yes, however I would not entangle the concepts of real-time and streaming. Real-time data is obviously best handled as a stream. But it’s possible to stream historical data as well, just as your DVR can stream Gone with the Wind or last week’s American Idol to your TV.
This distinction is important, as we at Zoomdata believe that analyzing data as a stream adds huge scalability and flexibility benefits, regardless of if the data is real-time or historical.
RW: So what are the components of this new stack? And how is this new big data stack impacting enterprise plans?
JL: The new stack is in some ways an extension of the old stack, and in some ways really new.
Data has always started its life as a stream. A stream of transactions in a point of sale system. A stream of stocks being bought and sold. A stream of agricultural goals being traded for valuable metals in Mesopotamia.
Traditional ETL processes would batch that data up and kill its stream nature. They did so because the data could not be transported as a stream, it needed to be loaded onto removable disks and tapes to be transported from place to place.
But now it is possible to take streams from their sources, through any enrichment or transformation processes, through analytical systems, and into the data’s “final resting place”—all as a stream. There is no real need to batch up data given today’s modern architectures such as Kafka and Kinesis, modern data stores such as MongoDB, Cassandra, Hbase, and DynamoDB (which can accept and store data as a stream), and modern business intelligence tools like the ones we make at Zoomdata that are able to process and visualize these streams as well as historical data, in a very seamless way.
Just like your home DVR can play live TV, rewind a few minutes or hours, or play moves from last century, the same is possible with data analysis tools like Zoomdata that treat time as a fluid.
Throw That Batch In The Stream
Also we believe that those who have proposed a “Lambda Architecture,” effectively separating paths for real-time and batched data, are espousing an unnecessary trade-off, optimized for legacy tooling that simply wasn’t engineered to handle streams of data be they historical or real-time.
At Zoomdata we believe that it is not necessary to separate-track real-time and historical, as there is now end-to-end tooling that can handle both from sourcing, to transport, to storage, to analysis and visualization.
RW: So this shift toward streaming data is real, and not hype?
JL: It’s real. It’s affecting modern deployments right now, as architects realize that it isn’t necessary to ever batch up data, at all, if it can be handled as a stream end-to-end. This massively simplifies Big Data architectures if you don’t need to worry about batch windows, recovering from batch process failures, etc.
So again, even if you don’t need to analyze data from five seconds or even five minutes ago to make business decisions, it still may be simplest and easiest to handle the data as a stream. This is a radical departure from the way things in big data have been done before, as Hadoop encouraged batch thinking.
But it is much easier to just handle data as a stream, even if you don’t care at all—or perhaps not yet—about real-time analysis.
RW: So is streaming analytics what Big Data really means?
JL: Yes. Data is just like water, or electricity. You can put water in bottles, or electricity in batteries, and ship them around the world by planes trains and automobiles. For some liquids, such as Dom Perignon, this makes sense. For other liquids, and for electricity, it makes sense to deliver them as a stream through wires or pipes. It’s simply more efficient if you don’t need to worry about batching it up and dealing with it in batches.
Data is very similar. It’s easier to stream big data end-to-end than it is to bottle it up.
Lead image by Roman Boed