Business Intelligence was the buzzword of the 1990s, scoring oodles of venture capital cash and plenty of customers, a huge percentage of which never got much value from their hefty investments. The promise of BI, like Big Data today, was to give business users the tools to turn raw data into actionable insights.
That was the sell, anyway. Despite grandiose promises, much of BI’s potential was obscured by the cost and complexity of deploying it.
A new generation of BI solutions like Tableau have arisen to strip out the complexity of yesterday’s BI. Unfortunately, virtually all of this newfangled BI remains fixated on structured data buried in relational databases. Most of the world’s information is semi-structured or unstructured, making today’s data a poor fit for yesterday’s BI.
When is someone going to create a BI offering born for Big Data?
Born At The Right Time
Actually, someone just might have done this, though you may not have heard of them yet. Zoomdata is one of the latest entrants into the Big Data BI market, with venture backers that include NEA and Accel Partners. Given how much money VCs wasted on the last round of BI, this isn’t all that impressive by itself.
No, what sets Zoomdata apart for me is not their VCs or even their customers (some of which are quite large—more on that below). Rather, Zoomdata’s magic is that it is built for the unstructured Big Data world. Rather than stripping away rich data models in Hadoop, MongoDB or Cassandra, Zoomdata embraces them.
Equally powerful, however, is the fact that Zoomdata moves the question to the data, whereas other vendors move the data into a black box or a traditional RDBMS before running a query.
That takes time. And network bandwidth.
In other words, Zoomdata figured out a way to stream process the results of a query back from the original data source and display the results in a sketch view that gets sharper as more data is processed. Compare that notion to watching the first few seconds of a streaming movie. Users start to see results immediately and get a good sense of what’s going on, and a short time later they see it all.
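That progressive-sharpening idea can be sketched in a few lines of Python. This is a hypothetical illustration of the concept, not Zoomdata's actual engine: process the data in micro-batches and keep a running estimate that converges on the exact answer as more batches arrive.

```python
import random

def sharpening_average(data, batch_size=100):
    """Yield a running estimate of the mean after each micro-batch.

    Early estimates are rough "sketches"; each batch sharpens the
    picture until the final value is exact over the full dataset.
    """
    total, count = 0.0, 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        total += sum(batch)
        count += len(batch)
        yield total / count  # current estimate, refined per batch

random.seed(42)
values = [random.gauss(50, 10) for _ in range(1000)]
estimates = list(sharpening_average(values))
print(estimates[0], estimates[-1])  # rough first guess vs. exact final mean
```

A user watching the stream of estimates sees a usable answer after the first batch, just as a viewer sees the opening seconds of a streaming movie before the whole file has arrived.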
I recently spoke to Zoomdata CEO Justin Langseth about Zoomdata and his plans for world domination of data visualization and BI analytics.
Lost In Translation
ReadWrite: How relevant is traditional BI in a world moving to Hadoop and NoSQL where data is more and more unstructured or semi-structured? In my experience too many BI vendors try to push an ODBC driver approach on their customers, sacrificing much of the richness of modern data technologies. [ODBC is a way of translating data between an application and the database, with a lot lost in translation when a NoSQL database is involved.]
Justin Langseth: BI is very relevant to Hadoop, NoSQL, and semi/un-structured data. In fact, the last company I founded, Clarabridge, is all about BI on unstructured data. Semi-structured (JSON, XML), key-value, and raw data of various forms are where most data growth is in the Big Data world.
But it’s critical that BI tools natively connect to these new sources, and especially that they leverage the power of the clusters behind them, instead of extracting data from them into proprietary cubes or a traditional RDBMS. You really want to connect to them through their native APIs, not through some kind of inefficient layer that attempts to provide SQL access.
RW: So how does Zoomdata consume data from NoSQL or Hadoop? How is your approach any different from the traditional analytics vendors?
JL: While we can support Cassandra, HBase or other NoSQL databases, I’ll use MongoDB as an example. With MongoDB specifically, we natively connect to the MongoDB aggregation API, and leverage it to perform sort, count, group, and other aggregate operations. We also use our micro-query engine against the MongoDB API, which allows for incremental data sharpening.
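In MongoDB's aggregation framework, the sort, count, and group operations Langseth mentions are expressed as a pipeline of stage documents. A minimal illustration follows; the collection and field names ("sales", "region", "amount") are hypothetical, and the pipeline here is just plain Python data:

```python
# A MongoDB aggregation pipeline expressed as a list of stage documents.
# With pymongo you would run it as: db.sales.aggregate(pipeline)
pipeline = [
    {"$match": {"year": 2015}},                # filter to one year
    {"$group": {                               # group rows by region
        "_id": "$region",
        "orders": {"$sum": 1},                 # count documents per group
        "revenue": {"$sum": "$amount"},        # sum a field per group
    }},
    {"$sort": {"revenue": -1}},                # sort groups, highest first
    {"$limit": 5},                             # keep the top five regions
]
print([next(iter(stage)) for stage in pipeline])
```

Because the pipeline executes inside the MongoDB cluster itself, the aggregation work stays close to the data rather than being pulled out into an external engine.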
We can show users in seconds an estimated view that morphs into the final view as they watch. This leverages the power of the underlying MongoDB clusters without extracting raw data into something else, and without requiring translation through a SQL conversion layer.
We can also visualize real-time data that is being fed into MongoDB through another process, or can optionally receive real-time data into Zoomdata and have Zoomdata land it into MongoDB for historical storage.
Either way, Zoomdata then allows for a DVR-like interface on top of MongoDB data to switch between real-time data views and replaying or fast-forwarding through history. We consume data in the same “native” fashion with Cloudera Impala, Spark, Amazon Redshift, ElasticSearch, Solr and various other relational databases and streaming APIs.
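The DVR metaphor amounts to parameterizing the same query by a time window: a live view is just the window ending at "now," while replay and fast-forward slide that window through history. A rough sketch of the idea in Python (all names hypothetical):

```python
from datetime import datetime, timedelta

def window_query(events, start, end):
    """Return events whose timestamp falls in [start, end).

    The live view uses a window ending at "now"; replay runs the
    identical query with the window shifted back through history.
    """
    return [e for e in events if start <= e["ts"] < end]

now = datetime(2015, 6, 1, 12, 0)
# One event per minute for the past hour.
events = [{"ts": now - timedelta(minutes=m), "value": m} for m in range(60)]

live = window_query(events, now - timedelta(minutes=5), now)
replay = window_query(events,
                      now - timedelta(minutes=30),
                      now - timedelta(minutes=25))
print(len(live), len(replay))
```

The same dashboard code can therefore serve both the real-time and the historical view; only the window endpoints change as the user scrubs back and forth.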
Lowering The Bar
RW: In your view, what are the biggest challenges that organizations face who want to make better use of all of their data?
JL: The biggest challenge is making the new developments of Big Data accessible to business people who are not data scientists or BI specialists.
Traditionally the BI industry has done a reasonable but not spectacular job of allowing business analysts to access relational database data. Today, however, more and more jobs are becoming data-driven, while the underlying data is becoming more and more non-relational and “big.”
In parallel with this, users are growing accustomed to a simple, Apple-like user experience, and want their enterprise applications to be as pretty and easy to use as the apps they run on their iPhones.
So the biggest challenge is how to provide a beautiful, simple, yet powerful interface and underlying tech stack to allow regular business people to access, visualize, and collaborate around data that is residing and streaming into a variety of big data backends, and do that efficiently at large data and user scale.
Getting Started With Big Data
RW: How should organizations get started with their “Big Data” project, assuming they don’t have a bevy of data scientist gearheads on staff?
JL: For companies with a moderate but not massive IT capability, consider limiting the number of Big Data backends that you use to one, or a small handful.
In terms of Hadoop, the industry is quickly moving to Spark. So instead of worrying about the Hadoop 1.0 tools like Pig, Hive, and HBase, just go with Spark.
Also there are data preparation tools that are now natively operating on Spark, such as Trifacta and Paxata. So for a company that wants to adopt a next-generation data stack today from scratch, I’d recommend picking a single key-value/document datastore such as MongoDB, and a single data-and-processing-at-scale system such as Spark, and maybe skip the rest of the Hadoop stuff other than Spark.
And to run Spark, consider an on-premise option such as Cloudera CDH, or a managed service option such as Databricks. Then look for next generation data tooling such as Trifacta, Paxata, and Zoomdata to sit on top of this next-generation stack.
Typical Use Cases
RW: What are the typical use cases for your customers? I’ve heard from more than one source that you have pulled in a number of big deals, including at least one over $10 million and several in the six- to seven-figure range. Are those typical?
JL: While I can’t comment on specific deals, I can say that the most typical use case for our customers is simply to power a data-driven application or data-driven service.
I can give you three examples off the top of my head.
We have one customer who has lots of real-time and historic cell phone location data with demographics stored in Cloudera Impala. They need a way to allow their end users, who are their end customers and who are non-technical, to visualize and analyze that data.
Another customer is building an application for the entire fashion industry to analyze product, color, pricing, and availability trends in the fashion/clothing industry. This data is being collected from various sources and then stream-loaded into MongoDB. Zoomdata sits on top to provide the analytical and dashboarding experience.
The third customer has a huge amount of medical data stored in Cloudera Impala and Cloudera Search, and needs a way for drug researchers to explore and understand patterns of disease and treatment efficacy over many years of history. We make that exploration and visualization of the analytics easy and fast for non-data scientists.
Lead image courtesy of Shutterstock