Over the years, Gartner has taken its fair share of criticism for hype and wildly wrong predictions. But in a Twisted Sister moment of karmic payback, Gartner analyst Nick Heudecker has come out swinging in a new report that rails against one of the latest examples of Big Data hype—what he calls the “data lake fallacy.”
(More about data lakes in a moment. For now, all you need to know is that they’re basically the opposite of a data warehouse: huge pools of data stored in its original format rather than collated, sorted and filed.)
Heudecker acknowledges that these data lakes provide near-term benefits to enterprises. But while “the marketing hype suggests that audiences throughout an enterprise will leverage data lakes,” he argues that most people won’t have the necessary skills to take advantage of the data.
In other words, for many, “data lake” roughly equates to “unsupervised digital landfill,” as one Fortune 100 IT executive described it.
Hating On The Analysts
It’s always been fun to pillory analysts for alleged bias toward big vendors that can afford to pay them and for being lagging indicators of big computing trends, among other things. I’ve done my share of carping on analysts, including Gartner, for getting trends like open source wrong.
But as fun as it may be to call out analysts for being human, these individuals also have to slog through their share of vendor reality distortion fields and other vendor silliness. Much of the confusion over what to do with Big Data is largely the fault of the vendors that sell technology around it.
No wonder, then, that Heudecker’s colleague, Merv Adrian, occasionally throws his tweeting hands up in disgust:
Pro tip: when I KNOW you sold almost nothing last time, telling me you’re “up 60%” with NO numbers is a yawner. And insulting.
— Merv Adrian (@merv) July 23, 2014
Even so, analysts are generally a pretty temperate bunch, rarely overtly criticizing vendors or their sloganeering.
Man Bites Dog
It was therefore surprising to see Heudecker go after one of the latest buzzwords making its way around Big Data circles: the data lake. Espoused by a variety of vendors (usually Hadoop vendors, but not exclusively so), the data lake is a mythical happy place for data to reside in its native format until someone within the enterprise needs to analyze it.
Or, as Heudecker describes it:
The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.
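The trade-off Heudecker describes is often called schema-on-read versus schema-on-write: the lake skips transformation at ingestion, but someone must still pay that cost at analysis time. A minimal sketch of the pattern (the file names and field names are hypothetical, not from any particular vendor’s product):

```python
import json
import os
import tempfile

def ingest_raw(lake_dir, source_name, raw_bytes):
    """Lake-style ingestion: write the payload untouched.
    No parsing, no schema, no upfront transformation cost."""
    path = os.path.join(lake_dir, source_name)
    with open(path, "wb") as f:
        f.write(raw_bytes)
    return path

def analyze(lake_dir, source_name):
    """Analysis time: the reader now pays the deferred cost.
    Parsing, field renaming and type reconciliation all happen here,
    and they require knowing how each source encoded its records."""
    with open(os.path.join(lake_dir, source_name), "rb") as f:
        records = [json.loads(line) for line in f]
    # Reconcile inconsistent field names from the raw feed (hypothetical).
    return [{"customer": r.get("cust") or r.get("customer_id"),
             "amount": float(r.get("amt", r.get("amount", 0)))}
            for r in records]

lake = tempfile.mkdtemp()
# Two records from the same feed, with drifting field names -- exactly
# the kind of inconsistency a warehouse would have cleaned on the way in.
raw = b'{"cust": "a1", "amt": "9.5"}\n{"customer_id": "b2", "amount": 3}\n'
ingest_raw(lake, "sales.jsonl", raw)
print(analyze(lake, "sales.jsonl"))
# -> [{'customer': 'a1', 'amount': 9.5}, {'customer': 'b2', 'amount': 3.0}]
```

Note that `ingest_raw` is trivial while `analyze` carries all the source-specific knowledge, which is Heudecker’s point: the cost doesn’t disappear, it moves to whoever reads the data.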
Sounds great, right? To an extent, it is. Pivotal and GE say that they’ve been able to jointly cut analysis times from weeks to days by eliminating the need to “spend a lot of time, effort and money on getting the data into the right format.”
But what isn’t mentioned in the linked article above, or by any of the companies marketing data lakes, is this, per Heudecker: “Since data lakes lack semantic consistency and governed metadata, [data lake] positioning assumes those audiences are highly skilled at data manipulation and analysis.”
He goes on:
Data lakes typically begin as ungoverned data stores. Meeting the needs of wider audiences require curated repositories with governance, semantic consistency and access controls — elements already found in a data warehouse. The fundamental issue with the data lake is that it makes certain assumptions about the users of information. It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without ‘a priori knowledge’ and that they understand the incomplete nature of datasets, regardless of structure.
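The “a priori knowledge” problem above is easy to illustrate. Without governed metadata, nothing in the lake records what a column means, so merging two sources can silently produce wrong answers. A hypothetical example, with invented feed names, where two feeds both carry an `amount` field in different units:

```python
# Two hypothetical feeds dumped into the lake with no metadata.
# Both have an "amount" field -- one in dollars, one in cents --
# but nothing in the lake itself says which is which.
web_feed = [{"order": 1, "amount": 19.99}]  # dollars (undocumented)
pos_feed = [{"order": 2, "amount": 1250}]   # cents   (undocumented)

# A naive merge mixes units and yields a meaningless total.
naive_total = sum(r["amount"] for r in web_feed + pos_feed)
print(naive_total)  # 1269.99 -- plausible-looking, and wrong

# Only an analyst who already knows each source's conventions
# can reconcile them correctly.
correct_total = (sum(r["amount"] for r in web_feed)
                 + sum(r["amount"] / 100 for r in pos_feed))
print(correct_total)  # 32.49
```

The naive result raises no error and looks plausible, which is why Heudecker argues that serving wider audiences requires the governance and semantic consistency a warehouse already provides.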
Some do, of course. But most don’t. (And finding those that do, as McKinsey & Co. notes, is not trivial.)
Small wonder, then, that Heudecker ironically notes that “most vendor offerings or discussions about data lakes include thinly veiled offers to build the surrounding workbench, services deployment, metadata and professional services.”
In other words, getting value from the data lake requires a lot of assembly—and there are lots of data-lake-promoting vendors lining up to help put it all together.
The Data-Lake Effect
Not that the data lake is a doomed concept.
Edd Dumbill, vice president of strategy at Silicon Valley Data Science, agrees with Heudecker’s general analysis, but remains optimistic.
Dumbill acknowledges that the data lake is a “dream, because we’ve a way to go to make the vision come true,” but insists that it is “an accessible dream.” He further suggests that Google and Facebook already live this dream, while sharing Heudecker’s concern that Big Data vendors selling the data lake dream have yet to solve its challenges of “managing provenance, data discovery and fine-grained security.”
In short, data lakes and other Big Data dreams can be very real, just as GE has experienced. But when vendors sell them as a panacea to Big Data woes, without calling out the very real problems with the approach, we risk scaring off buyers that need truth, not fiction.
Lead image by Max Charping