One Hadoop Distribution To Rule Them All?

The Hadoop market is getting interesting. Last year it was a death match between startups vying to own the heart of the project. Today it's a veritable smorgasbord of big-brand vendors getting involved to ensure they claim a big piece of the Big Data pie. Unlike American youth athletics, not everyone will get to take home a trophy.

Hadoop plays a key role in the burgeoning Big Data market, and is projected to be a $13 billion market by 2017, according to Markets and Markets. (IDC pegs the market much, much lower at $812.8 million in 2016, but its numbers don't seem credible to me, as they don't even seem to include Cloudera's sales.) Given that Big Data is hot, and Hadoop's data processing engine sits at its core, there's going to be a lot of money changing hands for Hadoop-related products and services.

Not everyone is going to collect.

SiliconAngle's John Furrier has challenged me on this, arguing that Hadoop is "not a winner take all market." While I, too, can see multiple winners in Hadoop, just as there have been in Linux (e.g., Red Hat dominates license/services revenue, but IBM, HP, and others make arguably more with related hardware, complementary software products, and professional services), markets don't tend toward entropy. They trend toward consolidation.

Today, the Hadoop ecosystem increasingly represents entropy:

  • Cloudera, Hortonworks, and MapR remain the early favorites, but with very different approaches. Hortonworks positions itself as the 100% open source player; Cloudera somewhat does the same, but adds in complementary, proprietary bits, mostly around managing Hadoop, to add value to Hadoop (and its top line revenue); and MapR provides a hybrid open source/proprietary Hadoop distribution that swaps out HDFS for its proprietary NFS storage layer.
  • EMC Greenplum has been involved with Hadoop for several years, and is set to release a new distribution of Hadoop called Pivotal HD. I've labeled Pivotal HD proprietary, but EMC's Hadoop team has taken issue with this characterization, arguing that Pivotal HD is 100% open source, with complementary functionality (like HAWQ) available as add-ons. Point well taken, and I apologize for my misunderstanding. I was wrong, perhaps not surprisingly getting confused by Pivotal HD's product page, which says little about open source. But what seems clear is that customers won't be confused by EMC's value proposition: Hadoop with an advanced SQL query engine to make it easier and more powerful to use.
  • Intel just got into the game with its own Hadoop distribution. Basically, you can think of it as Hadoop on (Intel Xeon™ processor, Intel SSD, and Intel 10GbE networking hardware) steroids.
  • For those who don't want to run Hadoop within the datacenter, Amazon offers Amazon Elastic MapReduce (EMR). As of April 2012, EMR was powering over 1 million Hadoop clusters. Presumably this number is much bigger today.
  • Many, many others, including IBM BigInsights, a range of startups, and more.

Will all of these companies make serious bank on Hadoop? No. Will some of them? Sure.

Ultimately, the winners in Hadoop will be those that invest most heavily in its success, as they will be perceived as the companies best positioned to help would-be customers succeed with Hadoop's complexities. But how they invest is up for discussion. Code to Apache Hadoop? Value-adding extensions?

Success isn't about open source purity, as Gartner's Merv Adrian posits: it's about making customers successful. As we saw with Linux, where Red Hat is both the top contributor to the Linux kernel and the company that harvests the most revenue from distributing Linux, contributing code is a great way to signal to the market that you're a leader and capable of getting code fixes to support customers. Code matters.

But code contributions are not the only way to demonstrate leadership and attract customers. Ultimately, companies that make it easier to get value from Hadoop will win big. There may be more than one such company. Indeed, there almost certainly will be. 

But there won't be 20 of them. Or even 10. Enterprise IT is simply not going to be able to manage a polyglot Hadoop distribution ecosystem. That's not the way markets work. No one wants to be a "long tail" vendor, and customers don't want to buy from them, either, as Hugh MacLeod humorously points out on Gaping Void:

Source: GapingVoidArt. Used with permission.

The Hadoop market over the next year is going to be hugely interesting. And bloody.

Image courtesy of Ehab Othman / Shutterstock.