Cloudera Releases New Version of Its Apache Hadoop Distribution as Competition Mounts

Cloudera, one of the primary contributors to Apache Hadoop, has released a new version of its Hadoop distribution: Cloudera’s Distribution including Apache Hadoop v3 (CDH3).

The new version contains over 1,000 patches and changes, many of which will be contributed back to the open source project. CDH3 includes a full stack of software, from the operating system to tools for working with Hadoop, such as Pig and Hive. CDH3 is free and open source – Cloudera makes its money selling enterprise support and management tools.

The release follows recent news that DataStax, Hadapt and MapR are joining the growing number of companies competing with Cloudera.

At the GigaOM Structure Big Data event earlier this month, DataStax announced a new product called Brisk. Brisk is a fork of Apache Hadoop that replaces the Hadoop file system and HBase datastore with Apache Cassandra, another BigTable-inspired database. DataStax is the sponsor company of Cassandra and sells enterprise support and management tools.

At the same event, two new Hadoop-focused startups, Hadapt and MapR, were also announced. MapR replaces the Apache Hadoop file system with its own proprietary alternative, and Hadapt aims to bring SQL-like functionality to the platform. Appistry also offers an alternative file system for Hadoop.

These companies join IBM in selling Hadoop-related products and services. IBM has its own Hadoop distribution, and sells a Hadoop-powered InfoSphere product geared towards making Hadoop easier to use.

Cloudera executives are dismissive of the competition, and aren’t shy about it. Charles Zedlewski, VP of products at Cloudera, told us in an interview that Brisk isn’t a “real” Hadoop distribution without the Hadoop file system and that “It’s astounding how little interest there is in Cassandra, so they need to use the Hadoop name.” At the GigaOM event, VP of Engineering Amr Awadallah told Derrick Harris that DataStax was making a “big mistake.”

This sort of infighting between Apache project contributors is disappointing, but to his credit Cloudera CEO Mike Olson did tell Harris “I believe there’s an enormous opportunity for smart companies, and even open-source projects, to build a new generation of data analysis tools on top of that platform.”

DataStax VP of Marketing Michael Weir was very civil when discussing Cloudera. He says Brisk was created to meet customer demand. Regarding the use of the name Hadoop, Weir says “We’ve been entirely transparent about what we’re using from Apache Hadoop and what we’re not.” You can find details in the white paper DataStax published. Weir says the Hadoop community has been welcoming, and that the company will be contributing its work on Hive to the Apache project.

As for demand for Cassandra: it’s in use at companies like Facebook and Twitter, and DataStax counts companies such as Netflix and Rackspace as customers.

Zedlewski was equally aggressive regarding IBM, saying IBM has not made any contributions to Hadoop and “IBM is offering a warranty on a car they never worked on or built.” He notes that Cloudera has been working with Hadoop for nearly three years now.

A three-year head start may not seem like all that much, and a few years down the road it won’t seem like much at all. But Cloudera has a team of engineers who have always been very close to the Hadoop project. Doug Cutting, the creator of Hadoop, works for Cloudera, and Awadallah has been involved for quite some time as well. Having top-flight talent is Cloudera’s ace in the hole.

Yet not even this advantage is complete assurance against future competition. Harris writes:

Also at the event, two independent sources suggested members of Yahoo’s Hadoop team will be spinning off their own separate business, and there is speculation this move is somehow tied into EMC’s Hadoop plans.

IBM isn’t to be taken lightly, nor is EMC on its own, but the latter turn of events would be a potentially market-changing situation given the Hadoop know-how within Yahoo, which has contributed the majority of the code now included in Apache Hadoop.

EMC is making a Hadoop-related announcement on May 9, but we don’t yet know what it will involve.

Disclosure: IBM is a ReadWriteWeb sponsor.