Things are getting lively in the Hadoop community, especially between Hortonworks and Cloudera. The issue? Which companies are contributing the most to Hadoop, and how contributions ought to be tallied up.
It started with Owen O’Malley of Hortonworks, who calculated contributions to Hadoop based on lines of code. The problem is that O’Malley credited work to each contributor’s initial employer, rather than to the employer at the time of the contribution.
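To see why the attribution rule matters, here is a minimal sketch. Every name, date and number in it is made up for illustration; none of it comes from O’Malley’s or Olson’s data. It credits the same contributions two ways: to each contributor’s first employer, and to the employer at the time of the contribution.

```python
from collections import Counter
from datetime import date

# Hypothetical employment history: contributor -> [(start date, employer)], oldest first.
EMPLOYMENT = {
    "alice": [(date(2006, 1, 1), "CompanyA"), (date(2011, 7, 1), "CompanyB")],
    "bob": [(date(2008, 1, 1), "CompanyC")],
}

# Hypothetical contributions: (contributor, date of contribution, lines of code).
CONTRIBUTIONS = [
    ("alice", date(2010, 3, 1), 500),
    ("alice", date(2011, 9, 1), 200),
    ("bob", date(2011, 2, 1), 300),
]


def initial_employer(person, when):
    """Credit the contributor's first known employer, ignoring the date."""
    return EMPLOYMENT[person][0][1]


def employer_at_time(person, when):
    """Credit whoever employed the contributor on the contribution date."""
    current = EMPLOYMENT[person][0][1]
    for start, employer in EMPLOYMENT[person]:
        if start <= when:
            current = employer
    return current


def lines_by_employer(attribute):
    """Total lines of code per employer under the given attribution rule."""
    totals = Counter()
    for person, when, lines in CONTRIBUTIONS:
        totals[attribute(person, when)] += lines
    return dict(totals)


by_initial = lines_by_employer(initial_employer)
by_time = lines_by_employer(employer_at_time)
print(by_initial)
print(by_time)
```

In this toy data, one contributor changing jobs is enough to move 200 lines from one company’s column to another’s.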
Patches or Lines of Code?
Mike Olson of Cloudera took another whack at the numbers, which I looked at last week. Olson broke out the numbers by looking at the patches contributed to Hadoop and its ecosystem (projects like HBase, ZooKeeper, Pig, Mahout and Oozie).
O’Malley has come back with a counter-post that tallies contributions by lines of code while sticking to Cloudera’s method of counting current employer. The result shows Hortonworks far ahead of Cloudera, Facebook, IBM and even Yahoo. For 2011, according to O’Malley, Hortonworks has contributed more than 42% of the lines of code to Hadoop, Yahoo nearly 26%, and Cloudera a bit more than 15%. Lines of code are a better measure, says O’Malley, because “patches differ in their investment of time and effort.” (Of course, the same thing can be said about a line of code, too.)
Finally, O’Malley does provide a comparison that looks at both patches and lines of code since 2006, and another comparison for 2011 alone. Counting by patches puts Cloudera in a much better light, with nearly 30% of patches in 2011 so far, compared to 25% for Hortonworks and about 23% for Yahoo.
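The gap between the two metrics is easy to reproduce. The sketch below uses invented companies and invented patch sizes, not the real Hadoop numbers, and it assumes that “lines of code” means lines added plus lines removed, which may not match how either post counted. It computes each company’s percentage share once by patch count and once by lines touched.

```python
from collections import defaultdict

# Hypothetical patches: (company, lines added, lines removed).
PATCHES = [
    ("CompanyA", 4000, 100),  # one large code drop
    ("CompanyB", 50, 10),     # several small patches...
    ("CompanyB", 30, 5),
    ("CompanyB", 20, 2000),   # ...including a cleanup that mostly deletes code
    ("CompanyC", 900, 50),
]


def shares(weight):
    """Percentage share per company, weighting each patch with weight(added, removed)."""
    totals = defaultdict(int)
    for company, added, removed in PATCHES:
        totals[company] += weight(added, removed)
    grand_total = sum(totals.values())
    return {company: round(100 * total / grand_total, 1)
            for company, total in totals.items()}


# Every patch counts once, whatever its size.
patch_share = shares(lambda added, removed: 1)
# Assumption: "lines of code" = lines added + lines removed.
line_share = shares(lambda added, removed: added + removed)

print("by patches:", patch_share)
print("by lines:  ", line_share)
```

With these made-up numbers, CompanyB leads by patches while CompanyA leads by lines, even though the underlying work is identical.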
Lively Competition
If you’re going to be comparing contributions, I think that the best way is to sum up patches and lines of code. There’s really no concrete way to objectively say “company Y absolutely contributed the most” to a project just by counting code. A company’s code contribution might be a small code drop that adds a killer feature. A company’s contribution may be a series of patches that effectively removes thousands of lines of code, but improves the project with better code.
I think it’s safe to say that Cloudera and Hortonworks are both making a good showing when it comes to Hadoop contributions, regardless of which company is actually contributing the most. And the results show that Hadoop is getting contributions from a healthy group of companies.