One of the questions that comes up frequently in open source projects is "who's contributing to this thing?" For single-company efforts like MySQL, it's usually pretty obvious where the bulk of contribution is coming from. But for projects like the Linux kernel or Hadoop, a little digging is in order. The problem with measuring contributions to projects is that it's not trivial to figure out how to credit contributions from individuals as they move from one company to another. Consider, for instance, the question of who really wrote Hadoop. Hint: It's not just Yahoo and Hortonworks, as some might have you believe.
Owen O'Malley of Hortonworks ran some numbers and came to some interesting (and very pro-Hortonworks) conclusions. Specifically, O'Malley says that "until this past June, Yahoo! contributed more than 84% of the lines of code still in Apache Hadoop trunk" and then goes on to say that (so far) in 2011 the biggest contributors to Hadoop are Yahoo! and Hortonworks. Not so fast, says Mike Olson of Cloudera.
The problem, Olson says, is that O'Malley didn't correctly attribute the companies that the contributors are working for. "Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today – at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop – the story is much more interesting."
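To see why the choice of attribution matters, here is a minimal sketch in Python that tallies the same patches two ways: by the employer at the time of the commit, and by the contributor's employer today. The patch log, the contributor names and the company names are all hypothetical placeholders. They are not drawn from O'Malley's or Olson's data, and neither analysis is known to have been computed this way.

```python
from collections import Counter

# Hypothetical patch log: (contributor, employer when the patch was committed).
# These names are placeholders, not real Hadoop contributors or companies.
patches = [
    ("dev_a", "CompanyX"),
    ("dev_a", "CompanyX"),
    ("dev_b", "CompanyX"),
    ("dev_c", "CompanyY"),
]

# Hypothetical mapping of each contributor to the employer paying them today.
current_employer = {
    "dev_a": "CompanyZ",  # dev_a has moved on from CompanyX
    "dev_b": "CompanyX",
    "dev_c": "CompanyY",
}


def credit_by_commit_time(patch_log):
    """Credit each patch to the employer at the time it was committed."""
    return Counter(employer for _, employer in patch_log)


def credit_by_current_employer(patch_log, employer_today):
    """Credit each patch to the contributor's employer today.

    Falls back to the commit-time employer when no current one is known.
    """
    return Counter(
        employer_today.get(dev, employer_then)
        for dev, employer_then in patch_log
    )


if __name__ == "__main__":
    print(credit_by_commit_time(patches))
    print(credit_by_current_employer(patches, current_employer))
```

In this toy log, CompanyX gets credit for three of the four patches when they are counted by commit-time employer, but only one when they are counted by where the contributors work today. The same history yields two different league tables.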
If you look at the data gathered by Olson, things look much less one-sided in favor of Yahoo. The exodus began in 2009, and the list of contributors by company is long indeed. The contributors include Twitter, Microsoft, Apple, Google, Ask.com, LinkedIn, IBM and many others.
Looking at the complementary projects, the community contributions become even more obvious. Olson says that "most of the innovation around Hadoop is now happening in new projects. That's not surprising – as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use." Contributions still roll into Hadoop, but they also go into supporting technologies like HBase, ZooKeeper, Pig, Mahout, Bigtop and a lot of other creatively named projects.
The 25% Rule
The contributions that come into the Hadoop ecosystem don't come from only one company, says Olson. "No one company sponsors more than a quarter of the new innovation in the Hadoop ecosystem and nearly half of all new patches are sponsored from a long tail of corporate benefactors and freelancers. In fact, I expect this picture to get more interesting over time. Just since the beginning of 2011, established companies like IBM, EMC, Informatica, Oracle and Dell have announced plans to invest in the Hadoop ecosystem in various ways."
Contributions to HADOOP, HDFS and MAPREDUCE as a percentage of total ecosystem contributions
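The "quarter" threshold is simple arithmetic, and a short Python sketch can make the check concrete. The function name, the sponsor names and the patch counts are illustrative assumptions, not figures from Olson's post. The long tail of small sponsors and freelancers is passed as a single count so that it adds to the total without being treated as one sponsor.

```python
def sponsors_over_threshold(named_counts, long_tail_count, threshold=0.25):
    """Return the named sponsors whose share of all patches exceeds threshold.

    named_counts: dict mapping a sponsor name to its patch count.
    long_tail_count: patches from the long tail, counted toward the total
        but never reported as a single sponsor.
    """
    total = sum(named_counts.values()) + long_tail_count
    if total == 0:
        return []
    return sorted(
        name for name, count in named_counts.items()
        if count / total > threshold
    )


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    named = {"CompanyX": 24, "CompanyY": 18, "CompanyZ": 10}
    print(sponsors_over_threshold(named, long_tail_count=48))
```

With these made-up numbers no named sponsor passes 25 percent of the 100 patches, so the list comes back empty. An ecosystem where one name kept appearing in that list would be the lopsided case.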
It might seem unhealthy for companies to be clamoring for credit in open source projects, but it's a sign of health for the projects. If companies position themselves to be top contributors, and care about their standing, the projects win. Users win too. Developers in the ecosystem also win – since it's far easier to hire existing contributors than to push outsiders into a project.
Olson's observation that no company is doing "more than a quarter" of the innovation is important. It means that the share of work is well-distributed, and no one company can dominate the project. If, say, Yahoo dropped out of Hadoop altogether it might be bad for Hadoop, but not fatal. When companies aren't crowing about their contributions, or when one company really does dominate, it might be time to worry.