Yahoo is making a deeper commitment to open source software with the announcement this week that it has joined the Linux Foundation.
The Linux Foundation is dedicated to the advancement of Linux. For Yahoo, the decision to join is about its continued investment in Linux, virtualization and file system technologies.
Apache Hadoop, which includes a distributed file system (HDFS), falls into that last category, and Yahoo was one of its early users and funders. This is not to say that Hadoop will be the focus of the collaboration, but Hadoop is primarily designed for Linux environments.
We asked Yahoo which technologies the company plans to focus on with the Linux Foundation. The company was noncommittal and did not mention Hadoop at all, saying only that it would be working on virtualization and a variety of technologies.
But Hadoop is important to Yahoo, as it is to other companies that need distributed data storage.
Yahoo started using Hadoop in its labs. Today, it is used for a variety of purposes. Here’s a bit from a post I wrote last year:
Yahoo! started using Hadoop initially in 2006 as a science project to process and analyze massive data sets. They developed a prototype on 20 nodes. Today, Yahoo! manages more than 25,000 nodes for data processing and analytics.
Yahoo! found that product development could be done in a fraction of the time. They found they could just throw machines at a project to do the processing. What once took 29 days could be accomplished in less than one.
As a result, Yahoo! began integrating Hadoop for all parts of its business. The company offloaded storage from the IT department and put the data in a cluster.
Today, Yahoo uses Hadoop for determining best advertising placement and for content optimization. For example, the company started testing how the optimization worked on the home page by serving up content relevant to the user. It worked. Yahoo! saw a 150 percent increase in user engagement measurements.
For that story, Yahoo executives told me they will use Hadoop to continue working on data center latency, a problem that has emerged as the number of servers in a data center grows. The other issue that comes with scaling is the management of thousands of virtual machines, which helps explain Yahoo’s interest in virtualization.
In a video from the Linux Foundation Collaboration Summit, Sven Dummer, Yahoo Director of Linux Engineering, said that Yahoo’s infrastructure runs on Linux to serve the 640 million users who use the site every day. He said Yahoo has 11 billion page views per month, serves 200 petabytes of data per day and is one of the largest data center providers in the world.
Yahoo will contribute in several ways to the Linux Foundation. It will participate in the working groups and initiatives focused on virtualization, cloud computing and legal topics, such as open compliance.
We expect that Hadoop will be a part of its focus, too.