A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. “It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it,” writes Foundation director Lisa Green on the organization’s blog.
The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of “individuals, academic groups, small start-ups, big companies, governments and nonprofits.” It’s led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.
The Foundation explains the scope of the project as follows:
“Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.
“Luckily for us, Amazon’s EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud.”
The organization was formed three years ago but has only now started talking about itself publicly. It believes that free access to all this information could lead to “a new wave of innovation, education and research.”
Open web advocate James Walker agrees: “An openly accessible archive of the web – that’s not owned and controlled by Google – levels the playing field pretty significantly for research and innovation.”