New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl Foundation

A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. “It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it,” writes Foundation director Lisa Green on the organization’s blog.

The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of “individuals, academic groups, small start-ups, big companies, governments and nonprofits.” It’s led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.

The Foundation explains the scope of the project as follows:

“Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.

“Luckily for us, Amazon’s EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud.”

The organization was formed three years ago but has only now started talking about itself publicly. It believes that free access to all this information could lead to “a new wave of innovation, education and research.”

Open Web Advocate James Walker agrees: “An openly accessible archive of the web – that’s not owned and controlled by Google – levels the playing field pretty significantly for research and innovation.”
