New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl Foundation

A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. “It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it,” writes Foundation director Lisa Green on the organization’s blog.

The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of “individuals, academic groups, small start-ups, big companies, governments and nonprofits.” It’s led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.

The Foundation explains the scope of the project as follows:

“Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.

“Luckily for us, Amazon’s EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud.”

The organization was formed three years ago but has only now started talking about itself publicly. It believes that free access to all this information could lead to “a new wave of innovation, education and research.”

Open Web Advocate James Walker agrees: “An openly accessible archive of the web – that’s not owned and controlled by Google – levels the playing field pretty significantly for research and innovation.”
