Today, a story on Techmeme caught our eye. Titled “We Need a Wikipedia for data,” the article by ex-Googler Bret Taylor discussed how hard it is to find open data sets on the internet. Easier access to that kind of data could spur innovation, allowing programmers to build new applications the likes of which have never been seen before. What made the story interesting, besides the concept of a data wiki itself, was the insightful commentary around it, not just on the blog but all over the net, which led to the discovery of some pretty good data sources that are already available.
In his story, Bret mentioned some of the common data sources currently available, like the US Census Bureau’s map data and the Reuters corpus, but his commenters came up with a few more. (See? This is why blog comments matter.)
So what did everyone come up with? As it turns out, a lot of data sources are already freely available on the net, if you just know where to look. Here’s a summary. Do you have anything to add?
CKAN (Comprehensive Knowledge Archive Network)
The CKAN site is a registry of open knowledge packages and projects. Here, you can find open knowledge resources or register one of your own. What kind of stuff can you find at CKAN? They mention a set of Shakespeare’s works, a global population density database, the voting records of MPs, and 30 years of US patents as some examples, but they also point you to some useful URLs, like Flickr’s Creative Commons page, where photos can be searched by license type.
This project, infochimps.org, is attempting to assemble and interconnect the world’s best repository of raw data, like a giant, free, open almanac. The best way to describe it comes from MetaFilter, where the project was spotted recently: “Just as Wikipedia will help you find out something about everything, infochimps.org will help you find out everything about something.” What can you find there? Every Wikipedia infobox (each infobox type in its own table), 50 years of global hourly weather data, all the tables from the US Census Statistical Abstract, oh, and 100,000 official crossword words, too.
Not a data set in the traditional sense, but definitely a useful tool, OpenStreetMap is a free, editable map of the world where you can view, edit, and use your own geographical data. The project was started because most maps actually have legal or technical restrictions on their use.
A user-maintained community metadatabase which collects music “metadata” like artist name, release title, list of tracks, etc. You can browse through the site or you can use a client program to help identify music collections.
Dismissed by the blogosphere as a bad idea, if not downright evil, Jigsaw, the marketplace that pays you to give up other people’s contact info, now boasts 7 million complete contacts for the taking.
This site is a community effort to extract structured info from Wikipedia and make that data publicly available on the web, essentially turning Wikipedia into a database you can query. Is this the beginning of a semantic web? Check out their downloads section for the datasets and then scroll to the bottom for even more links to data sources on the web.
Freebase, an open, shared database of the world’s knowledge, received a lot of mentions in the comments, so this must be a good one. Community-built and maintained, it pulls from open data sources like Wikipedia, MusicBrainz, and the SEC archives to create structured information on many topics, including more popular ones like movies, music, people, and locations. Unlike some of the others in this list, the site is also easy to navigate and well-designed, which makes it that much better to use.
Perhaps one of the less interesting items due to its dry subject matter (financial data), this one is certainly worth a mention, because a free database of real-time and historical market data for trading systems and platforms is the kind of thing that really floats some people’s boats.
ThingISBN is LibraryThing’s first API, and even though its competitor became a paid service, ThingISBN is still free for non-commercial use. The API doesn’t just return the usual book data, but also does something called “edition disambiguation,” meaning it returns a list of “related” ISBNs: other editions, other media, and translations.
As the name suggests, Numbrary is a library for numbers. This free service helps you find, use, and share numbers from public record data sets, like census data or the CIA World Factbook.
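To make “edition disambiguation” a little more concrete, here is a rough Python sketch of what handling such a response might look like. Everything specific in it is an assumption for illustration: the XML layout (an idlist element wrapping one isbn element per related edition), the helper name parse_related_isbns, and the placeholder ISBNs, which are made up. Check LibraryThing’s own documentation for the real endpoint and format before relying on any of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response. The <idlist>/<isbn> layout is an assumption
# made for this sketch, and the ISBNs are placeholders, not real books.
SAMPLE_RESPONSE = """<idlist>
  <isbn>0000000001</isbn>
  <isbn>0000000002</isbn>
  <isbn>0000000003</isbn>
</idlist>"""


def parse_related_isbns(xml_text):
    """Return the ISBN strings found in an XML list of related editions."""
    root = ET.fromstring(xml_text)
    # Collect the text of every <isbn> element, skipping empty ones.
    return [el.text.strip() for el in root.iter("isbn") if el.text]


if __name__ == "__main__":
    for isbn in parse_related_isbns(SAMPLE_RESPONSE):
        print(isbn)
```

The sketch parses a canned string rather than calling the live service, so it runs offline; in a real application you would fetch the response for a given ISBN first and then hand the body to the parser.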
The Data Wrangling blog
This blog post lists a bunch, and I mean a bunch, of open datasets on the web, which just goes to show how cursory my own list really is.