Tags as Far as the Eye Can See: New York Times to Publish Index as Linked Data

Today, at the Semantic Technology Conference, Rob Larson and Evan Sandhaus of the New York Times jointly announced that the Times will soon publish its copious index as Linked Data.

The Times’ data will join content from Project Gutenberg, a vast online library of text from public-domain books, data from the U.S. census, and information from many other formative and vital entities in the semantic web space. Larson and his team intend to make available hundreds of thousands of tags for content dating back to 1851. This will give developers an invaluable, automatically navigable roadmap for the publication’s vast directory of knowledge and will link that data to existing pages, people, and content around the web.

In his keynote address, Larson emphasized “How deeply we [at the Times] care about metadata.”

“It’s been fundamental to what we do for a long time. We feel we’re good at it, but our content is an island… we want to announce our intention to publish our thesaurus to the community under a license that will allow you to use it and contribute your improvements… The results of this effort will in time take the shape of the Times entering this Linked Data cloud. This is wholly consistent with our open strategy… to facilitate access to slices of our data for those who want to include it in their applications.”

Larson likened the Times corpus to a quarry of data. He said that the newspaper’s API provided the picks and shovels to mine data, and the Linked Data initiative would be the map.

The timing, licensing, format, and other factors of the project are yet to be determined.

This announcement comes on the heels of CNET’s partnership with Reuters to publish data to the Linked Data cloud. Moreover, exactly one month ago, we wrote that Linked Data was a concept “whose time has come” and gave a thorough overview of the concepts and standards it entails, for curious readers who would like to drill deeper on the subject.

In another recent interview, Sandhaus detailed the tagging process for the Times’ corpus, both for print and online articles:

“There are two types of tagging that go on at the Times… Every day, indexers take the paper and go article by article and associate each article with subject keywords. Then they manually summarize it. It’s like a Google list, but in dead tree form.

“Another type of tagging we do is… when an article goes from the newsroom to the web, it’s put there by a producer who will augment the article with any number of rich features like images, multimedia… and subject keywords. Unlike the indexers, who do this completely by hand, the producers are assisted in their tagging by an automated classification system which suggests tags to be applied to the data and which are ultimately approved by the producer.”

An official announcement is expected on the Times’ Open blog tomorrow, with details on the project to follow.