Progress on HTML5 Microdata Could Revolutionize Web Queries

The original, grand scheme for the Web was that information would be served up on multiple sites that would all link to one another. The world would be one big encyclopedia. As it turned out, informational and educational sites have become massive repositories of articles and markup, some of which compete with one another; and news sites such as this one have become separate, self-standing feats of ingenuity.

The problem is that it’s not much of a Web anymore, or at least less of one than was originally intended. This is a problem that HTML5’s architects are hoping to solve, by means of new systems that would enable services from the outside (like search engines) to make more and better sense of the data being maintained by servers on the inside.

The latest progress on this front was made this week with a new round of improvements, some posted just yesterday, to the WHAT Working Group’s “living standard” specifications for microdata. Think of it as markup for your markup: a way to specify the structure and attributes of long tables of data, or even long news sites full of articles, within the markup itself without having to force servers to implement third-party parsers.

In a developers’ post earlier this week, Opera Software’s Chris Mills describes microdata:

Microdata tries to improve on what we’ve already had in the past: providing a built-in mechanism that is as easy to grasp as microformats, but also allows data processing without needing to build your own parser. And you can of course build your own microdata processing functionality for non-supporting browsers using JavaScript, if needs be.

Already, HTML5 improves the structure of articles by enabling publishers to define separate characteristics of the components of articles, not just for CSS but to help browsers better organize pages. That could change how future browsers implement tabs (imagine automatic groupings and subgroupings). But microdata goes several steps further. It enables sites to explain the context of data using terms that browsers may be able to understand. Imagine using Google to search for dinner recipes, and having recipe cards delivered from several sites inside Google, fully and properly formatted in a nice, browsable Rolodex-like gadget.
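To make that concrete, here is a minimal sketch of what microdata-annotated markup and a do-it-yourself processor might look like. The `itemscope`, `itemtype`, and `itemprop` attributes come from the microdata specification; everything else is an assumption made for illustration, including the vocabulary URL, the property names, and the `FlatMicrodataParser` and `extract_items` helpers. It handles only the simplest case (text-only properties, no nested items) and is not a conforming implementation of the spec's extraction algorithm.

```python
from html.parser import HTMLParser

# Hypothetical recipe markup. The vocabulary URL and the property names
# ("name", "yield", "ingredient") are illustrative assumptions.
RECIPE_HTML = """
<div itemscope itemtype="http://example.org/vocab/Recipe">
  <h1 itemprop="name">Weeknight Chili</h1>
  <p>Serves <span itemprop="yield">4</span></p>
  <ul>
    <li itemprop="ingredient">1 lb ground beef</li>
    <li itemprop="ingredient">2 cups kidney beans</li>
  </ul>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collects itemprop text values for each itemscope element.

    Simplifications (assumptions of this sketch): property elements
    contain only text, items are not nested, and an item is treated as
    running until the next itemscope appears.
    """

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per itemscope element
        self._prop = None      # name of the property being read, if any
        self._prop_tag = None  # tag that opened the current property
        self._text = []        # text fragments of the current property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype"), "properties": {}})
        elif "itemprop" in attrs and self.items and self._prop is None:
            self._prop = attrs["itemprop"]
            self._prop_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None and tag == self._prop_tag:
            value = "".join(self._text).strip()
            props = self.items[-1]["properties"]
            props.setdefault(self._prop, []).append(value)
            self._prop = None


def extract_items(html):
    """Return a list of {"type": ..., "properties": {...}} dicts."""
    parser = FlatMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for item in extract_items(RECIPE_HTML):
        print(item["type"])
        for name, values in item["properties"].items():
            print(f"  {name}: {values}")
```

The point of the sketch is that the meaning travels with the page: a consumer does not need to know anything about this site's layout, only which attribute names to look for.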

In order for sites like Google to do this automatically, however, the publishers of things like recipes would need to agree upon some kind of common vocabulary: a single list of names for the parts of a recipe. And here’s where things get tricky, because you can’t invoke “Google” and “imagine” in the same sentence without some caveats. Two months ago, Google put its weight behind a competing standard called RDFa, which aims to do much the same thing. In concert with Microsoft’s Bing and with Yahoo, the search leader agreed to promote a single Web site, schema.org, as a one-stop shop where search engines and other resources could determine the best single structures for shared data.
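Why the shared vocabulary matters can be shown with a toy example. In the sketch below, the property names and both sites' data are hypothetical, as is the `understood_properties` helper; the idea is simply that a consumer keyed to an agreed list of names can read one publisher's recipe and gets nothing from another's, even though both describe the same dish.

```python
# Hypothetical agreed-upon property names for a recipe vocabulary.
SHARED_VOCABULARY = {"name", "ingredient"}

# Site A labels its data with the agreed names; site B invents its own.
site_a = {"name": ["Weeknight Chili"], "ingredient": ["ground beef", "kidney beans"]}
site_b = {"title": ["Quick Chili"], "ingredients-list": ["beef", "beans"]}


def understood_properties(item, vocabulary=SHARED_VOCABULARY):
    """Keep only the properties whose names appear in the shared vocabulary."""
    return {name: values for name, values in item.items() if name in vocabulary}


if __name__ == "__main__":
    print(understood_properties(site_a))  # every property is recognized
    print(understood_properties(site_b))  # nothing is recognized
```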

So the question becomes: which one is part of HTML5, microdata or RDFa? The W3C has already published specifications for embedding RDFa in HTML, but both W3C and WHATWG are working (not jointly, mind you) on microdata. Still, Web developers are treating RDFa as an HTML5 component, or at least as something playing in the HTML5 playground, if only because Google supports it.

Once again, a critical question around HTML5 will come down to a decision on whom to trust, the architects or the implementers. If schema.org takes off, then developers will assume that RDFa is, essentially, the de facto data specification component of HTML5, regardless of what the working groups say. That little victory could give Google more leverage on other aspects of HTML5 leadership, such as implementation of the VP8 video codec versus add-on support for H.264.

UPDATE: Incorporating some information from Manu Sporny, the RDFa chair for W3C, I should make some clarifications: Google’s strong support for RDFa was made official back in May 2009, when it announced support for a feature it called Rich Snippets. That tool helps search engines like Google better determine the relevancy of Web pages, and relies on RDFa.

However, as Sporny points out, despite Google’s history with RDFa, the format that the schema.org team has settled on is actually microdata, as the site itself explains:

Your web pages have an underlying meaning that people understand when they read the web pages. But search engines have a limited understanding of what is being discussed on those pages. By adding additional tags to the HTML of your web pages–tags that say, “Hey search engine, this information describes this specific movie, or place, or person, or video”–you can help search engines and other applications better understand your content and display it in a useful, relevant way. Microdata is a set of tags, introduced with HTML5, that allows you to do this.