We have written a lot here about the vision of building a structured layer on
top of the current web. Annotating billions of HTML documents in a bottom-up way or building top-down tools that can automagically
interpret the existing information are the two approaches that we discussed. Together these approaches would result in a global
database which will make the web even more connected.
The ability to correlate content and concepts across web sites would reduce the time necessary for searching and would enable the discovery of related information.
In previous posts we discussed the difficulties with the bottom-up approach
to the Semantic Web – a sophisticated form of annotating information using tools like RDF and OWL. Among the factors that impair the web-wide adoption of
these tools are their complexity and the lack of clear end-user benefits.
On the other hand, the top-down approach
that we discussed does not place any burden on content owners and delivers instant benefits to end users. Yet, the top-down tools run
into a difficulty – interpreting raw information is not that simple. Typical solutions focus on a vertical, but
still suffer from imperfections.
What if there was some minimal annotation in the content to help top-down tools interpret it?
In this post we look at how content owners can implement simple annotation strategies which can help the top-down
tools and search engines to make the web more structured.
Annotation Basics – Headers
It is striking how many sites today do not use meta tags in the head of the document to provide the bare minimum information about a page’s content.
Forget building a smarter web; this is just plain bad SEO practice. The work that goes into generating great content can be undermined by the lack of a succinct, meaningful description
of that content. Every page on the web should have the following information filled in:
- title – a sentence briefly describing the site/page
- description – a paragraph about the site/page
- keywords – a list of keywords that describe the site/page
Note that it makes sense to provide different information for the root page and subsequent pages. For example, for a newspaper
or a blog, the root page should provide information about the site at large, while individual article and post pages should contain
information about that specific page, not the overall site.
The New York Times’ web site provides a good example of how to properly use meta tags. For example, this article on Slowdown in US Growth
includes the following meta data:
- title – U.S. Growth Slowed Drastically in 4th Quarter
- description – The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.
- keywords – United States Economy,Gross Domestic Product
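As a rough sketch, the three basic fields could sit in the head of a page like this. The values are copied from the Times example above; the exact markup on the Times site may well differ.

```html
<head>
  <!-- title: a short sentence describing this specific page -->
  <title>U.S. Growth Slowed Drastically in 4th Quarter</title>
  <!-- description: a paragraph-length summary of the page -->
  <meta name="description" content="The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.">
  <!-- keywords: a comma-separated list -->
  <meta name="keywords" content="United States Economy,Gross Domestic Product">
</head>
```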
The New York Times is actually a great example of taking the basics of annotation and building on top of them. Each page includes an extended set of rich meta data, including the author of the article, the date it was published, a thumbnail image URL, the creator,
the category and even ticker symbols for public companies that are mentioned in the article. Certainly, the New York Times provides a really great set of information,
perhaps even wider than needed for most content, but let's focus on the fields that should be used on a wider scale.
author: Web content is produced by people and for people. With the rise of social culture we are increasingly interested
in finding bits of everyone’s identity around the web. If something piqued your interest enough for you to blog or to write an article, the least you can do is
put your name on it. Having people attached to content would allow seamless navigation from one to the other. There is already a standard
meta tag for this, with a suggestive name: author.
thumbnail: We love pictures. Since the launch of Flickr we can’t live without them. Facebook’s success owes a lot
to photo sharing. With bandwidth becoming cheap, we are becoming increasingly visual. We do not want text, we want pictures, so if a news
article or blog post contains an image, it is simple to do what the Times did – generate a meta tag for it. There is no standard meta tag for this
as far as I know, but any of these names would do: thumb, image, picture, thumbnail, etc.
date: As we are becoming a real-time culture, the freshness of content becomes paramount. Tagging the page with a date is an important way of helping to classify the page in time.
Most blog posts and articles contain dates anyway, and having a standard date header would make it simple and obvious.
location: Location is becoming increasingly important as well. With GPS and widely available Internet access we are able to
easily let people know where we are and to take advantage of local services. If an article or a post is related to a specific location,
there is a conventional way of annotating it. The technical term for annotating content with location information is geotagging.
It generally means attaching a pair of latitude and longitude coordinates, which
is described in detail by the Geo microformat specification. A more relaxed form would be specifying the country/region/city. While specifying exact
position coordinates may be difficult, even something as simple as the geo header New York, NY would be very helpful.
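Put together, the four extra fields might look something like the sketch below. Only author is a standard meta name. The thumbnail and date names are assumptions picked from the options mentioned above, the geo.* names follow a widely used geotagging convention rather than a formal HTML standard, and the person, URL, date and coordinates are placeholders.

```html
<head>
  <!-- author: a standard meta name; "Jane Doe" is a placeholder -->
  <meta name="author" content="Jane Doe">
  <!-- thumbnail: no standard name exists; "thumbnail" is an assumption -->
  <meta name="thumbnail" content="http://example.com/images/story-thumb.jpg">
  <!-- date: a convention, not a formal standard; placeholder value in ISO 8601 form -->
  <meta name="date" content="2008-01-01">
  <!-- location: a relaxed, human-readable place name -->
  <meta name="geo.placename" content="New York, NY">
  <!-- location: optional exact coordinates as "latitude;longitude" (placeholder values) -->
  <meta name="geo.position" content="40.7128;-74.0060">
</head>
```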
Tags in Blog Posts
The concept of tagging, which was popularized by services like del.icio.us and Flickr, is now
commonly understood and is ubiquitous. The idea of humans tagging content to categorize it and later to find it is a simple, yet
important bit of the web infrastructure. Most major blogging platforms support tags. The tags are standardized based on the
rel-tag microformat. You can see the implementation on ReadWriteWeb – each
post is tagged with a set of tags.
For example, one of our recent posts contains this tag:
<a href="http://www.readwriteweb.com/tag/twitter" rel="tag">twitter</a>
The tag has several benefits:
- Readers can instantly click to find other posts with this tag
- Search engines can better classify the content
- Semantic tools can offer additional services such as finding related content, pictures, and video
Tags are similar in principle to keywords, but provide more flexibility because they are inside the post and can have richer
content. In principle, it would be possible to add more information to the keywords meta tag in the head of the document, but it has existed in its current form for many years and is thus unlikely to change. In any case, all modern blogging platforms make it
trivial to tag content, so there should be no excuses.
Standardizing Blog Templates Across Platforms
In the nineties people created web sites. These days only companies have web sites; individuals
have blogs and social network profiles. There is a great opportunity to standardize and structure the information because blogs
and profiles are based on templates. Consider the common structure of a blog: one or a few sidebars and a central area for
the content. In the content area, on a post page, there is a post body, date, author and tags – a minimum set of elements.
Why not standardize on a few things here?
- <div class="post"> – a container for the post body
- <div class="sidebar"> – a container for the sidebar
- <div class="author"> – a container for the author
- <div class="date"> – a container for the date
- <div class="tags"> – a container for the tags
- <div class="comments"> – a container for the comments
Platforms already do have very similar things in place and standardizing between them is rather simple.
In no way would this be a competitive advantage or disadvantage to them, but it would be a big help towards making
the web more structured. Extending on these basics, it would also be helpful if widgets were
wrapped into standard enclosures. A simple widget tag can go a long way toward distinguishing widgets from
the other content in the sidebar.
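To make this concrete, a post page built on these conventions might be laid out roughly as follows. This is only a sketch: the nesting is one possible arrangement, the widget class name is an assumption based on the suggestion above, and the names, dates and URLs are placeholders.

```html
<!-- Central content area of a post page -->
<div class="author">Jane Doe</div>
<div class="date">2008-01-01</div>
<div class="post">
  <p>Post body goes here.</p>
</div>
<div class="tags">
  <a href="http://example.com/tag/twitter" rel="tag">twitter</a>
</div>
<div class="comments">
  <p>Reader comments go here.</p>
</div>

<!-- Sidebar, with each widget wrapped in its own enclosure -->
<div class="sidebar">
  <div class="widget">Recent posts</div>
  <div class="widget">Blogroll</div>
</div>
```

With markup like this, a crawler would only need to look for a handful of known class names to separate the post itself from the surrounding page furniture, regardless of the platform that generated it.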
If blogging platforms standardized on these basic conventions, major newspapers would likely follow as well.
The situation with social network profiles is different, as the information contained in them is not public.
In addition, there is a competitive advantage to Facebook in having its own proprietary structure. However,
entities like the DataPortability group have been created
precisely to deal with this problem and Facebook just joined. So we may yet see some progress on that front.
Beyond Basics – Microformats
The annotations that we discussed up to now are very basic and would
require a minimum amount of work from newspapers, bloggers, and blogging platforms to deploy. Their advantage is that they
are simple to implement but would deliver big bang for the buck. Yet, these are primitive ways to annotate content.
The next step is to use bottom-up technologies like microformats, which offer
a way to embed objects into HTML documents in a compact way.
Microformats have been around for a few years and have certainly caught the attention of some. Several major services are using microformats.
For example, Flickr is using the geo microformat and headers to geotag photos. Eventful uses the hCalendar format to describe meta data for each event.
Blogger pages contain hCards for each blogger. But the problem is that there needs to be more and better integration of
microformats into the blogging platforms. For example, coming back to the Blogger hCard, right now most of them are not useful
because Blogger does not require people to fill in information and just generates the card based on the login. This does more harm than good,
as semantic tools cannot take advantage of such cards and they do not look good to people either.
Similarly, there is not much support for geotagging photos and event microformats in the platforms. But even beyond the lack of support, the limitation of the current
microformat specs is that they do not cover the basic range of things that people discuss on the web – books, music,
movies, recipes, and restaurants are all noticeably absent (the existing hReview microformat does not have a way to express the
type of the object or the attributes).
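For readers who have not seen them, the three microformats mentioned above look roughly like this when filled in properly. The class names (vcard, fn, url, adr, locality, region, geo, latitude, longitude, vevent, summary, dtstart, location) come from the microformat specs; the person, event, URL and coordinates are made-up placeholders.

```html
<!-- hCard: a person, with a name, a URL and a city -->
<div class="vcard">
  <a class="fn url" href="http://example.com/">Jane Doe</a>
  <span class="adr">
    <span class="locality">New York</span>,
    <abbr class="region" title="New York">NY</abbr>
  </span>
</div>

<!-- geo: a latitude/longitude pair -->
<span class="geo">
  <span class="latitude">40.7128</span>,
  <span class="longitude">-74.0060</span>
</span>

<!-- hCalendar: an event with a summary, a start date and a place -->
<div class="vevent">
  <span class="summary">Example Meetup</span>
  <abbr class="dtstart" title="2008-03-01">March 1, 2008</abbr>
  <span class="location">New York, NY</span>
</div>
```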
But it does look like with a bit of a push on both the community behind the microformat specs and blogging platforms
we could see microformats becoming a major way of annotating information inside blog posts. This would be a welcome
development and would allow a large subset of the web – the blogosphere – to become quite structured.
The vision of the structured web is big and compelling and at the same time is hard to attain.
At times, it is difficult to see how we can ever get there. But on some days we think that even if
the web could be just a tiny bit more structured it would become so much more connected. And so in this
post we considered a set of very basic bottom-up techniques that newspapers, bloggers, and blogging platforms
can put in place to make the web more structured.
Putting meta information into page headers is easy and should be a must-do thing for everyone.
Beyond that, providing information such as author, date, and location makes data that much more valuable.
And if blogging platforms could also standardize on the key elements of the pages, crawlers and
intelligent browsing tools could do a better job making sense of the content. Beyond that, microformats
are the front runner in annotating the web with meta information about things, but they still
need more pushing and effort.
What do you think about these basic structures? Are you going to fix up your blog after reading this post?
What other things should we push to standardize on?