By Nitin Karandikar

Much has been written recently about the concepts, approaches and applications of the Semantic Web. But there's something missing. In terms of understanding, finding and displaying content, there is no doubt that the Semantic Web is slowly becoming real (e.g. there were some great demos at a recent SDForum meet). However, there is a gap emerging with Content Authoring tools, which have not yet made this paradigm shift.

On the one hand, most authors are comfortable with, and proficient in, desktop authoring tools such as Microsoft Word, FrontPage, Adobe GoLive and others. This is especially true for professionals and other experts who create technical reference content for web applications, such as legal references, accounting manuals or engineering documents. The current crop of authoring tools produces visually high-quality articles and web pages, but their XML creation capabilities are severely limited.

On the other hand, parsing Word documents or HTML web pages to extract meaningful XML out of them gives poor results; much of the semantic knowledge of the content is lost. There do not appear to be any popular tools that create Semantic content natively and yet are natural and easy for a content author to use.

Top-Down? Or Bottom-Up?

Of course, there are ways to get around this issue to some extent. Letting authors or readers add tags to articles or posts provides a measure of classification, but it does not capture the true semantic essence of the document. Automated Semantic Parsing (especially within a given domain) is on the way - a la Spock, Twine and Powerset - but it is currently limited in scope and needs a lot of computing power. In addition, if we could put the proper tools in the authors' hands in the first place, extracting the semantic meaning would be so much easier.

For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records - say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries.

Ideally, the authors would create the content as meaningful XML text, so that parsing the semantics would be much easier. A side benefit is that this content can then be easily published in a variety of ways and there would be SEO benefits as well, if search engines could understand it more easily. But tools that create such XML, and yet are natural and easy for authors to use, don't appear to be on their way; and the creation of a custom tool for each individual domain seems a difficult and expensive proposition.


Image: andrea.paiola

Car Review Example

As a more concrete example: imagine that you control a web site called New-Car-Reviews.com, a hypothetical site that reviews new cars; you pay expert authors to write reviews of new car models every year for this site. Unlike other automobile characteristics, reviews cannot be easily stored in a database and queried. Conceptually, your reviews are similar to this review for the 2008 Volvo S40 2.4i sedan on the automotive site Kelley Blue Book.

Now suppose that, when your authors originally compose this review, instead of writing the content as

    <span id="ctl00">You'll Like This Car If...</span>
        ...description_positive...
    <span id="ctl00">You May Not Like This Car If...</span>
       ...description_negative...

they could create it as

    <advantages><label>You'll Like This Car If...</label>
        <text>...description_positive...</text>
    </advantages>
    <disadvantages><label>You May Not Like This Car If...</label>
        <text>...description_negative...</text>
    </disadvantages>

In other words, you get more value out of the same exact content:

  (a) You can easily re-purpose the content in additional ways, such as for mobile devices, RSS feeds, web services APIs, mashups and so on;
  (b) As search engines start to take advantage of semantic notation, you get SEO benefits;
  (c) You can provide users with ways to query the content intelligently ("show me cars which are family-friendly AND don't roll over easily vs those that work better off-road AND seat 7"), using the recently released SPARQL.
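
To make points (a) and (c) a little more concrete, here is a minimal Python sketch that assumes the review is stored in the semantic form shown earlier. The `<review>` wrapper element, its `model` attribute and all function names are illustrative assumptions, not part of any existing tool. The keyword filter at the end is only a naive stand-in for a real query language such as SPARQL.

```python
import xml.etree.ElementTree as ET

# Hypothetical review markup modeled on the <advantages>/<disadvantages>
# example; the <review> wrapper and "model" attribute are assumptions.
REVIEW_XML = """
<review model="2008 Volvo S40 2.4i">
    <advantages><label>You'll Like This Car If...</label>
        <text>...description_positive...</text>
    </advantages>
    <disadvantages><label>You May Not Like This Car If...</label>
        <text>...description_negative...</text>
    </disadvantages>
</review>
"""

SECTIONS = ("advantages", "disadvantages")


def parse_review(xml_text):
    """Return the review as a plain dict keyed by semantic section."""
    root = ET.fromstring(xml_text)
    review = {"model": root.get("model")}
    for section in SECTIONS:
        node = root.find(section)
        review[section] = {
            "label": node.findtext("label").strip(),
            "text": node.findtext("text").strip(),
        }
    return review


def to_mobile_text(review):
    """Render a compact plain-text view, e.g. for a small screen."""
    lines = [review["model"]]
    for section in SECTIONS:
        lines.append(f"{review[section]['label']} {review[section]['text']}")
    return "\n".join(lines)


def to_feed_item(review):
    """Render the same content as a minimal RSS-style <item> element."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = review["model"]
    ET.SubElement(item, "description").text = review["advantages"]["text"]
    return ET.tostring(item, encoding="unicode")


def reviews_mentioning(reviews, section, phrase):
    """Naive keyword filter over one section; not a substitute for SPARQL."""
    phrase = phrase.lower()
    return [r["model"] for r in reviews if phrase in r[section]["text"].lower()]


if __name__ == "__main__":
    review = parse_review(REVIEW_XML)
    print(to_mobile_text(review))
    print(to_feed_item(review))
```

The point of the sketch is that no heuristics are needed to find the advantages or disadvantages: the element names carry the meaning, so each new output format is a few lines of code over the same source.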

As a content publisher, you want your content to be found and used as much as possible, and making it meaning-enabled is a big step in this direction. At the same time, you cannot ask authors to use a pure XML tool such as XMLSpy; and MS Word creates unreadable XML that specifies formatting rather than semantics.

A solution for this specific example already exists: Microformats could be applied to handle the problem of annotating the advantages and disadvantages. While the Microformat solution works very well for specific types of information - such as for describing people and addresses - it is too limited to be applicable in a general way to add semantic information to web content at large.
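
As a rough illustration of the Microformat-style approach, the Python sketch below assumes the review stays as ordinary HTML and that meaning is carried by `class` attributes. The class names `pro` and `con`, the sample markup and the `ClassTextExtractor` helper are hypothetical and are not taken from any published Microformat specification.

```python
from html.parser import HTMLParser

# Hypothetical class-annotated HTML; "pro" and "con" are assumed names.
ANNOTATED_HTML = """
<div class="review">
  <span class="label">You'll Like This Car If...</span>
  <p class="pro">...description_positive...</p>
  <span class="label">You May Not Like This Car If...</span>
  <p class="con">...description_negative...</p>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements carrying one of the wanted class names.

    Simplification: an annotated element must not contain a nested
    element with the same tag name.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}        # class name -> list of captured texts
        self._current = None   # class currently being captured, if any
        self._tag = None       # tag that opened the current capture

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current, self._tag = name, tag
                self.found.setdefault(name, []).append("")
                break

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._current = self._tag = None

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current][-1] += data


if __name__ == "__main__":
    parser = ClassTextExtractor(["pro", "con"])
    parser.feed(ANNOTATED_HTML)
    print(parser.found)
```

This works only because the publisher and the consumer agree on the class names in advance, which is the limitation noted above: each new kind of content needs its own agreed vocabulary.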

So the general problem must be solved if we are to see large-scale adoption of the Semantic Web. It would be a boon to expert authors everywhere, including those who create news articles for the newspaper publishing industry. But there do not seem to be any solutions on the horizon, in terms of technologies, tools or processes to promote the creation of more meaning-rich content.

Reactions: But is there a Business Case?

When I put this question to a group of prominent bloggers and industry thought-leaders in the Semantic Web space, the results were not encouraging. There does not seem to be much interest in building Semantic authoring tools. The main stumbling block is the lack of a clear business model for publishers to embrace this approach.

Jeremy Liew of Lightspeed Venture Partners has recently penned a series of articles on the Semantic Web under the heading Meaning = Data + Structure, covering user-generated structure, domain knowledge and user behavior. The series focuses on the problem of inferring meaning from content.

He questions the business rationale for authors to take the effort to add XML markup to their content, and points to domain-specific extraction approaches as the more likely solution:

"The challenge with getting most authors to markup in XML is not just one of tools, but also of motivation IMO. Unless and until a clear business case advantage justifies the additional effort required, and that advantage is greater than other projects offer, you won't see much semantic markup except from academics and others whose interests are more philosophically driven than business driven.

That is why I think the domain specific extraction approaches will likely be more prevalent - the business advantage of better search and structure accrues to the person doing the extraction, and because it is domain specific, the additional effort is lessened."

He's right, of course; domain-specific extraction approaches are definitely going to be popular, and are beginning to take off already. They provide significant added value for the extractor. However, extraction is difficult and expensive to do well, so the business case is somewhat dubious for the early adopters.

ReadWriteWeb's own Alex Iskold is another thought leader in this space. He has a series of fantastic articles about the Semantic Web, including the problem of annotating data, the different approaches used, and a primer for the structured web.

His comments echoed those of Liew:

"There seems to be little incentive for publishers to annotate information.

The problem is that if you go deep enough you hit RDF. The light version is Microformats. But the issue is not the format, it's the incentive."

Tim O'Reilly wrote about this issue almost a year ago in Different Approaches to the Semantic Web, where he expresses the same sentiment:

"It seems easy enough, but why hasn't this approach taken off? Because there's no immediate benefit to the user. He or she has to be committed to the goal of building hidden structure into the data. It's an extra task, undertaken for the benefit of others. And as I've written before, one of the secrets of success in Web 2.0 is to harness self-interest, not volunteerism, in a natural 'architecture of participation.'"

Conclusion

I guess I'm a minority of one. In my view, if content creators could add semantic meaning while constructing the content in the first place (which is, conceptually, only marginally more difficult for the authors), then the value of the content would increase exponentially at very low cost. That seems like a defensible business case for content publishers.

The business case for publishers to annotate existing web pages and content is certainly very weak. But for new content, if you're creating it for your site anyway, why wouldn't you add semantic markup to make it more findable and usable?

What do you think? Please leave a comment below.

Top image credit: nennett