Semantic Web Patterns: A Guide to Semantic Technologies

In this article, we’ll analyze the trends and technologies that power the Semantic Web. We’ll identify patterns that are beginning to emerge, classify the different trends, and peek into what the future holds.

In a recent interview Tim Berners-Lee pointed out that the infrastructure to power the Semantic Web is already here.
ReadWriteWeb’s founder, Richard MacManus, even picked it to be the number one trend in 2008. And rightly so. Not only are the bits of infrastructure now in
place, but we are also seeing startups and larger corporations working hard to deliver end user value on top of this sophisticated set of technologies.

Editor’s note: Looking back over 2008, there were some posts on ReadWriteWeb that did not get the attention we felt they deserved – whether because of timing, competing news stories, etc. So in this end-of-year series, called Redux, we’re resurrecting some of those hidden gems. This is one of them; we hope you enjoy (re)reading it!

The Semantic Web means many things to different people, because there are a lot of pieces to it.
To some, the Semantic Web is the web of data, where information is represented in RDF and OWL. Some people replace RDF with microformats. Others think that the Semantic Web is about web services, while for many it is about artificial intelligence –
computer programs solving complex optimization problems that are out of our reach. And business people always redefine the problem in terms
of end user value, saying that whatever it is, it needs to have simple and tangible applications for consumers and enterprises.

The disagreement is not accidental, because the technology and concepts
are broad. Much is possible and much is to be imagined.

1. Bottom-Up and Top-Down

We have written a lot about the different approaches to the Semantic Web –
the classic bottom-up approach and the new top-down one. The bottom-up approach
is focused on annotating information in pages, using RDF, so that
it is machine readable. The top-down approach is focused on leveraging information
in existing web pages, as is, to derive meaning automatically. Both approaches are
making good progress.

A big win for the bottom-up approach was the recent announcement from Yahoo!
that their search engine is going to support RDF and microformats. This is a win-win-win
for publishers, for Yahoo!, and for customers – publishers now have an incentive to
annotate information because Yahoo! Search will be taking advantage of it, and users
will then see better, more precise results.

Another recent win for the bottom-up approach was the announcement of the Semantify web service
from Dapper. This offering will enable publishers to add semantic annotations to
existing web pages. The more tools like Semantify that pop up, the easier it will be for publishers
to annotate pages. Automatic annotation tools combined with the incentive to annotate
the pages is going to make the bottom-up approach more compelling.

But even if the tools and incentive exist, making the bottom-up approach widespread
is difficult. Today, the magic of Google is that it can understand information as is, without asking
people to fully comply with W3C standards or SEO techniques. Similarly, top-down semantic
tools are focused on dealing with imperfections in existing information. Among them are the natural
language processing tools that do entity extraction – such as the Calais and TextWise APIs that recognize people, companies,
places, etc. in documents; vertical search engines, like ZoomInfo and Spock, which mine the web for people;
technologies like Dapper and BlueOrganizer, which recognize objects in web pages; and Yahoo! Shortcuts,
Snap and SmartLinks, which recognize objects in text and links.

[Disclosure: Alex Iskold is founder and CEO of AdaptiveBlue, which makes BlueOrganizer and SmartLinks.]

Top-down technologies are racing forward despite imperfect information. And,
of course, they benefit from the bottom-up annotations as well. The more annotations there are,
the more precise top-down technologies will get – because they will be able to take
advantage of structured information as well.

2. Annotation Technologies: RDF, Microformats, and Meta Headers

Within the bottom-up approach to annotation of data, there are several
choices for annotation. They are not equally powerful, and in fact each approach is a trade off
between simplicity and completeness. The most comprehensive approach is
RDF – a powerful, graph-based language for declaring things, and attributes
and relationships between things. In a simplistic way, one can think of RDF
as the language that allows expressing truths like: Alex IS human (type expression),
Alex HAS a brain (attribute expression), and Alex IS the father of Alice, Lilly, and Sofia (relationship expression).
RDF is powerful, but because it is highly recursive, precise, and mathematically sound, it is also complex.
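To make the three kinds of expressions concrete, here is a minimal sketch that models them as plain subject-predicate-object tuples. It is only an illustration of the underlying idea, not real RDF syntax; the predicate names (is_a, has, father_of) and the objects() helper are our own assumptions.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# The predicate names are illustrative and do not come from a real vocabulary.
triples = {
    ("Alex", "is_a", "Human"),        # type expression
    ("Alex", "has", "Brain"),         # attribute expression
    ("Alex", "father_of", "Alice"),   # relationship expressions
    ("Alex", "father_of", "Lilly"),
    ("Alex", "father_of", "Sofia"),
}

def objects(subject, predicate):
    """Return, in sorted order, every object linked to subject by predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

print(objects("Alex", "father_of"))  # ['Alice', 'Lilly', 'Sofia']
```

Every fact has the same three-part shape, which is what lets a program treat types, attributes, and relationships uniformly.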

At present, most use of RDF is for interoperability. For example, the medical community uses
RDF to describe genomic databases. Because the information is normalized, the databases that
were previously silos can now be queried together and correlated. In general, in addition to
semantic soundness, the major benefit of RDF is interoperability and standardization, particularly
for enterprises, as we will discuss below.
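A rough sketch of why a shared data model helps: if two previously separate databases describe their facts in the same triple shape and use the same identifiers, combining them is trivial, and a single query can cross both. The identifiers and predicate names below are made-up placeholders, and the chromosomes_for() helper is hypothetical.

```python
# Two formerly separate "databases", both expressed as
# (subject, predicate, object) triples. All identifiers are placeholders.
gene_db = {
    ("gene:G1", "located_on", "chromosome:C7"),
    ("gene:G2", "located_on", "chromosome:C9"),
}
disease_db = {
    ("disease:D1", "associated_with", "gene:G1"),
}

# Because both sources share one data model and the same identifiers,
# combining them is a plain set union.
merged = gene_db | disease_db

def chromosomes_for(disease):
    """Follow disease -> gene -> chromosome across the merged data."""
    genes = {o for s, p, o in merged if s == disease and p == "associated_with"}
    return sorted(o for s, p, o in merged if s in genes and p == "located_on")

print(chromosomes_for("disease:D1"))  # ['chromosome:C7']
```

Neither source could answer the question alone; the answer only appears once the two are queried together.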

Microformats offer a simpler approach by adding semantics to existing HTML
documents using specific class names and attributes. The metadata is compact and is embedded inside
the actual HTML. Popular microformats are hCard, which describes personal and company
contact information, hReview, which adds meta information to review pages, and hCalendar,
which is used to describe events.
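As a sketch of how a program might read such annotations, the following uses Python's standard html.parser module to pull two hCard properties out of a small sample. The sample markup was written for this example, and the HCardParser class is our own simplified helper; it ignores nesting and most of the hCard specification.

```python
from html.parser import HTMLParser

# Sample markup written for this example; "vcard", "fn", and "org"
# are class names defined by the hCard microformat.
SAMPLE = """
<div class="vcard">
  <span class="fn">Alex Iskold</span>,
  <span class="org">AdaptiveBlue</span>
</div>
"""

class HCardParser(HTMLParser):
    """Collect the text of elements whose class is a property we care about."""
    PROPERTIES = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text we are currently reading

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None  # simplification: assumes no nested elements

parser = HCardParser()
parser.feed(SAMPLE)
print(parser.card)  # {'fn': 'Alex Iskold', 'org': 'AdaptiveBlue'}
```

The page still renders as ordinary HTML for people, while the class names give software enough of a hint to recover a name and an organization.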

Microformats are gaining popularity because of their simplicity, but they are still quite limiting.
There is no way to describe type hierarchies, which the classic semantic community would
say is critical. The other issue is that microformats are somewhat cryptic, because the focus
is to keep the annotations to a minimum. This, in turn, raises the question of whether
embedding metadata into the view (HTML) is a good idea. The question is: what happens
if the underlying data changes after someone has made a copy of the HTML document?
Nevertheless, despite these issues, adoption keeps growing.
Microformats are currently used by Flickr, Eventful, and LinkedIn; and many other companies are looking
to adopt microformats, particularly because of the recent Yahoo! announcement.

An even simpler approach is to put metadata into the meta headers. This approach
has been around for a while and it is a shame that it has not been widely adopted.
As an example, the New York Times recently launched extended annotations for its news pages.
The benefit of this approach is that it works great for pages that are focused on a topic or
a thing. For example, a news page can be described with a set of keywords, geo location,
date, time, people, and categories. Another example would be for book pages.
O’Reilly.com has been putting book information into the meta headers, describing the author,
ISBN, and category of the book.
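Reading such headers takes very little code. The sketch below collects name/content pairs from meta tags using Python's standard html.parser module. The meta names and values in the sample are hypothetical placeholders, not the tags that O’Reilly or any other site actually uses.

```python
from html.parser import HTMLParser

# Sample page head written for this example. The meta names and values are
# hypothetical placeholders, not the tags any real site uses.
SAMPLE = """
<html><head>
  <title>Example Book</title>
  <meta name="book.author" content="Jane Doe">
  <meta name="book.isbn" content="000-0-00-000000-0">
  <meta name="book.category" content="Programming">
</head><body></body></html>
"""

class MetaParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]

parser = MetaParser()
parser.feed(SAMPLE)
print(parser.meta["book.author"])  # Jane Doe
```

Because the description applies to the page as a whole, there is no need to locate the annotated element inside the body, which is what makes this approach so simple.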

Despite the fact that all these approaches are different, they are also somewhat complementary; and each of them is helpful. The more annotations there are in web pages, the more
standards are implemented, and the more discoverable and powerful the information becomes.

3. Consumer and Enterprise

Yet another dimension of the conversation about the Semantic Web is the focus
on consumer and enterprise applications. In the consumer arena
we have been looking for a Killer App – something that delivers tangible and
simple consumer value. People simply do not care that a product is built on the
Semantic Web; all they are looking for is utility and usefulness.

Up until recently, the challenge has been that the Semantic Web focused on
rather academic issues – like annotating information to make it machine-readable.
The promise was that once the information is annotated and the web becomes one big
giant RDF database, then exciting consumer applications would come. The skeptics, however,
have been pointing out that first there needs to be a compelling use case.

Some consumer applications based on the Semantic Web include generic and vertical search,
contextual shortcuts and previews, personal information management systems, and semantic
browsing tools. All of these applications are in their early days and have a long way to go before being truly compelling for the average web user.
Still, even if these applications succeed, consumers will not be interested in knowing about the
underlying technology – so there is really no marketing play for the Semantic Web in the consumer space.

Enterprises are a different story for a couple of reasons. First, enterprises are much more used
to techno speak. To them utilizing semantic technologies translates into being intelligent
and that, in turn, is good marketing. ‘Our products are better and smarter because we use the
Semantic Web’ sounds like a good value proposition for the enterprise.

But even above the marketing speak, RDF solves a problem of data interoperability
and standards. This “Tower of Babel” situation has been in existence since the early
days of software. Forget semantics; just a standard protocol, a standard way to pass around
information between two programs, is hugely valuable in the enterprise.

RDF offers a way to communicate using an XML-based language, which on top of that has sound
mathematical elements to enable semantics. This sounds great, and even the complexity of RDF is
not going to stop enterprises from using it. However, there is another problem that might stop it – scalability.
Unlike relational databases, which have been around for ages and have been optimized and tuned,
XML-based databases are still not widespread. In general, the problem is in the scale and
querying capabilities. Like object-oriented database technologies of the late ’90s,
XML-based databases hold a lot of promise, but we have yet to see them in action in a big way.

4. Semantic APIs

With the rise of Semantic Web applications, we are also seeing the rise
of Semantic APIs. In general, these web services take as an input unstructured information
and find entities and relationships. One way to think of these services is mini natural language
processing tools, which are only concerned with a subset of the language.

The first example is the Open Calais API from Reuters, which we have covered in two earlier articles.
This service accepts raw text and returns information about people, places, and companies found in the document.
The output not only returns the list of found matches, but also specifies places in the document where
the information is found. Behind Calais is a powerful natural language processing technology developed
by ClearForest (now owned by Reuters), which relies on algorithms and databases to extract entities out of text. According to
Reuters, Calais is extensible, and it is just a matter of time before new entities will be added.
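To illustrate the shape of such a service, here is a toy entity extractor that returns each match together with its position in the text. It does not call Calais and is in no way how Calais works internally; it is a simple dictionary lookup, and the GAZETTEER table, the extract_entities() function, and the sample sentence are all our own assumptions.

```python
import re

# Tiny hand-made lookup table (a "gazetteer"). A real service would rely on
# far larger databases and on language-processing algorithms.
GAZETTEER = {
    "Tim Berners-Lee": "Person",
    "Reuters": "Company",
    "France": "Place",
}

def extract_entities(text):
    """Return (name, type, start, end) for each known entity found in text."""
    found = []
    for name, kind in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((name, kind, match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])  # order by position

# Made-up sentence used only to exercise the function.
text = "Reuters interviewed Tim Berners-Lee in France."
for entity in extract_entities(text):
    print(entity)
```

Returning offsets alongside the entity type is what allows a client application to highlight or link the exact spot in the document where each entity occurs.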

Another example is the SemanticHacker API from TextWise, which is offering a one million dollar prize for the best commercial semantic
web application developed on top of it. This API classifies information in documents into categories called semantic signatures.
Given a document, it outputs entities or topics that the document is about. It is kind of like Calais, but
also delivers a topical hierarchy, where the actual objects are leaves.

Another semantic API is offered by Dapper – a web service which facilitates the extraction of
structure from unstructured HTML pages. Dapper works by enabling users to define attributes
of an object based on the bits of the page. For example, a book publisher might define where the
information about author, ISBN and number of pages is on a typical book page and the Dapper application
would then create a recognizer for any page on the publisher site and enable access to it via
REST API.

While this seems backwards from an engineering point of view, Dapper’s technology
is remarkably useful in the real world. In a typical scenario, for websites that do not have clean APIs to
access their information, even non-technical people can build an API in minutes with Dapper.
This is a powerful way of quickly turning websites into web services.
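The general idea of "define where the fields are once, then recognize every page that follows the same template" can be sketched as follows. This is not Dapper's actual mechanism or API; the RULES patterns, the make_recognizer() function, and the sample page are hypothetical, and regular expressions stand in for whatever a real tool does under the hood.

```python
import re

# Hypothetical rules a publisher might define once for its book pages:
# each field name maps to a pattern locating that value in the HTML.
RULES = {
    "author": r'<span id="author">(.*?)</span>',
    "isbn": r'<span id="isbn">(.*?)</span>',
    "pages": r'<span id="pages">(\d+)</span>',
}

def make_recognizer(rules):
    """Compile the rules into a function that maps a page to a record."""
    compiled = {field: re.compile(pattern) for field, pattern in rules.items()}

    def recognize(html):
        record = {}
        for field, pattern in compiled.items():
            match = pattern.search(html)
            record[field] = match.group(1) if match else None
        return record

    return recognize

recognize_book = make_recognizer(RULES)

# Made-up page that follows the publisher's template (it has no ISBN field).
PAGE = '<h1>Example Book</h1><span id="author">Jane Doe</span><span id="pages">320</span>'
print(recognize_book(PAGE))
```

Fields that cannot be found come back as None rather than raising an error, since real pages rarely follow a template perfectly.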

5. Search Technologies

Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search.
The premise that a semantic understanding of pages leads to vastly better search
has yet to be validated. The two main contenders, Hakia and PowerSet, have made some progress, but not enough.
The problem is that Google’s algorithm, which is based on statistical analysis, deals just fine with
semantic entities like people, cities, and companies.
When asked “What is the capital of France?”, Google returns a good enough answer.

There is a growing realization that marginal improvement in search might not be
enough to beat Google or to declare search the killer app for the Semantic Web.
Likely, understanding semantics is helpful but not sufficient to build a better search engine.
A combination of semantics, innovative presentation, and memory of who the user is will be
necessary to power the next generation search experience.

Alternative approaches also attempt to overlay semantics on top of the search results.
Even Google ventures into verticals by partitioning the results into different categories.
The consumer can then decide which type of answer they are interested in.

Yet search is a game that is far from won and a lot of semantic companies are really trying to
raise the bar. There may be another twist to the whole search play – contextual technologies,
as well as semantic databases, could lead to qualitatively better results. And so we turn to
these next.

6. Contextual Technologies

We are seeing an increasing number of contextual tools entering the consumer market.
Contextual navigation does not just improve search, but rather shortcuts it.
Applications like Snap, Yahoo! Shortcuts, and SmartLinks “understand”
the objects inside text and links and bring relevant information right into the user’s context.
The result is that the user does not need to search at all.

Thinking about this more deeply, one realizes that contextual tools leverage semantics
in a much more interesting way. Instead of trying to parse what a user types into
the search box, contextual technologies rely on analyzing the content. So the meaning
is derived in a much more precise way – or rather, there is less guessing. The contextual tools
then offer the users relevant choices, each of which leads to a correct result. This is fundamentally
different from trying to pull the right results from a myriad of possible choices resulting from a
web search.
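A toy sketch of that flow: recognize the objects mentioned in the content, then offer choices that fit each object's type. This is not how any of the products above are implemented; the KNOWN_OBJECTS and ACTIONS tables and the contextual_choices() function are illustrative assumptions.

```python
# Hypothetical mapping from a recognized object type to the choices
# a contextual tool might offer. The action names are illustrative.
ACTIONS = {
    "book": ["read reviews", "compare prices", "find at a library"],
    "movie": ["watch trailer", "find showtimes"],
}

# Hand-made list of objects the tool "knows". A real tool would recognize
# objects with far more sophisticated analysis than substring matching.
KNOWN_OBJECTS = {
    "Moby-Dick": "book",
    "Casablanca": "movie",
}

def contextual_choices(page_text):
    """Return {object: [choices]} for every known object mentioned on the page."""
    return {
        name: ACTIONS[kind]
        for name, kind in KNOWN_OBJECTS.items()
        if name in page_text
    }

page = "Our book club is reading Moby-Dick this month."
print(contextual_choices(page))
```

Since the type of the object is known before any choice is offered, every option leads somewhere sensible, which is the "less guessing" described above.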

We are also seeing an increasing number of contextual technologies make their way into the browser. Top-down semantic technologies need to
work without publishers doing anything; and so to infer context, contextual technologies integrate into
the browser. Firefox’s recommended extensions page features a number of contextual browsing solutions –
Interclue, ThumbStrips,
Cooliris, and BlueOrganizer (from my own company).

The common theme among these tools is the recognition of information and the creation of specific
micro contexts for the users to interact with that information.

7. Semantic Databases

Semantic databases are another breed of semantic applications focused on annotating
web information to be more structured. Twine, a product of Radar Networks and currently in private beta,
focuses on building a personal knowledge base. Twine works by absorbing unstructured content in various
forms and building a personal database of people, companies, things, locations, etc. The content is sent to Twine
via a bookmarklet, via email, or manually. The technology needs to evolve more, but
one can see how such databases can be useful once the kinks are worked out. One of the very powerful applications
that could be built on top of Twine, for example, is personalized search – a way to filter the results of any search engine based
on a particular individual.
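Personalized search of this kind could, in its simplest form, look like the sketch below: results from any search engine are re-ordered by how many of the user's known interests they mention. This is not based on Twine's implementation; the interests, the result titles, and the personalize() function are made up for illustration.

```python
# Interests a personal knowledge base might have collected (made-up data).
INTERESTS = {"semantic web", "rdf"}

def personalize(results, interests):
    """Order results so those mentioning more of the user's interests come first."""
    def score(title):
        lowered = title.lower()
        return sum(1 for topic in interests if topic in lowered)

    # sorted() is stable, so ties keep the search engine's original order.
    return sorted(results, key=score, reverse=True)

# Made-up result titles in the order a search engine returned them.
results = [
    "Celebrity news roundup",
    "An RDF primer",
    "Semantic Web and RDF tools",
    "Weather forecast",
]
for title in personalize(results, INTERESTS):
    print(title)
```

The search engine itself is untouched; the personal database only acts as a filter on top of it.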

It is worth noting that Radar Networks has spent a lot of time getting the infrastructure right. The underlying
representation is RDF and is ready to be consumed by other semantic web services. But a big chunk of the core
algorithms, the ones that are dealing with entity extraction, are being commoditized by Semantic Web APIs. Reuters offers this as an API call, for example, and so moving forward, Twine won’t need to be concerned with how to do that.

Another big player in the semantic databases space is a company called Metaweb, which created Freebase.
In its present form, Freebase is just a fancier and more structured version of Wikipedia – with RDF inside and less information
in total. The overall goal of Freebase, however, is to build a Wikipedia equivalent of the world’s information.
Such a database would be enormously powerful because it could be queried exactly – much like relational databases. So once
again the promise is to build much better search.
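To show what "queried exactly" means, here is a minimal sketch of pattern matching over structured facts. It does not use Freebase or its query language; the FACTS list and the query() function are our own assumptions.

```python
# Toy structured database of (subject, predicate, object) facts.
FACTS = [
    ("Paris", "capital_of", "France"),
    ("Paris", "type", "City"),
    ("Berlin", "capital_of", "Germany"),
    ("Berlin", "type", "City"),
]

def query(subject=None, predicate=None, obj=None):
    """Return facts matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in FACTS
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# "What is the capital of France?" as an exact query rather than a keyword search.
print(query(predicate="capital_of", obj="France"))  # [('Paris', 'capital_of', 'France')]
```

Unlike a keyword search, the query either matches a stored fact or returns nothing, so there is no ranking and no ambiguity about what was asked.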

But the problem is, how can Freebase keep up with the world? Google indexes the Internet daily and grows together with the web.
Freebase currently allows editing of information by individuals and has bootstrapped by taking in parts of Wikipedia and other
databases, but in order to scale this approach, it needs to perfect the art of continuously taking in unstructured information
from the world, parsing it, and updating its database.

The problem of keeping up with the world is common to all database approaches, which are effectively silos. In the case of Twine,
there needs to be a continuous influx of user data, and in the case of Freebase there needs to be an influx of data from the web.
These problems are far from trivial and need to be solved successfully in order for the databases to be useful.

Conclusion

With any new technology it is important to define and classify things. The Semantic Web is offering an exciting promise: improved information discoverability, automation of complex searches, and innovative web browsing. Yet the Semantic
Web means different things to different people. Indeed, its definitions in the enterprise and consumer spaces are different,
and there are different means to a common end – top-down vs. bottom-up and microformats vs. RDF. In addition to these patterns,
we are observing the rise of semantic APIs and contextual browsing tools. All of these are in their early days but hold great
promise to fundamentally change the way we interact with information on the web.

What do you think about Semantic Web Patterns? What trends are you seeing and which applications are you waiting for? And if you work with semantic technologies in the enterprise, please share your experiences with us in the comments below.