By Alex Iskold

We've been writing recently about the rise of the semantic web and how in 2007 we'll see many interesting semantic technologies. The fundamental problem that all these technologies need to solve is explaining the meaning of things to computers. There are several approaches to this, all of which can work in principle.

There are companies and technologies that are doing it bottom-up - by embedding semantic annotations (meta-data) right into the data. The opposite camp is exploring the top-down approach, which relies on analyzing existing information. The ultimate top-down solution would be a full-blown natural language processor that is able to understand text the way people do.

In this post, we are going to look at ClearForest - one of the companies in the top-down camp. At first glance, you might not think much of the company's web site, but a deeper dive reveals that ClearForest is restructuring to apply its core natural language processing technology to next generation semantic applications. The fact that ClearForest has released both a Web Service and a Firefox extension that uses the service's API to deliver an end-user application says that the company gets what the next generation web is all about.

Gnosis - Firefox extension for annotating web pages with semantics

The first ClearForest product that we looked at was the Firefox extension called Gnosis. Here is how it is described on the Mozilla extensions page:

"With a single click, Gnosis will identify the people, companies, organizations, geographies and products on the page you are viewing. Using the built-in navigation sidebar you can gain immediate understanding of the page’s contents."

Downloading and installing Gnosis was as easy as with any Firefox add-on. We used the Read/WriteWeb home page to try the extension. With one click from the menu, the page was filled with various types of annotations. The current version of Gnosis recognizes Companies, Countries, Industry Terms, Organizations, People, Products and Technologies - an impressive range of things. Each word that Gnosis recognized was colored according to its category.

This was interesting, but overwhelming. A better approach would be to have the coloring appear on a mouse-over or another gesture. But this is a usability nuance that will probably get polished in the next iteration of the product. Overall, I was impressed. In an instant, the page was analyzed and annotated. It was not perfect (it thought that all the Jasons on the page were Jason Briggs), but it was more accurate than I expected it to be.

Next I turned my attention to the sidebar. The extension created a categorised tree of all the words and phrases that it found on the page. We could expand and collapse each category to find the terms. It looked like vertical search for a single page. It was interesting, and it is very useful for blogs and lengthy pages.
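To make the sidebar idea concrete, here is a minimal sketch of how recognized terms could be grouped into a collapsible category tree. This is not ClearForest's code; the function name `build_category_tree` and the sample annotations are hypothetical, chosen only to illustrate the grouping.

```python
from collections import defaultdict


def build_category_tree(annotations):
    """Group (term, category) pairs into {category: sorted unique terms}."""
    tree = defaultdict(set)
    for term, category in annotations:
        # A set drops repeated mentions of the same term on a page.
        tree[category].add(term)
    # Sort categories and terms so the tree renders in a stable order.
    return {category: sorted(terms) for category, terms in sorted(tree.items())}


# Hypothetical annotations, as an extension like Gnosis might produce them.
sample = [
    ("ClearForest", "Company"),
    ("Firefox", "Product"),
    ("Alex Iskold", "Person"),
    ("ClearForest", "Company"),
    ("Gnosis", "Product"),
]

tree = build_category_tree(sample)
for category, terms in tree.items():
    print(category)
    for term in terms:
        print("  " + term)
```

Each top-level key corresponds to one expandable node in the sidebar, and the list under it holds the terms shown when the node is expanded.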

Again, the interface needs to evolve - but the idea that key terms and concepts on any page can be identified and organized in such a way seems compelling. In addition to the organization, the extension offered to search for any keyword on Google, Wikipedia or Technorati. If you are interested in a keyword, you are likely to want to find more related information. So the context search seems like a logical extension of categorisation, as it makes this data further searchable.
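The context search described above amounts to turning a recognized term into a query link for each search site. A rough sketch follows; the URL templates for Google and Wikipedia are my assumptions about how such links could be built, not something taken from the extension, and Technorati is left out because I am not sure of its query format.

```python
from urllib.parse import urlencode

# Assumed search endpoints: (base URL, name of the query parameter).
SEARCH_TEMPLATES = {
    "Google": ("https://www.google.com/search", "q"),
    "Wikipedia": ("https://en.wikipedia.org/w/index.php", "search"),
}


def context_search_links(term):
    """Return {site name: search URL} for a recognized term."""
    return {
        name: f"{base}?{urlencode({param: term})}"
        for name, (base, param) in SEARCH_TEMPLATES.items()
    }


links = context_search_links("Alex Iskold")
for name, url in links.items():
    print(name, url)
```

Because the links are derived from the term alone, any keyword in the category tree can be made searchable without extra data from the page.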

Overall, this seemed unpolished but intriguing. The question is, how does this work? The Firefox page stated that this extension is based on a web service. So this is what I want to explore next...

ClearForest's Semantic Web Service (SWS)

Behind every great service there is an API. Modern web companies have re-discovered an old piece of software engineering wisdom - interfaces are a powerful way to build complex software. Today we are seeing the rise of the most complex software system yet - a service-powered web. ClearForest has also recognized the value of building a product on top of a service (both can be monetized independently). Gnosis leverages the interface to offer a powerful natural language processing service.

The Semantic Web Service (perhaps the name is a bit broad) offers a SOAP interface for analyzing text, documents and web pages. The service returns categorization and annotation information, which can be further leveraged by consumer-facing applications (the company recommends building mashups). I am fairly certain that SWS is powered by a web crawler, because it is able to recognize people like Richard MacManus, Jason Biggs and Alex Iskold. My guess is that the crawler is used to build a giant index that is then used by the document parser to annotate the terms in the document.
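To illustrate the guess above, here is a minimal sketch of index-based annotation: a lookup table of known terms is matched against a document and each hit is tagged with its category. This only illustrates my speculation and is not how SWS is known to work; the `INDEX` contents and the `annotate` function are hypothetical.

```python
import re

# Hypothetical index of known terms, standing in for one built by a crawler.
INDEX = {
    "Richard MacManus": "Person",
    "Alex Iskold": "Person",
    "ClearForest": "Company",
    "Firefox": "Product",
}


def annotate(text, index):
    """Return (start, end, term, category) for each indexed term found in text."""
    if not index:
        return []
    # Try longer terms first so they win over shorter overlapping ones.
    terms = sorted(index, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [
        (m.start(), m.end(), m.group(0), index[m.group(0)])
        for m in pattern.finditer(text)
    ]


text = "Alex Iskold reviewed ClearForest and its Firefox extension."
results = annotate(text, INDEX)
for start, end, term, category in results:
    print(f"{start}-{end} {term} [{category}]")
```

A pure lookup like this would also explain the kind of mistake noted earlier, where every mention of a first name gets resolved to the one person the index happens to know.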

The service is currently free to try, but you need to contact ClearForest to use it commercially. To encourage usage of the service, the company announced a mashup contest. The contest was advertised on ProgrammableWeb and ended December 11th. It is not clear to me that it was successful, as there are no announcements of winners and no showcase - but it certainly seems like a creative way to promote the new API.

Conclusion

ClearForest might not have a glamorous/Ajaxy web site and might not have a polished product yet. But it is a company that has been around for a while and has been backed by top tier VC firms. Both the approach and the technology are worth attention and consideration. Their natural language processing technology, first applied to business data mining, is able to clearly distill useful information. Offering it as a service shows insight and an understanding of the new market opportunities (think Amazon). And creating a Firefox extension that showcases the technology demonstrates their desire and readiness to go mainstream.

All these factors indicate that ClearForest is worth watching. It is yet another brick in support of the top-down semantic web approaches. Let us know what you think about this company.