Extractiv has quietly launched a service that crawls the Web for text on a specific topic, then transforms it into “structured semantic data.” It’s a direct competitor to Thomson Reuters’ Calais product, which has been doing this for a couple of years now. This type of service is potentially valuable to media companies, search services and monitoring applications – because it turns messy, unorganized HTML content into data that is organized into categories and given other semantic ‘meaning.’

I sat down with Extractiv CEO Shion Deysarkar at the recent Semantic Technology conference in San Francisco, to find out how Extractiv intends to compete with the more well-known and big media backed Calais.
How Extractiv Works
Extractiv is a joint venture between Houston-based web crawling service 80legs and natural language processing company LCC (which created Swingly, a Q&A service).
Deysarkar explained that Extractiv uses technology from both of its parent companies, to crawl the Web for content on a particular topic and then – using natural language processing – transform it into structured data. This video, produced by Extractiv, explains how the service might be used to crawl the Web for stories about smart phones over the past month.
The output of the crawl and analysis can be JSON or XML, two formats commonly used for structured data. Support for RDFa, a popular Semantic Web standard, will be available “soon” according to the company. Extractive also offers an API, allowing customers to bypass the web site.
ReadWriteWeb’s Guide to The Semantic Web:
Extractiv is free to try, but if you’ll be a moderate or heavy user of the service then you’ll have to pay (the pricing is as yet unavailable on the web site).
Extractiv vs Calais
Deysarkar told ReadWriteWeb that Extractiv is targeting “mid-market Calais customers” – such as media companies or those developing search applications, monitoring services, recommendation engines or aggregators. He also claimed that Extractiv goes beyond what Calais offers, because it can mine sentiment data (which is data about how people feel about products and services).
Extractiv also wants to “provide access to more types of semantic information than any other provider.” As CEO of partner company LCC, Andrew Hickl, put it, “if you’re interested in baseball pitchers, a generic type like PERSON just won’t cut it.”
At launch, Extractiv offers about 250 different types of named entities, but it aims to have more than 3000 different entity types by the end of the U.S. summer.
Preparing For the Future of the Web
The product is not aimed at the consumer market, so it’s not for the faint hearted and you need to know what to do with all of that XML or JSON data! It also remains to be seen how competitive it is with Calais, which is a proven performer and has many reputable companies as its customers. Some startups have taken on Calais before, but fallen short.
However, there is undoubtedly a need for products like Extractiv and Calais that turn the Web’s unstructured data into meaningful, organized content. This is the future of the Web, because there is going to be a large increase in the quantity of data online over the next 5-10 years – and all of that data will need to be structured if we’re going to be make the best use of it.