Machine processing of large quantities of unstructured text to discover media mentions, relationships between entities and sentiment need not be priced out of the range of the everyday web lover or small business.
Tonight two Texas companies announced a collaboration that brings exactly that to market, at a disruptively low price. Web crawling service 80Legs and Natural Language Processing service Language Computer Corporation have combined their efforts to create Extractiv, a web crawling and semantic analysis service offered at an affordable price. I’ve already put it to use to perform some awesome bulk text analysis for my own work.
Above: Extractiv correctly identified the people, places and dates in my article today about Jay Adelson’s new job. It only misidentified one geek as an athlete, not bad. Picture this analysis spread over hundreds of thousands or millions of documents and you are, as they say, cooking with gas.
Testing the Tool
To test Extractiv, I gave the company a collection of more than 500 web domains for the top geolocation blogs online and asked its technology to find all appearances of the word “ESRI,” the name of the leading vendor in the geolocation market.
The resulting output included structured cells describing some person, place or thing, some type of relationship it had with the word ESRI and the URL where the words appeared together. It was thus sortable and ready for my analysis.
The task was partially completed before being rate limited due to my submitting so many links from the same domain. More than 125,000 pages were analyzed, 762 documents were found that included my keyword ESRI and about 400 relations were discovered (including duplicates). What kinds of patterns of relations will I discover by sorting all this data in a spreadsheet or otherwise? I can’t wait to find out.
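For readers who would rather script that sorting than do it in a spreadsheet, here is a minimal sketch of one way to go about it. It assumes the output has been exported as CSV with columns named entity, relation and url; those column names, the sample rows and the helper functions are all hypothetical placeholders, not Extractiv's actual output format or API.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The column names and values are placeholders,
# not Extractiv's real output format.
SAMPLE = """entity,relation,url
Person A,works_with,http://example.com/a
Person A,works_with,http://example.com/a
Place B,located_near,http://example.com/b
Person A,works_with,http://example.com/c
"""

def load_relations(text):
    """Parse CSV text into a list of dicts, one per relation row."""
    return list(csv.DictReader(io.StringIO(text)))

def dedupe(rows):
    """Drop rows that repeat the same entity, relation and URL."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["entity"], row["relation"], row["url"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def count_relations(rows):
    """Count how often each (entity, relation) pair appears across pages."""
    return Counter((row["entity"], row["relation"]) for row in rows)

rows = dedupe(load_relations(SAMPLE))
counts = count_relations(rows)

# Print the most frequent pairs first.
for (entity, relation), n in counts.most_common():
    print(entity, relation, n)
```

Removing exact duplicates first keeps a pair that is repeated on a single page from inflating its count, so the ranking reflects how many distinct pages mention it.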
That work took the machine about an hour and would have cost me less than $1, after a $99 monthly subscription fee. The next subscription level, at a base rate of $250 per month, would have completed the job faster, with more simultaneous processes running.
The machine isn’t perfect, but it looks very impressive for having just launched this evening. Would I use Extractiv for my bulk text analysis again in the future? Of course I would. In fact, I intend to start thinking immediately about what text I’d like analyzed next.
This sort of service represents an incredible vision of the future: commodity level, DIY analysis of bulk data produced by user generated or other content, sortable for pattern detection and soon, Extractiv says, sentiment analysis.
The People Behind the Technology
80Legs is led by CEO Shion Deysarkar, a former oil industry computer scientist turned social network data hacking entrepreneur whom we profiled this Spring. (Thoughts From the Man Who Would Sell The World, Nicely) Deysarkar and 80Legs CTO Toan Duong describe themselves online as employed by Creeris Ventures, a Houston venture capital firm with a diverse portfolio including grid computing, jet airplanes and litigation.
The Extractiv collaborators from Language Computer Corporation include John Lehmann, CEO at LCC since September and President at Extractiv. Also co-founding the company is Andy Hickl, an NLP expert of the highest order, most recently of question-answering machine Swingly.