If you’ve been within shouting distance of me over the last month, you’ve probably heard me singing the praises of Needlebase, a great new point-and-click tool for extracting, sorting and visualizing data from pages across the web. I’ve been using it for all kinds of things and now you can too.
When we first reviewed Needle here on ReadWriteWeb, it was in closed beta and new users had to request an account. Now it’s open and available for all: free for personal use or by subscription for commercial use. Check out some examples of ways I’ve used this exciting new technology below.
Needlebase lets you view web pages through a virtual browser, then point and click to train it to understand which fields on a page interest you and how those fields relate to each other. The program then scrapes the data from all of those fields, publishes it as a table, list or map, and recommends merging entries that appear to be mistakenly separate. It’s very cool, and it lets non-technical people quickly and easily do things with data that used to require the help of someone more technical.
For example, I’ve already used Needle to do the following. But first, the official Needle demo video…
And here are a few ways I’ve used Needlebase so far.
Last month a local newspaper reported that a big new data center had opened in Salt Lake City with a mystery anchor client. The paper believed the client was Twitter, since the company had said it would open its first off-site data center in Utah at an undisclosed date.
We used Needlebase to look at all the tweets from people on the Twitter list of Twitter staff members and extract the username, message body and location, if exposed. Needlebase scraped the last 1500 Tweets in less than 5 minutes. We displayed them on a map and saw that there was just one Tweet published in that time from Utah: a Twitter Site Operations Technician who had just left San Francisco to move to Salt Lake City, complaining about Qwest router problems. That wasn’t quite confirmation, but it sure felt like a valuable clue and was very easy to come by thanks to Needlebase.
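If you’d rather do that last step outside of Needlebase, here’s a rough sketch of how an exported table of tweets could be grouped by location. To be clear, this is not how Needlebase works internally. The column names (username, message, location) and the sample rows are made up for illustration; a real export may be laid out differently.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for an exported table of scraped tweets.
SAMPLE = """username,message,location
staffer_a,Lunch at the office,"San Francisco, CA"
staffer_b,Fighting with a router again,"Salt Lake City, UT"
staffer_c,No location on this one,
"""

def tweets_by_location(csv_text):
    """Return the rows that expose a location, plus a count of tweets per location."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    located = [row for row in rows if row["location"].strip()]
    counts = Counter(row["location"] for row in located)
    return located, counts

def from_place(rows, place):
    """Keep only rows whose location mentions the given place (case-insensitive)."""
    return [row for row in rows if place.lower() in row["location"].lower()]

located, counts = tweets_by_location(SAMPLE)
for row in from_place(located, "Salt Lake City"):
    print(row["username"], "-", row["message"])
```

Tweets without an exposed location simply drop out, which is why a single hit from an unexpected place stands out so clearly.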
Last night I found a solution to a long-running issue I’ve been struggling with. I’ve got this list of 300 blogs around the web that cover geotechnology (that’s a whole other story) and I run them all through Postrank. That service ranks them from most to least social media and reader engagement per blog post.
Wouldn’t it be great to extract that data over time, to track it and to turn it into blog posts? I think it would. I couldn’t figure out how to get all the data out that I wanted though.
Enter Needlebase. Last night I pointed Needle to my Postrank pages for geotech blogs and in minutes it pulled down all the data I wanted. I exported that data as a CSV, uploaded it to Google Docs as a spreadsheet, did a little subtraction and now have the following chart tracking the top 300 geotech blogs on the web. In that handy spreadsheet, I set up a function to show me which blogs jumped or fell in the rankings the most over the previous week. Thanks, Needlebase!
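For anyone who prefers a script to a spreadsheet, the "little subtraction" could look something like the sketch below. I’m assuming two weekly CSV exports, each with a "blog" column and a "rank" column; those column names and the sample blogs are hypothetical, not what Needlebase or Postrank actually produce.

```python
import csv
import io

def load_ranks(csv_text):
    """Map each blog name to its integer rank, given CSV text with 'blog' and 'rank' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["blog"]: int(row["rank"]) for row in reader}

def rank_changes(previous, current):
    """Return (blog, change) pairs, biggest movers first.

    change = previous rank - current rank, so a positive number means
    the blog moved up the list. Blogs missing from last week are skipped.
    """
    changes = [
        (blog, previous[blog] - rank)
        for blog, rank in current.items()
        if blog in previous
    ]
    return sorted(changes, key=lambda pair: abs(pair[1]), reverse=True)

# Made-up exports from two consecutive weeks.
last_week = "blog,rank\nAlpha Maps,1\nBeta Geo,2\nGamma GIS,3\n"
this_week = "blog,rank\nGamma GIS,1\nAlpha Maps,2\nBeta Geo,3\n"

movers = rank_changes(load_ranks(last_week), load_ranks(this_week))
for blog, change in movers:
    print(f"{blog}: {change:+d}")
```

Sorting by the absolute size of the change puts the biggest jumps and the biggest falls at the top of the same list.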
I’ve written here about how to use Mechanical Turk to get ready and rock an industry event. Needlebase can prove useful for that as well.
My wife Mikalina, for example, has used Needle to extract the session titles, speakers, topic tags and more information about all of the SXSW Interactive sessions that have been announced so far. The sky’s the limit on what could be done using that.
There are all kinds of other ways that a tool like this can be used. There is a learning curve, but it’s nothing compared to what it would take to learn to do this kind of work programmatically. When we first reviewed Needlebase, beta invites had to be requested by email. We got emails, which were then forwarded to the company, from a wide variety of people: a Japanese potter, a local yarn store owner, a geocacher who wanted to organize his online geocaching information, and enterprise mobile app developers, for example. We also got an email from a publisher who wanted to scrape their website for place names and see which parts of the world they cover the most and least.
Needlebase was built as a side project of travel search company ITA Software. Google is currently in legal negotiations to acquire ITA (the US government isn’t sure it wants Google to own travel search too).
What will Google do with Needlebase if it gets its hands on it? I’m much more interested in hearing what you are going to do with it, now that anyone can use it.
The DIY Data Hackers Toolkit
In my mind, Needle sits between two other wonderful tools. On one end of the spectrum is the now Yahoo-acquired Dapper, which anyone can use to build an RSS feed from changes made to any field on any web page. (See: The Glory and Bliss of Screen Scraping and How Yahoo’s Latest Acquisition Stole and Broke My Heart)
On the other end of the spectrum is the brand-new Extractiv, a bulk web-crawling and semantic analysis tool that’s also remarkably easy to use. Earlier this month I used Extractiv to search across 300 top geotech blogs for all instances of the word “ESRI,” all entities mentioned in relation to ESRI and the words used to describe those relations. The service processed 125,000 pages and spit out my results in less than an hour for less than a dollar. That’s incredible – it’s a game changer.
Needlebase is too. It sits somewhere in between Dapper and Extractiv, I think. These tools are democratizing the ability to extract and work with data from across the web. They are to text processing what blogging was to text publishing.
I’ll stop now so you can go and start learning to use Needlebase. Let me know what cool things you figure out how to use it for.