Data extracted from 500 million Twitter messages was released today by a tiny Texas startup that forward-looking geeks have been watching for a year. Austin-based InfoChimps announced this afternoon that it is now selling two important and very large sets of Twitter data. Limited samples of the data are available for free, and a third set, the most important, won’t be ready for a few more hours.
“What we want is to see people use this to build web apps,” InfoChimps co-founder Flip Kromer told us today. “You take this data, mash it up with any other very large corpus of data with timestamps – and you’ve got a web app.”
This is specific, extracted data, though, not the full text of tweets. “We’re trying to be careful,” Kromer says. “We are not yet exposing the contents of tweets.” And the data isn’t cheap if you want the numbers broken out by the hour instead of the month.
This is a very big move because most developers struggle to get access to a large quantity of data from Twitter.
Tweet #38 in the History of Twitter: “oh this is going to be addictive” – by @dom
Here’s what InfoChimps is putting on sale:
- Hashtags, links and smiley emoticons used across Twitter on an hour-by-hour basis.
- @ messages, RTs and favorites, and who they came from: 1 billion relations, making what the company calls a “conversation metric.”
- A useful if less exciting set of data that will help developers map user ID numbers from search.twitter over to the different ID numbers used in the primary Twitter API. These systems were never merged and it can require a lot of API calls to merge user data.
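For developers wondering what an hour-by-hour breakdown like the first item might look like, here is a minimal sketch in Python. The sample records, the `(timestamp, text)` layout and the function name are assumptions made up for illustration. They are not InfoChimps’ actual schema or pipeline.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

HASHTAG = re.compile(r"#\w+")

def hashtags_by_hour(tweets):
    """Count hashtags per hour. `tweets` is an iterable of
    (iso_timestamp, text) pairs; this layout is hypothetical."""
    buckets = defaultdict(Counter)
    for ts, text in tweets:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
        for tag in HASHTAG.findall(text):
            buckets[hour][tag.lower()] += 1
    return buckets

# Invented sample data, for illustration only.
sample = [
    ("2009-11-10T14:05:00", "Watching the game #nfl"),
    ("2009-11-10T14:40:00", "#NFL scores are in http://example.com"),
    ("2009-11-10T15:02:00", "oh this is going to be addictive #twitter"),
]
counts = hashtags_by_hour(sample)
```

The same bucketing would presumably apply to links and emoticons by swapping the pattern. Joining the resulting hourly counts against another timestamped data set is the kind of mashup Kromer describes.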
The company believes it is capturing about 10% of the total data on Twitter right now, but Kromer says that he believes he can ramp that up to 30%.
Data as a Pot of Gold
InfoChimps is a bulk data marketplace with more than 5000 data sets in its catalog so far. The vast majority are free and were added by the company’s own staff, but not all. The decades-old polling firm Zogby International, for example, is selling some Iraqi polling data through InfoChimps. Cross-reference that polling data with publicly available data about civilian casualties in Iraq and you can see some interesting patterns, InfoChimps’ PR rep Josh Dilworth told us. (Dilworth is known as the most data-savvy PR guy in the Web 2.0 world and also represents Wolfram Alpha and Twine.)
The company hopes that it can sell the data derived from sitting on the Twitter API as a demonstration of the value that this and other data sets have. InfoChimps says it can help companies monetize data that they’d otherwise be paying to serve up through repeated API calls, if at all.
From sentiment analysis (not yet an option with the current InfoChimps data set) to social graph discovery (definitely an option), we’ve written extensively here before about the impacts that social data could have on business, social and political policies in the future.
John Zogby, founder of polling firm Zogby International, spoke to us at length (in a separate phone interview several months ago) about the value of using online social networks to measure public opinion. “We’ve been particularly known for innovating and polling new technologies,” he said.
“83% of all households are online today and 92% of likely voters, so with online polling we are today about where the country was with telephone penetration when telephone surveys started. Social networking is not as representative as online access [in general] yet, but I’m comfortable with caveats: that you can do a random sampling, so long as you claim that’s what your universe is, as long as you don’t extrapolate to all Americans, etc. It has tremendous, tremendous value.
“I know that the landline era is coming to an end – not today or tomorrow but we’ve got to find new and different ways of doing our work. It’s the same kind of crossroads as the ’70s, when we moved away from the door-to-door and mail-in results to the landlines.
“Online, frankly just like telephone, doesn’t have the minority population, but for market surveys you may be looking for a different kind of consumer.
“We know that the landline phone is pushing us away; we know that we can’t use the cell phone in the same way; and we know that we’ve got to reinvent this industry [of measuring public opinion]. What’s happening are simultaneous new technologies and at the same time growing penetration of these new technologies. We’re riding a bucking bronco.”
Use Cases
The conversation metric data that InfoChimps is selling is the most exciting to me. Imagine a third-party app using historical social-conversation data to filter Twitter or other messages based on the strongest social connections that I or other people have. Imagine, for example, social Q&A service Aardvark combining the Twitter Lists API with this InfoChimps data set for a scenario like this: “You have a question about stock options? How would you like us to find a person who knows about that, is regularly conversed-with by people on Robert Scoble’s Twitter list of Venture Capitalists and is available right now?” That sounds pretty great to me.
The possible applications are many. “I see Twitter as a data acquisition device for what people talk about and how they relate to each other,” InfoChimps’ Kromer says.
Right now InfoChimps is selling the hashtag and link data set for $8,000 and the conversation metric data set for $9,500. Eventually the company will likely move to a subscription model.
How They Got the Data
How did InfoChimps get the data? The company hits the Twitter Developer API 20,000 times an hour (the standard for developers) but takes big swaths of data each time it does. “I have a priority queue,” Kromer told us.
“I can set a search term, and for each search term I can get 1500 tweets per API call. If I get 1500 tweets at a time, then the number of wasted tweets at the end of a series of searches is the smallest. If I’m searching for a term and get less than 1500 results back, then I forecast how long it will take to fill that number of results back up to the maximum and move it down the priority queue accordingly. On the lowest priority I have searches for RT or http. There will always be 1500 results for that. It’s only API calls that limit me. As is, it’s like a fisherman setting nets: what matters is that dinner is tasty.”
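The scheduling idea Kromer describes can be sketched roughly as follows. This is an illustration under stated assumptions, not InfoChimps’ code: the function names, the per-term rate estimate and the use of a filler search when nothing else is due are our own reading of the quote.

```python
import heapq

MAX_RESULTS = 1500  # tweets returned per search call, per Kromer's description

def seconds_until_full(results_returned, seconds_since_last_search):
    """Forecast how long a term needs to accumulate MAX_RESULTS new tweets.
    Assumes seconds_since_last_search is positive."""
    if results_returned >= MAX_RESULTS:
        return 0.0  # saturated: search again as soon as possible
    rate = results_returned / seconds_since_last_search  # tweets per second
    if rate == 0:
        return float("inf")  # no activity observed; never due on its own
    return MAX_RESULTS / rate

def reschedule(queue, term, now, results_returned, seconds_since_last_search):
    """Push the term back onto the queue, keyed by when it should be full again."""
    due = now + seconds_until_full(results_returned, seconds_since_last_search)
    heapq.heappush(queue, (due, term))

def next_search(queue, now, fallback="http"):
    """Pick the next term to search. If no term is due yet, spend the API
    call on a filler search that always returns a full page."""
    if queue and queue[0][0] <= now:
        return heapq.heappop(queue)[1]
    return fallback

# Hypothetical terms and numbers, for illustration only.
queue = []
reschedule(queue, "#nfl", now=0.0, results_returned=300, seconds_since_last_search=600.0)
reschedule(queue, "#followfriday", now=0.0, results_returned=750, seconds_since_last_search=600.0)
first = next_search(queue, now=100.0)
second = next_search(queue, now=1250.0)
third = next_search(queue, now=1300.0)
```

In this sketch a slow term waits until it is forecast to yield a full 1500 results, so few API calls come back half empty, and any spare calls go to the always-full searches such as “RT” or “http.”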
Does that sound so hard? Worth thousands of dollars? Here’s what Kromer says:
“It’s not magic. If you talk to people who use Hadoop and do social networking analysis, this is underwhelming. You take 30 million users, 1 billion links, adorn each link with info at the end of the link and accrue it with the person at the head of the link. That breaks conventional databases; the plumbing is hard. The math is easy but when you do it a billion times, it starts to get interesting. You have to be careful and clever. We plan to do stuff that is structural – a clustering coefficient, true PageRank.”
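To make the “adorn and accrue” step concrete, here is a toy, in-memory sketch in Python. It reflects our interpretation of the quote, with invented users and an invented per-user number (a follower count); which end of a link counts as the “head” is also an assumption. At the scale Kromer mentions, the same operation would be a distributed join followed by a group-by, which is presumably where the hard plumbing comes in.

```python
from collections import defaultdict

def accrue_to_head(links, user_info):
    """For each (head, tail) link, look up a number describing the tail
    user and add it to a running total for the head user."""
    totals = defaultdict(int)
    for head, tail in links:
        totals[head] += user_info.get(tail, 0)
    return dict(totals)

# Hypothetical data: who links to whom, and each user's follower count.
links = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
followers = {"alice": 10, "bob": 200, "carol": 3000}
reach = accrue_to_head(links, followers)
```

The lookup and the addition are trivial on three links. The difficulty Kromer points to is that neither the 30 million user records nor the 1 billion links fit comfortably in a single conventional database.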
Ultimately it’s about specialization and data as a service. “The people we need to come in and connect this info with human beings,” Kromer says, “aren’t the people who should be wasting their time on the math. And the guys who are good at doing these things should not be building Web apps.”
But Can They Get Away With It?
There’s some question whether Twitter will allow InfoChimps to sell data based on Twitter data. Kromer says he’d much rather resell the data on commission than have to do all the work he’s done to set up the extraction system. InfoChimps first caught the eye of people who love data a year ago, when it released a large collection of scraped Twitter data.
The InfoChimps blog post for that read: “Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.”
But then Twitter founder Evan Williams asked InfoChimps to take those data sets down until a Terms of Service for them could be figured out. That never happened, and communication between the two companies hasn’t progressed very far over the last year.
InfoChimps does not have Twitter’s permission to do what it did today, but Kromer says Twitter hasn’t contacted them either. No one from Twitter headquarters has responded to our request for comment yet.
“We talked to our lawyer about this a lot,” Kromer told us. “We are on absolutely solid ground with regards to copyright, user privacy and use of the API. This is clearly for the benefit of their community.”
It’s nice that Kromer feels so assured, but his attitude seems a little unrealistic.
We asked technology journalist Robert Scoble what he thought of the dilemma, and his opinion is pretty clear. “If Twitter wants to be a platform, they have to behave like a platform,” he said. “Don’t be king-makers. Let the marketplace choose the winners. If they are going to say nobody should study the data because we’re going to sell that, that’s not being a platform. Twitter tries to pick the winners and it pisses me off. They admit that they are king-makers. All that does is make everyone vote against them and hope a competitor comes around.”
Perhaps time will tell. But these are very early days in what looks to be an era of widespread innovation built on top of social data analysis.