Last fall, Twitter announced a partnership with Gnip, making the latter company the only commercial provider of the Twitter activity stream. And although the “firehose” metaphor has been beaten to death, says Gnip CEO Jud Valeski, it still holds true.
Valeski spoke today at Gluecon about the challenges of handling the firehose – what it means to process high volume, real-time data streams and to be able to do so “in a consistent and predictable manner.”
Recent statistics demonstrate just how high a volume this Twitter data really is. Twitter is seeing around 155,000,000 tweets per day. At an average of about 2,500 bytes per tweet, that works out to an immense amount of data – roughly 35 megabits per second – which Twitter (and Gnip) must handle at a sustained rate.
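The arithmetic is easy to check. A quick back-of-the-envelope script (Python, purely for illustration) reproduces the figure from the numbers cited in the talk:

```python
# Back-of-the-envelope check of the firehose throughput figures:
# 155M tweets/day at ~2,500 bytes per tweet, per Valeski's talk.
tweets_per_day = 155_000_000
bytes_per_tweet = 2_500
seconds_per_day = 24 * 60 * 60   # 86,400

bytes_per_second = tweets_per_day * bytes_per_tweet / seconds_per_day
megabits_per_second = bytes_per_second * 8 / 1_000_000

print(f"{bytes_per_second / 1_000_000:.1f} MB/s")   # ~4.5 MB/s
print(f"{megabits_per_second:.1f} Mbit/s")          # ~35.9 Mbit/s
```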
Valeski also described how this big data stream doesn’t work with “the pipes we’re used to.” Rather than typical HTTP services built on standard, short-lived TCP connections, this sort of real-time big data streaming is a very different scenario – something akin to video streaming. The connections can no longer be brief and small; they are “full blast connections.” The processing dynamics are different as well: the synchronous request-and-response handling of a GET just doesn’t work.
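To make the contrast concrete, here is a minimal sketch of what consuming one of those long-lived connections looks like in Python, using the requests library. The endpoint URL and payload fields are hypothetical stand-ins, not Gnip’s or Twitter’s actual API:

```python
import json
import requests  # third-party HTTP library

# Hypothetical newline-delimited JSON endpoint, standing in for a
# firehose-style feed.
STREAM_URL = "https://stream.example.com/firehose.json"

# One long-lived connection: the server keeps writing and the client
# keeps reading, unlike a synchronous GET that returns once and closes.
with requests.get(STREAM_URL, stream=True, timeout=90) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:          # skip keep-alive newlines
            continue
        activity = json.loads(line)
        # Hand off to downstream processing as quickly as possible,
        # or the stream backs up behind you.
        print(activity.get("id"))
```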
Valeski says it takes “big guns and budget” to consume data at this volume – a challenge in itself, and a bigger one when you want to offer processing or filtering for customers who don’t have the big guns, the budget, or even the desire for the full volume.
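One way to picture that filtering layer: the provider sits between the full-volume stream and the customer, forwarding only the activities that match. A toy sketch in Python (the field names are assumptions, not Gnip’s actual schema):

```python
def keyword_filter(activities, keywords):
    """Yield only activities whose text mentions one of the keywords.

    A toy stand-in for the server-side filtering a provider can do so
    a customer never has to take -- or pay for -- the full stream.
    """
    wanted = {k.lower() for k in keywords}
    for activity in activities:
        if any(k in activity.get("text", "").lower() for k in wanted):
            yield activity

# Example: a customer who only wants tweets mentioning "gluecon"
sample = [{"text": "Live from Gluecon"}, {"text": "lunch"}]
print(list(keyword_filter(sample, ["gluecon"])))  # -> first activity only
```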
It’s an optimization challenge, says Valeski, pointing to the network infrastructure itself as something that isn’t really able to handle the volume.
But the challenge is also the dearth of tools to handle streaming big data. While a number of tools have been developed to handle static big data sets – tools like Hadoop, for example – “the equivalent for streaming data sets isn’t there,” says Valeski. “Everyone is building custom stuff right now,” he added, urging people to build the tools to deal with these kinds of streaming big data problems.