
Gnip CEO on the Challenges of Handling the Real-Time, Big Data Firehose

Last fall, Twitter announced a partnership with Gnip, making the latter company the only commercial provider of the Twitter activity stream. And although the “firehose” metaphor has been beaten to death, says Gnip CEO Jud Valeski, it still holds true.

Valeski spoke today at Gluecon about the challenges of handling the firehose – what it means to process high volume, real-time data streams and to be able to do so “in a consistent and predictable manner.”

Recent statistics demonstrate just how high a volume this Twitter data really is. Twitter is seeing around 155,000,000 tweets per day. At an average of roughly 2,500 bytes per tweet, that works out to about 35 megabits (around 4.5 megabytes) per second, and Twitter (and Gnip) must handle that rate sustained, around the clock.
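For the curious, the arithmetic is straightforward. Here's a quick back-of-the-envelope sketch in Python, using only the figures cited above:

```python
# Back-of-the-envelope math on the sustained firehose rate,
# from the figures above: 155M tweets/day at ~2,500 bytes each.
tweets_per_day = 155_000_000
bytes_per_tweet = 2_500
seconds_per_day = 24 * 60 * 60  # 86,400

bytes_per_second = tweets_per_day * bytes_per_tweet / seconds_per_day
print(f"{bytes_per_second / 1_000_000:.1f} MB/s")        # ~4.5 MB/s
print(f"{bytes_per_second * 8 / 1_000_000:.1f} Mbit/s")  # ~35.9 Mbit/s
```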

Valeski spoke today about how this big data stream doesn’t work with “the pipes we’re used to.” Rather than typical HTTP services built on standard, short-lived TCP connections, this sort of real-time big data streaming is a very different scenario, something akin to video streaming. The connections can no longer be transient and small; they are “full blast connections.” The processing dynamics differ as well: the synchronous GET request-and-response model just doesn’t work.
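To make the contrast concrete, here is a minimal sketch of what consuming a long-lived stream looks like in Python. The endpoint URL and the handler are hypothetical placeholders, not Gnip’s actual API; a real consumer would also authenticate and would need to process data fast enough to keep up with the stream.

```python
import requests

# Hypothetical streaming endpoint; a stand-in, not Gnip's real API.
STREAM_URL = "https://stream.example.com/firehose"

def handle_activity(raw: bytes) -> None:
    # Placeholder: a real consumer would parse the payload and hand it
    # off to a queue quickly enough to avoid falling behind the stream.
    print(raw[:80])

# Unlike a one-shot synchronous GET, the connection stays open
# indefinitely; data arrives as a continuous sequence of
# newline-delimited activities rather than a single response body.
with requests.get(STREAM_URL, stream=True, timeout=90) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # blank keep-alive lines arrive when traffic is quiet
            handle_activity(line)
```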

Valeski says that it takes “big guns and budget” to consume data at this volume. That is a challenge in itself, but also when you want to offer processing or filtering for customers who don’t have the big guns, the budget, or even the appetite for the full volume.

It’s an optimization challenge, says Valeski, pointing to the network infrastructure itself as something that isn’t really able to handle the volume.

But the challenge is also the dearth of tools to handle streaming big data. While a number of tools have been developed to handle static big data sets – tools like Hadoop, for example – “the equivalent for streaming data sets isn’t there,” says Valeski. “Everyone is building custom stuff right now,” he added, urging people to build the tools to deal with these kinds of streaming big data problems.
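To illustrate what that “custom stuff” tends to look like, here is a minimal sketch of a hand-rolled stream filter of the kind a consumer might build for customers who only want a slice of the firehose. The field names are illustrative assumptions, not a real activity schema:

```python
import json

def filter_stream(lines, keywords):
    """Yield only the activities whose text matches one of the keywords.

    `lines` is any iterable of raw newline-delimited records, e.g. the
    output of the streaming consumer sketched earlier.
    """
    for raw in lines:
        try:
            activity = json.loads(raw)
        except ValueError:
            continue  # skip keep-alives and malformed records
        text = activity.get("text", "").lower()
        if any(kw in text for kw in keywords):
            yield activity

# Usage: wire the generator onto whatever transport feeds raw lines.
sample = [b'{"text": "Big data talk at Gluecon"}', b'{"text": "lunch"}']
for match in filter_stream(sample, ["gluecon"]):
    print(match["text"])
```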
