Today's announcement at Defrag that Gnip will be the only commercial provider of Twitter's activity stream raises a lot of questions, so I sat down with Chris Hogue, Rob Johnson and Jud Valeski from the Gnip team to get some answers from a technical perspective.
The first thing I wanted to know was the nuts-and-bolts of accessing the stream using Gnip. One fundamental advantage that Gnip offers is that they have redundant firehose trunk streams coming into independent Amazon Availability Zones, which they're then able to syndicate across the internal network to every machine. External developers get access to this by renting a customized Gnip machine in this cluster, and writing a receiving application that listens to the streaming HTTP connection it's given from this master stream. That sounds a lot like the current way you access the firehose, but there are some key distinctions.
Monitoring. Gnip has a detailed set of reporting tools on the state of your processing pipeline, with chart-level overviews and more detailed tables you can drill down into. They also have a round-the-clock operations team who watch for underlying issues, alert you to them, and even try to solve any problems they spot by restarting your processes, with your permission of course!
Robustness. With redundant clusters spread across two data centers, they're able to offer a level of reliability that's hard for most independent companies to achieve.
Activity streams. All of the messages can be automatically converted to the Activity Streams standard and encoded in JSON, rather than using Twitter's proprietary structure. This allows you to write code that works across a lot of other sources, not just Twitter.
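To make the receiving side concrete, here is a minimal sketch of an application that reads a stream of activities and pulls out a few fields. It assumes the stream delivers one JSON object per line with blank lines as keep-alives, which is my assumption rather than anything Gnip confirmed, and the sample payload is made up. The `actor`, `verb` and `object` fields come from the Activity Streams JSON format; the function names are hypothetical.

```python
import io
import json
from typing import Any, Dict, IO, Iterator


def iter_activities(stream: IO[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of a byte stream."""
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            # Assumption: blank lines are keep-alives and carry no data.
            continue
        try:
            yield json.loads(line)
        except ValueError:
            # Skip a malformed line rather than kill a long-running reader.
            continue


def summarize(activity: Dict[str, Any]) -> str:
    """Build a one-line summary from the actor, verb and object fields."""
    actor = activity.get("actor", {}).get("displayName", "unknown")
    # Activity Streams treats a missing verb as "post".
    verb = activity.get("verb", "post")
    content = activity.get("object", {}).get("content", "")
    return f"{actor} {verb}: {content}"


# Made-up sample data standing in for the live connection.
SAMPLE = (
    b'{"verb": "post", "actor": {"displayName": "alice"}, '
    b'"object": {"content": "hello"}}\n'
    b"\n"
    b'{"verb": "share", "actor": {"displayName": "bob"}, '
    b'"object": {"content": "hi"}}\n'
)

if __name__ == "__main__":
    for activity in iter_activities(io.BytesIO(SAMPLE)):
        print(summarize(activity))
```

Because `iter_activities` only needs something it can iterate over line by line, the same loop should work on a real HTTP response object in place of the `BytesIO` stand-in. Since it reads only the generic fields, nothing in it is specific to Twitter.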
Going forward, Gnip plans to add a lot more features to help developers. There are many services they offer on their traditional feeds that they want to bring to the high-volume Twitter stream, like more advanced filtering, sentiment analysis, URL unwinding and sorting by a user's influence score. They're also very excited by the prospect of moving into storage, and offering the ability to access older messages, something that Twitter has never been able to do reliably.
So that's what external developers will experience, but I wanted to know more about what's going on under the hood. The key point Jud and the team wanted to get across is that dealing with the volume of data Twitter is throwing at them is a very hard engineering problem. As Chris put it, "You can't consume the firehose from home", and Jud asked me to imagine how much it would cost to upgrade a typical business connection to handle a reliable five megabits per second, with bursts of four times as much. Rob said that Gnip's business is solving "the shipping and handling problem for social streams". Bandwidth was the bottleneck, so how did they solve that?
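To put those figures in perspective, here is a quick back-of-the-envelope calculation. The only inputs are the two numbers Jud quoted; everything else is unit conversion, using decimal units (1 GB = 1000 MB).

```python
SUSTAINED_MBPS = 5   # "a reliable five megabits per second"
BURST_FACTOR = 4     # "bursts of four times as much"
SECONDS_PER_DAY = 24 * 60 * 60


def megabits_to_gigabytes(megabits: float) -> float:
    """Convert megabits to gigabytes (8 bits per byte, 1000 MB per GB)."""
    return megabits / 8 / 1000


# Peak rate the connection has to absorb during a burst.
burst_mbps = SUSTAINED_MBPS * BURST_FACTOR

# Data received in a day at the sustained rate alone.
daily_gb = megabits_to_gigabytes(SUSTAINED_MBPS * SECONDS_PER_DAY)

if __name__ == "__main__":
    print(f"burst rate: {burst_mbps} Mbit/s")
    print(f"sustained volume: {daily_gb} GB/day")
```

That works out to bursts of 20 megabits per second, and 54 gigabytes a day at the sustained rate before counting any bursts.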
Location turned out to be really important. When they experimented with running their cluster from a data center on the East Coast, it turned out to be a massive challenge to send the volumes of data required across the country on the public Internet. Instead, much like high-frequency trading software, they ended up using West Coast locations that were close to Twitter's own machines. They do have a few tricks up their sleeve that might open up East Coast processing though. In experiments they've been able to use OpenVPN to compress the stream down to a more manageable volume on-the-fly, a clever application of a technology that's usually only thought of as an encryption tool.
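Here is a rough sketch of why on-the-fly compression helps with this kind of data. It is not Gnip's setup: OpenVPN compresses at the tunnel level (its compression option has traditionally been LZO), while this uses Python's `zlib` as a stand-in, and the messages are synthetic. The point is only that JSON messages with repeated keys shrink a lot when one compressor keeps its state across the whole stream.

```python
import json
import zlib
from typing import Iterable, Iterator


def compress_stream(chunks: Iterable[bytes], level: int = 6) -> Iterator[bytes]:
    """Compress chunks with one shared compressor, flushing after each."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        out = compressor.compress(chunk)
        # Z_SYNC_FLUSH makes everything fed so far decodable by the
        # receiver without ending the compressed stream.
        out += compressor.flush(zlib.Z_SYNC_FLUSH)
        yield out
    # Final flush closes the stream and emits the trailing checksum.
    yield compressor.flush()


def decompress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Inverse of compress_stream, decoding pieces as they arrive."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data


# Synthetic messages standing in for the real stream.
messages = [
    json.dumps({
        "verb": "post",
        "actor": {"displayName": f"user{i}"},
        "object": {"content": f"message number {i}"},
    }).encode() + b"\n"
    for i in range(200)
]

packed = list(compress_stream(messages))
raw_size = sum(len(m) for m in messages)
packed_size = sum(len(p) for p in packed)
restored = b"".join(decompress_stream(packed))

if __name__ == "__main__":
    print(f"raw: {raw_size} bytes, compressed: {packed_size} bytes")
```

The flush after every message trades a little compression for latency, which matters for a live stream where the receiver can't wait for a buffer to fill.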
Chris gave a lot of credit to Java for allowing them to build robust, long-running processes. In particular, the curse of any program that needs to keep working for days or weeks is memory leakage or bloat, and Java has the tools to give the insights that you need to debug those sorts of problems. It also helps that several of the team have many years of experience working with the language and so have picked up a few tricks! In the event that something does go wrong, they have it wrapped in the Java Service Wrapper, which has tight integration with the process to spot problems and restart it if needed.
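The restart-on-failure idea can be shown with a much simpler stand-in. The sketch below is not the Java Service Wrapper and doesn't have its in-process monitoring; it only re-launches a child process when it exits with a non-zero status, up to a limit. The `supervise` function and its `max_restarts` parameter are hypothetical names of my own.

```python
import subprocess
import sys
from typing import List, Tuple


def supervise(cmd: List[str], max_restarts: int = 3) -> Tuple[int, int]:
    """Run cmd, restarting it after each non-zero exit.

    Returns (last exit code, number of restarts performed).
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        if code == 0 or restarts >= max_restarts:
            # Clean exit, or the restart budget is used up.
            return code, restarts
        restarts += 1


if __name__ == "__main__":
    # A child that always fails, to exercise the restart path.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_restarts=2))
```

A real supervisor also needs to catch processes that hang without exiting, which is the sort of problem tight integration with the process is there to spot.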
He also talked about a couple of frameworks that served them well. Netty is an asynchronous event-driven network IO framework that made it easy to handle thousands of connections without complicating their code. Since their whole pipeline relies on JSON, it was crucial they use a fast parser and Jackson has worked out well.
On the OS side, they're using a pretty vanilla CentOS distribution, but they did end up spending some time tweaking the OS's TCP settings to deal with the volume of the streams.
The big question every developer will ask is "Why did Twitter choose Gnip?" I've known the team for years, so I know how much blood, sweat and tears have gone into figuring out how to cope with mind-blowingly massive streams of data. They really do have a unique level of experience in processing this level of data, and I'm not surprised Twitter turned to them when they needed a solution. I hope this run-down sheds a bit of light on why what they're offering is so attractive to Twitter and why it's good news for commercial third-party developers too.