A Twitter Storm Arrives: Storm Project Open Sourced

In August, Twitter acquired BackType, a social media analytics company. One of the things that Twitter picked up in that acquisition was Storm, the “Hadoop of realtime processing.

At the time, Twitter said that it would open source Storm in September at the Strange Loop conference in St. Louis. Guess what? They did. As of this week, Storm is on GitHub under the Eclipse Public License (EPL).

What is Storm?

“Storm is a distributed, reliable, and fault-tolerant stream processing system. Its use cases are so broad that we consider it to be a fundamental new primitive for data processing.”

Other than Marz’s announcement at Strange Loop and his

announcement on Twitter

, the official open sourcing of Storm hasn’t gotten much attention. Which is a shame, because Storm looks to be a solution that many developers are going to want to check out. According to Storm’s documentation, data processing has seen a revolution in the past decade, but systems like Hadoop are not real-time. Services like Twitter need to process data in real time.

So what is Storm anyway? Nathan Marz (formerly with BackType, and the guy maintaining it on GitHub), described it like so: “Storm is a distributed, reliable, and fault-tolerant stream processing system. Its use cases are so broad that we consider it to be a fundamental new primitive for data processing.”

The use cases for Storm include stream processing (like processing Tweets), continuous computation and distributed real-time processing. In Twitter’s case, Storm is being used to do things like compute trending Twitter users and figure out the “reach” of a tweet; reach being the unique number of people that would see a tweet. Says Marz, “To compute reach, you need to get all the people who tweeted the URL, get all the followers of all those people, unique that set of followers, and then count the number of uniques. It’s an intense computation that potentially involves thousands of database calls and tens of millions of follower records.”

That requires, well, a lot of computing, but is made much simpler by Storm. Marz says “It can take minutes or worse to compute on a single machine. With Storm, you can do every step of the reach computation in parallel and compute reach for any URL in seconds (and less than a second for most URLs).”

Storm: distributed and fault-tolerant realtime computation

presentations

from

nathanmarz

Storm is meant to deal with real-time computation, be scalable, guarantee no data loss, and be programming language agnostic.

Want to catch up on Storm? Check out Marz’s presentation from Strange Loop and the rationale page. We think it’s pretty interesting, and very much worth keeping an eye on.