At the time, Twitter said that it would open source Storm in September at the Strange Loop conference in St. Louis. Guess what? They did. As of this week, Storm is on GitHub under the Eclipse Public License (EPL).
What is Storm?
So what is Storm anyway? Nathan Marz (formerly with BackType, and the guy maintaining Storm on GitHub) described it like so: "Storm is a distributed, reliable, and fault-tolerant stream processing system. Its use cases are so broad that we consider it to be a fundamental new primitive for data processing."
The use cases for Storm include stream processing (like processing tweets), continuous computation, and distributed real-time processing. In Twitter's case, Storm is being used to do things like compute trending Twitter users and figure out the "reach" of a tweet, reach being the number of unique people who would see it. Says Marz, "To compute reach, you need to get all the people who tweeted the URL, get all the followers of all those people, unique that set of followers, and then count the number of uniques. It's an intense computation that potentially involves thousands of database calls and tens of millions of follower records."
That requires, well, a lot of computing, but Storm makes it much simpler. Marz says, "It can take minutes or worse to compute on a single machine. With Storm, you can do every step of the reach computation in parallel and compute reach for any URL in seconds (and less than a second for most URLs)."
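To make the reach computation concrete, here is a minimal Python sketch of the four steps Marz describes: find who tweeted a URL, fetch each person's followers, de-duplicate, and count. It is not Storm code and does not use Storm's API. The lookup tables, the function names (`get_tweeters`, `get_followers`, `compute_reach`) and the sample data are all hypothetical, and a local thread pool stands in for the parallelism that Storm would spread across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the databases a real system would query.
TWEETERS_BY_URL = {
    "http://example.com/story": ["alice", "bob"],
}
FOLLOWERS_BY_USER = {
    "alice": ["carol", "dave", "erin"],
    "bob": ["dave", "erin", "frank"],
}


def get_tweeters(url):
    """Step 1: everyone who tweeted the URL (assumed lookup)."""
    return TWEETERS_BY_URL.get(url, [])


def get_followers(user):
    """Step 2: all followers of one user (assumed lookup)."""
    return FOLLOWERS_BY_USER.get(user, [])


def compute_reach(url, max_workers=4):
    """Count the unique followers of everyone who tweeted the URL."""
    tweeters = get_tweeters(url)

    # The follower lookups are independent of each other, so they can run
    # in parallel. Here that means local threads, not a distributed cluster.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        follower_lists = list(pool.map(get_followers, tweeters))

    # Step 3: "unique" the followers by collecting them into a set.
    unique_followers = set()
    for followers in follower_lists:
        unique_followers.update(followers)

    # Step 4: count the uniques.
    return len(unique_followers)


if __name__ == "__main__":
    print(compute_reach("http://example.com/story"))  # prints 4
```

In the sample data, dave and erin follow both alice and bob, so they are counted once each and the reach comes out to 4 rather than 6. At Twitter's scale the follower lookups are the "thousands of database calls" Marz mentions, which is why spreading them across many machines pays off.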
Storm is meant to handle real-time computation, scale out, guarantee no data loss, and be programming-language agnostic.
Want to catch up on Storm? Check out Marz's presentation from Strange Loop and the rationale page. We think it's pretty interesting, and very much worth keeping an eye on.