Struggling with even the most cutting-edge big-data tools, developers at Facebook have announced a new open-source tool set, Corona, designed to enable data-processing on the incredible scale required by the massive social-media network.
Admittedly, there aren't a lot of companies with big data this big, but Corona should have spill-over benefits for the entire Hadoop-using big data community. Increasing Hadoop's overall efficiency should make big data even more accessible and faster to use, enabling big data applications to disrupt even more industries.
Conventional databases that reach into the terabytes are difficult and expensive to work with. It's not just the size of databases that cause problems, either. Growth - of data itself, systems and queries - can waylay the most expertly managed database. It's hard to keep all of the data tables and relationships intact across the new machines you have to keep adding to store all that data.
Really, Really Big Data
Keep that in mind, and then ponder Facebook's problem: Its largest collection of data is 100 petabytes. That's 1,048,576,000 gigabytes. Think you're pretty bad ass with your 3TB hard drive? This is the equivalent to over 34,000 such drives. And that's just Facebook's largest collection of data, not all of it.
And growth? According to Facebook engineers, they're getting about half-a-petabyte coming in from the social network every 24 hours - 512 terabytes every single day.
This is the kind of data load that would make all but the most expensive database software and hardware on the planet go pfft! with a tiny puff of smoke and a barely audible scream of pain from the server room.
Tools like Hadoop have been incredibly useful in collecting and working with this kind of massive scale. But in a blog post today from Facebook engineers Avery Ching, Ravi Murthy, Dmytro Molkov, Ramkumar Vadali and Paul Yang, even Hadoop is straining under the load. That 100-petabyte data cluster, for instance? It's the largest known Hadoop cluster in the world.
The Limits Of Hadoop
Hadoop isn't really a database in the usual sense. It's a two-part data processing system.
Part one is storage and tracking data across multiple machines. Part two is the MapReduce system that takes a query about the data, figures out all the places in the cluster where the data is sitting, and then divides the query up to be processed on the same machines.
Once processed, the answers to the query are combined and delivered. No one machine has to do all the heavy lifting, and adding a new machine to a data cluster adds storage and processing power.
But in their blog today, the Facebook engineers highlight a limitation in Hadoop that has plagued Facebook and many large-scale Hadoop users. All of that MapReduce work described above has to be done in batches, which in turn have to be scheduled so that machines in a given cluster can do the work required without overwhelming the resources of the cluster.
By default, Hadoop uses a fair scheduler, which basically means first come, first served. For Facebook, the limitations of fair scheduling are very real. In early 2011, the team noted, the job trackers that schedule queries and manage cluster resources were having trouble keeping up.
Imagine a vast supermarket full of all sorts of stuff. At the entrance to the market is a collection of grocery carts, all various sizes. To enter the market, you have to wait in line to get a cart. The size of the cart you get depends on how much is on your grocery list. Just need milk and bread? You get a tiny cart. Stocking up for Thanksgiving? You get the jumbo cart.
Here's the other catch: Only a limited number of shoppers are allowed in the supermarket at a time. So, everyone has to wait while a few shoppers have the run of the store and check out at the (single) register.
That's a rough idea of how Hadoop works now. Frustrated with this system, Facebook's team is opting to change this model with a new scheduling framework, known as Corona.
The Rise Of Corona
Corona divides the job tracker's responsibilities in two. First, a new manager manages cluster resources and keeps an eye on what's available in that cluster. In our grocery-store analogy, Corona lets more shoppers into the store and opens more checkout lanes.
At the same time, Corona creates a dedicated job tracker for each job, which means the job tracker no onger has to be tied to the cluster. With Corona, smaller jobs can be processed right on the requester's own machine. Continuing the shopping example, it's like putting an automat outside the store that will give you anything in the store, but a few items at a time.
Facebook has been using Corona for some time, according to the blog, and is still working to improve it, as well as to apply its improved scheduling features to other types of applications. More important, the team revealed that their efforts have been open-sourced, which means other developers can take this work and apply it to their own Hadoop systems.
Improved job scheduling in Hadoop is a pretty big deal, if only because any Hadoop system can be improved with this kind of functionality. It's something the Apache Hadoop developer team has already been working on in their new version of MapReduce 2.0, with the introduction of the YARN ResourceManager. But at the scales Facebook is working in now, the engineering team could not wait for YARN.
Given that Corona is also open source, there's a good possibility that work on MapReduce 2.0 will benefit from Facebook's efforts, improving everyone's Hadoop experience. That's something anyone in big data can Like.
Image courtesy of Facebook.