Big Data no longer looks like a Hadoop monopoly, but it’s not yet clear exactly what its future will be.
For years, the open-source, data storage-and-management framework Hadoop—and its associated data-processing tool MapReduce—were virtually synonymous with Big Data. Now, though, the would-be Big Data Scientist has a much larger array of software tools from which to choose, one of the most promising being Spark, which I covered recently.
Spark and other tools herald the arrival of an emerging trend towards “fast data,” which poses a lot of questions for how Big Data jobs get done. Historically, the approach has been to run large MapReduce jobs in batches on dedicated clusters with Apache YARN as the default cluster manager.
See also: Hadoop 2.0 And YARN—This Summer’s Big Data Breakthrough
Maybe there’s a better way.
Developers from Ebay, MapR and Mesosphere have collaborated to release Project Myriad, a framework that integrates YARN with Apache Mesos—another open-source cluster manager—to run Big Data workloads on the same clusters as other applications in the datacenter and the cloud. Today its developers submitted Myriad to the Apache Incubator, affirming their commitment to open-source collaboration.
I spoke with Adam Bordelon, distributed systems architect at Mesosphere, Apache Mesos Committer, and a key committer on Project Myriad, to learn more about the benefits of moving Big Data workloads out of standalone, dedicated YARN clusters and into a single shared pool of resources where YARN workloads run alongside the rest of your datacenter applications.
Minding Your Knitting
ReadWrite: Tell us a little bit about the origins of the project and why its committers saw a need to extend the capabilities of YARN.
Adam Bordelon: Apache Hadoop is the de facto standard for running big data workloads today, but the original MapReduce JobTracker could only scale to a few thousand nodes. To scale further, YARN took the resource-management component out of the JobTracker and moved it into its own separate process.
As Hadoop gains traction and becomes the home for the data lake, there is an increasing need to integrate Hadoop with other datacenter services, ideally co-locating the data in HDFS/HBase with the non-Hadoop services that need it.
But the typical Hadoop deployment model favors static partitioning of datacenter resources into separate Hadoop clusters, database clusters, web server clusters, etc. This practices under-utilizes the overall datacenter resources, and exhibits poor data sharing between Hadoop clusters and other applications in the same datacenter or cloud.
Last year Mohit Soni—an engineer at eBay—had the idea of using a Mesos cluster to elastically run YARN alongside other workloads. He was specifically interested in offloading traffic during peak hours, as well as solving for data replication challenges across different data silos, where different data sets were marooned on separate YARN clusters.
But he also had a broader vision of a comprehensive framework that combined YARN with Apache Mesos to finally (and cleanly) break Big Data workloads out of dedicated static clusters and allow YARN to coexist with non-Hadoop applications including long-running web services, streaming applications (e.g. Storm), continuous integration tools (e.g. Jenkins), HPC jobs (e.g. MPI), Docker containers, as well as custom scripts and applications.
Virtualizing Big Data Workloads
RW: Who is going to be most excited about the general availability of Myriad?
AB: It’s really the operations teams who will be most excited about Myriad (analytics teams typically are not as concerned with how to share their resources with other clusters). But the poor ops teams have been wondering why all these Hadoop data scientists get their own resources—because it adds a huge amount of complexity to manage multiple clusters within the datacenter, and the aggregate utilization rates are very poor when you have dedicated Hadoop clusters isolated from other workloads.
Myriad addresses two important goals for ops. One is improving cluster utilization—rather than the Hadoop cluster crunching numbers overnight and sitting relatively idle during peak web traffic hours, Myriad enables Mesos to dynamically share resources between the Hadoop cluster and Web servers and other applications on demand, even simultaneously co-locating Hadoop jobs on the same machines as other tasks, an approach that can easily double or triple utilization.
The other goal is easier administration. With statically partitioned clusters, if you wanted to add a new node to your Hadoop cluster, you’d have to execute a lot of manual procedures, decommissioning an underutilized server, then configuring it to become a Hadoop node. With Myriad, workloads can just expand into unused capacity when those resources are needed.
RW: As Big Data analytics become more real-time, what does the complexity look like on the back end?
AB: When the data became bigger than the compute, we started moving the compute to the data, rather than the other way around. This is the principle that MapReduce is based on.
But as the compute itself becomes faster, the demands of real-time or interactive analytics pushes us to reduce the overhead of scheduling and launching short-lived tasks. Mesos’ two-level scheduler model enables Mesos itself to be thin and fast in its scheduling decisions, while individual frameworks like Marathon or Spark (originally developed as an example Mesos framework) can choose their own scheduling policies, either spending a long time deciding the best place for a long-running service, or quickly placing a real-time task on the first available resources.
This approach is preferable to a monolithic scheduler that treats long-running jobs and interactive queries equally, forcing the same scheduling overhead on all workload classes.
YARN vs. Mesos
RW: There is a pretty spirited Quora thread where YARN and Mesos compare and contrasts are debated pretty heavily. What does it mean that Big Data adopters no longer have to choose?
AB: Yeah, as Jay Kreps says in that thread, YARN and Mesos have the same goal—to share a large cluster of machines between different frameworks.
The biggest difference is that YARN was designed to be Hadoop-specific and Mesos was designed to handle an infinite range of workload classes with custom per-framework schedulers.
What’s really exciting here is that organizations that are very committed to Hadoop and MapReduce now have a way to elastically expand their YARN cluster while at the same time taking advantage of Mesos’ ability to run any other kind of workload, including non-Hadoop applications like web servers, mobile backends, distributed databases and other types of common services.
And at the same time, the community that’s already running Mesos can now tap into the power of YARN on their unified Mesos cluster. Running Myriad requires no changes to YARN or Mesos source code, so it can be easily integrated into existing Mesos or Hadoop clusters.
RW: How does Myriad fit into some of the larger trends affecting the datacenter?
AB: Myriad is one of the clearest examples of how companies are starting to treat the datacenter as if it were just one big computer, where you can install new “killer apps” like YARN, Spark, Kafka, Cassandra and HDFS with a single command and run all of these services multitenant on the same cluster, while isolating them from each others’ resources using Linux containers.
Much of the work we’re doing at Mesosphere, building an operating system for the datacenter, the Mesosphere DCOS, is based on the belief in this trend.
Myriad will bring YARN to the same level as all of these easy-to-install services—where YARN is just another framework that runs reliably and efficiently alongside other common services on a datacenter-scale distributed operating system,
Image by Quinn Dombrowski