Lately, it seems like Hadoop, the open source data platform so integral to the rise of Big Data, can’t catch a break. Instead of relying on Hadoop as a key Big Data storage and analysis tool, vendors and observers are increasingly positioning it as “just” a storage tool.
But this isn’t necessarily a bad thing. Using Hadoop for cheap and efficient storage is a perfect starting point for the next phase of Hadoop’s evolution. With this summer’s expected debut of Hadoop 2.0, changes are afoot that will make information found within data warehouses and unstructured “data lakes” more accessible than ever.
Hadoop As A Big Bucket
Hadoop has been a great system for storing data since its initial adoption as a Big Data tool. But MapReduce, its data-processing framework, has a steep learning curve: reaching into stored data and pulling out the information you need means writing custom Java applications.
(See also Hadoop: What It Is And How It Works.)
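To see why, consider what even a trivial analysis looks like in MapReduce. Below is a minimal sketch of the classic word-count job against Hadoop’s Java MapReduce API; nothing here is exotic, but counting words already demands a mapper class, a reducer class and driver boilerplate to wire them together.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in every input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures the batch job and blocks until it finishes.
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Every new question you want to ask of the data means another round of this kind of Java plumbing, compiled, packaged into a jar and submitted as a batch job.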
There are other ways to get information out of Hadoop, of course. The HBase database runs on top of Hadoop’s storage layer, letting users work with their data in a database paradigm. And the Hive data warehouse system lets you build queries in the SQL-like HiveQL language, which Hive converts into MapReduce jobs behind the scenes. But Hadoop is still limited by the fact that everything you do with it has to be done one thing at a time: MapReduce jobs, Hive queries and HBase operations all have to take turns.
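To give a sense of how much friction Hive removes, here is a sketch of running a HiveQL query from Java over JDBC. It assumes a HiveServer2 endpoint on localhost and a hypothetical weblogs table; Hive compiles the SQL-like SELECT into MapReduce jobs behind the scenes, and those jobs wait in line on the cluster like any others.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopPages {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (hive-jdbc jar on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Endpoint and table name are assumptions for this sketch.
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // Hive translates this query into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM weblogs "
        + "GROUP BY page ORDER BY hits DESC LIMIT 10");

    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    conn.close();
  }
}
```

One line of HiveQL replaces the mapper/reducer/driver plumbing shown earlier, but the underlying jobs still run in the same one-at-a-time batch world.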
This is why a lot of vendors tend to frame Hadoop as the bucket in which data is stored, and cast their own products as the magical tools that pull out and analyze that data. The bucket metaphor is apt, though Hadoop users have super-sized it: large enough stores of Hadoop data are now known as data lakes or even data oceans. Given the perceived limitations of Hadoop in its present state, it’s not a hard sell to make.
But as the Hadoop development community starts ramping up for the next iteration of Hadoop, those limitations are about to be greatly reduced.
Knitting A YARN Solution
For Arun Murthy, the release manager for Hadoop 2.0, the most important change will be upgrading the MapReduce framework to Apache YARN, which will expand both what software can run inside Hadoop and how much of it can run at once. Murthy, who is also the YARN project lead and a co-founder of Hortonworks, explained that “In Hadoop 1.0, everything was batch-oriented. In 2.0, you will now have multiple apps hitting the data inside all at once.”
What YARN does, essentially, is divide the functionality of MapReduce even further, breaking the two major responsibilities of the MapReduce JobTracker component – resource management and job scheduling/monitoring – into separate daemons: a global ResourceManager and a per-application ApplicationMaster.
Splitting up these functions gives Hadoop a more powerful way to manage a cluster’s resources than the current MapReduce framework can offer. YARN allocates resources much the way an operating system schedules jobs, which means no more one-at-a-time limitations.
With YARN, developers will be able to build apps directly within Hadoop, instead of bolting them on from the outside, as many third-party vendor tools have to do now.
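To give a flavor of what building directly within Hadoop looks like, here is a rough sketch of submitting an application through the YARN client API, using class names from the emerging Hadoop 2.x line; the application name and ApplicationMaster launch command are hypothetical stand-ins. The client asks the global ResourceManager to launch a container running the app’s own ApplicationMaster, which then negotiates any further resources the app needs.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's global ResourceManager.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application slot.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("my-yarn-app"); // hypothetical name

    // Command that launches this app's own ApplicationMaster
    // (a hypothetical class; real apps also set up classpath and env).
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java com.example.MyApplicationMaster"));

    // Resources requested for the ApplicationMaster's container.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);     // MB
    capability.setVirtualCores(1);

    context.setAMContainerSpec(amContainer);
    context.setResource(capability);

    ApplicationId appId = yarnClient.submitApplication(context);
    System.out.println("Submitted application " + appId);
  }
}
```

Because each application brings its own ApplicationMaster, MapReduce becomes just one framework among many sharing the cluster, rather than the gatekeeper for all of it.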
Murthy reported that the Apache Hadoop community is already seeing keen interest from vendors who want to build apps that live directly inside Hadoop, with their resources managed by YARN.
Because the Apache Hadoop community is driving development of the new version, there is no set timeline for Hadoop’s progress this summer. Murthy predicted that a “strong beta” of Hadoop 2.0 might be available in the June or July timeframe, with a final release perhaps ready by August.
If YARN lives up to its promise, a lot of those data lakes and oceans will suddenly become far more accessible from within the native Hadoop platform, greatly streamlining and speeding up the task of finding useful information. Big Data is about to get even more useful.