You can’t have a conversation about Big Data for very long without running into the elephant in the room: Hadoop. This open source software platform managed by the Apache Software Foundation has proven to be very helpful in storing and managing vast amounts of data cheaply and efficiently.
But what exactly is Hadoop, and what makes it so special? Basically, it’s a way of storing enormous data sets across distributed clusters of servers and then running “distributed” analysis applications in each cluster.
It’s designed to be robust, in that your Big Data applications will continue to run even when individual servers — or clusters — fail. And it’s also designed to be efficient, because it doesn’t require your applications to shuttle huge volumes of data across your network.
Here’s how Apache formally describes it:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Look deeper, though, and there’s even more magic at work. Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool. That makes the architecture incredibly flexible, as well as robust and efficient.
Hadoop Distributed Filesystem (HDFS)
If you remember nothing else about Hadoop, keep this in mind: It has two main parts – a data processing framework and a distributed filesystem for data storage. There’s more to it than that, of course, but those two components really make things go.
The distributed filesystem is that far-flung array of storage clusters noted above – i.e., the Hadoop component that holds the actual data. By default, Hadoop uses the cleverly named Hadoop Distributed File System (HDFS), although it can use other file systems as well.
HDFS is like the bucket of the Hadoop system: You dump in your data and it sits there all nice and cozy until you want to do something with it, whether that’s running an analysis on it within Hadoop or capturing and exporting a set of data to another tool and performing the analysis there.
Data Processing Framework & MapReduce
The data processing framework is the tool used to work with the data itself. By default, this is the Java-based system known as MapReduce. You hear more about MapReduce than the HDFS side of Hadoop for two reasons:
- It’s the tool that actually gets data processed.
- It tends to drive people slightly crazy when they work with it.
In a “normal” relational database, data is found and analyzed using queries, based on the industry-standard Structured Query Language (SQL). Non-relational databases use queries, too; they’re just not constrained to use only SQL, but can use other query languages to pull information out of data stores. Hence, the term NoSQL.
But Hadoop is not really a database: It stores data and you can pull data out of it, but there are no queries involved – SQL or otherwise. Hadoop is more of a data warehousing system – so it needs a system like MapReduce to actually process the data.
MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed. Using MapReduce instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of complexity.
There are tools to make this easier: Hadoop includes Hive, another Apache application that helps convert query language into MapReduce jobs, for instance. But MapReduce’s complexity and its limitation to one-job-at-a-time batch processing tends to result in Hadoop getting used more often as a data warehousing than as a data analysis tool.
(See also Hadoop Adoption Accelerates, But Not For Data Analytics.)
Scattered Across The Cluster
There is another element of Hadoop that makes it unique: All of the functions described act as distributed systems, not the more typical centralized systems seen in traditional databases.
In a database that uses multiple machines, the work tends to be divided out: all of the data sits on one or more machines, and all of the data processing software is housed on another server (or set of servers).
On a Hadoop cluster, the data within HDFS and the MapReduce system are housed on every machine in the cluster. This has two benefits: it adds redundancy to the system in case one machine in the cluster goes down, and it brings the data processing software into the same machines where data is stored, which speeds information retrieval.
Like we said: Robust and efficient.
When a request for information comes in, MapReduce uses two components, a JobTracker that sits on the Hadoop master node, and TaskTrackers that sit out on each node within the Hadoop network.
The process is fairly linear: The Map part is accomplished by the JobTracker dividing computing jobs up into defined pieces and shifting those jobs out to the TaskTrackers on the machines out on the cluster where the needed data is stored. Once the job is run, the correct subset of data is Reduced back to the central node of the Hadoop cluster, combined with all the other datasets found on all of the cluster’s machines.
HDFS is distributed in a similar fashion. A single NameNode tracks where data is housed in the cluster of servers, known as DataNodes. Data is stored in data blocks on the DataNodes. HDFS replicates those data blocks, usually 128MB in size, and distributes them so they are replicated within multiple nodes across the cluster.
This distribution style gives Hadoop another big advantage: Since data and processing live on the same servers in the cluster, every time you add a new machine to the cluster, your system gains the space of the hard drive and the power of the new processor.
Kit Your Hadoop
As mentioned earlier, users of Hadoop don’t have to stick with just HDFS or MapReduce. For its Elastic Compute Cloud solutions, Amazon Web Services has adapted its own S3 filesystem for Hadoop. DataStax’ Brisk is a Hadoop distribution that replaces HDFS with Apache Cassandra’s CassandraFS.
To get around MapReduce’s first-in-first-out limitations, the Cascading framework gives developers an easier tool in which to run jobs and more flexibility to schedule jobs.
Hadoop is not always a complete, out-of-the-box solution for every Big Data task. MapReduce, as noted, is enough of a pressure point that many Hadoop users prefer to use the framework only for its capability to store lots of data fast and cheap.
But Hadoop is still the best, most widely used system for managing large amounts of data quickly when you don’t have the time or the money to store it in a relational database. That’s why Hadoop is likely to remain the elephant in the Big Data room for some time to come.
(See also ReadWrite’s Hadoop coverage.)