Imagine Hadoop clusters whose locales transcend both geographies and clouds, and whose contents can be addressed the same way as any other file. It could help bridge the current gap between big data clusters and regulated, relational databases. It could be the killer combination of server technologies: unified object storage with sharded, distributed big data.
Red Hat is planning such a move, as part of its ongoing beta of what’s now called Red Hat Storage 2.0 (RHS 2). The company’s Tom Trainer, a veteran of the storage industry, spoke with ReadWriteWeb about this latest unreported revolution.
Making Hadoop One Less Silo
“It’s more than compatibility; it’s a new and innovative way to access machine-generated data, where it’s been lumped and siloed away in multiple HDFS silos,” Trainer said. He’s referring to the Hadoop Distributed File System (HDFS), the big data architecture’s fault-tolerant, distributed file system. “Now there’s a new door, a new way of looking in and shed light on those files, and move them around the enterprise as objects very quickly... As we see it, Storage 2.0 enables storing both HDFS files and now NFS and CIFS as well, and then also object storage capability.”
RHS 2, he continues, “will be able to take Hadoop files out as groups of files, [but] as objects, and export them to other environments to exploit the data within those files in new and creative ways. Information accessibility in the Hadoop environment is now broadened with Red Hat Storage 2.0.”
In the Hadoop architecture, the NameNode is the server responsible for managing the names, metadata and block locations of all the files in an HDFS cluster, wherever those blocks may reside. Its architecture is actually fairly simple, and from Trainer’s point of view, a little too fundamental. While it can map each block of a file to multiple locations, thereby enabling very simple and even robust data duplication, the file system it’s based on is rather basic, borrowing perhaps too much from the older world of file storage, when names like Novell ruled.
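To make the NameNode's role concrete, here is a minimal sketch of a central map from file names to replica locations. It is a toy model, not Hadoop's implementation: the class `ToyNameNode`, its methods, the round-robin placement and the node names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical, simplified stand-in for a NameNode's metadata map."""

    replication: int = 3
    # (file path, block index) -> list of data nodes holding a replica
    block_map: dict = field(default_factory=dict)

    def add_file(self, path, num_blocks, data_nodes):
        """Record where each block of a file is stored (round-robin, illustrative only)."""
        copies = min(self.replication, len(data_nodes))
        for i in range(num_blocks):
            replicas = [data_nodes[(i + r) % len(data_nodes)] for r in range(copies)]
            self.block_map[(path, i)] = replicas

    def locate(self, path, block_index):
        """Every read starts with a lookup on this single server."""
        return self.block_map[(path, block_index)]


name_node = ToyNameNode()
name_node.add_file("/logs/web.log", num_blocks=2, data_nodes=["dn1", "dn2", "dn3", "dn4"])
print(name_node.locate("/logs/web.log", 0))
```

Because every lookup passes through that one map, the server holding it is both simple to reason about and a natural bottleneck, which is the trade-off Trainer is pointing at.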
In RHS 2.0, Red Hat’s engineers have come up with a way, Trainer explains, for the object storage mechanism (which he still calls GlusterFS) to either coexist with HDFS or replace it altogether. The latter, he says, may be preferable: “That completely eliminates the name node in the architecture of the file system... and thereby changes the overall performance characteristic of the Hadoop environment, and also changes the information accessibility characteristic of the Hadoop environment.”
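Trainer did not describe how files are located once the name node is gone. One common way for a distributed file system to do without a central metadata server is to compute a file's location from a hash of its path, so that any client can work out the answer on its own. The sketch below illustrates that general idea only; it is not GlusterFS's actual algorithm, and the function `place`, the server names and the replica count are assumptions made for the example.

```python
import hashlib


def place(path, servers, replicas=2):
    """Pick storage servers for a file from a hash of its path.

    Generic illustration of lookup-free placement; not GlusterFS's real scheme.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(servers)
    count = min(replicas, len(servers))
    return [servers[(start + r) % len(servers)] for r in range(count)]


servers = ["server-a", "server-b", "server-c", "server-d"]
# Any client computing this gets the same answer, with no central lookup.
print(place("/logs/web.log", servers))
```

The design choice is the opposite of the NameNode's: nothing has to be asked where a file lives, at the cost of needing a strategy for rebalancing data when servers are added or removed.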
The Competing View of Unified Storage
EMC also uses the phrase “unified storage” to refer to its architecture. In January 2011, that company unveiled its VNX system, with the intent of letting customers merge storage area networks and network-attached storage systems into one pool - all of it bearing the EMC brand, of course. EMC probably didn’t expect to find itself competing, in about a year’s time, with a pure-play software company whose private cloud strategy is built around existing, prevalent, commodity hardware.
Trainer argues that in the EMC system, storage components may share the same pool, but they remain segregated. “We find many IT organizations have storage farms, if you will. Today, you have storage environments that may have been selected by upper management based on business relationships, business requirements, price or some unique feature that the storage hardware vendor had in the past. When they look at scale-out NAS requirements, and then look at what's available on the market, they primarily had a choice between specific storage hardware vendors - and there were pretty high costs associated with that. Today, they’re able to deploy lower-cost commodity storage and servers as scale-out, turnkey NAS, or they can redeploy some of their already-existing servers as storage assets - that’s a money-saver in itself.”
Red Hat is still accepting applications from companies interested in joining its managed RHS 2 beta program. As Trainer described it, these would be organizations willing to set up a cache of existing hardware in a nonproduction environment. Although some customers are tempted to try the beta in a production environment - for instance, storing multiple unstructured files, such as videos - Red Hat advises against this.
“At Red Hat, we have a Storage Compatibility List, and that’s typically for production-level products. In our beta process, we have used it as a guide for our beta customers to indicate the kinds of server and storage environments we’re looking to beta test on,” Trainer said. While some beta testers have kept to the list, other customers have presented brands of equipment that fall outside it, some of which meet Red Hat’s requirements for the test.
A New Security Risk?
In recent days, RWW has heard from analysts and experts who suggest that merging a new object storage model with a relatively new data model could create a potential hazard, the risks of which security companies have yet to fully fathom.
We put that notion to Red Hat’s Tom Trainer. “Having previously worked for a three-letter, monolithic storage company,” he responded, “it’s very easy to throw FUD out there and say, ‘Oh my gosh, this is new and untested and unproven, and there are security concerns and potential holes!’ That phrase holds true for every new and innovative technology that’s ever been released.”
If the fountain of all network security spouts forth from security companies reacting to exploits, then possibly every innovation is a security risk, at least for a period of time. This is one reason beta testing was invented in the first place. Red Hat has a good history with managed beta programs, and Trainer says his company is working with testers to redefine firewall boundaries and redirect workflows so that security may be innovated along with data access.