Hortonworks, the commercial caretaker of Apache Hadoop, is inviting a select group of developers to join a limited technology preview of the company's forthcoming cloud data platform built on Hadoop. If you haven't heard of Hadoop yet, this is either your first time on ReadWriteWeb, or you've been living in a desert with no elephants. It's the distributed storage and processing framework born from a Yahoo project, and now Hortonworks wants businesses to be able to use it as a platform without having to install it in their own data centers.
Hortonworks Data Platform, the company's CEO tells RWW, will enter a public preview phase later this year. At that point, availability and ease of deployment will no longer be adequate excuses for businesses reluctant to move their big data to a scalable management platform.
The platform, called Hortonworks Data Platform (HDP), will not make the mistake of offering only a bare-bones service on an a la carte business model. The company intends to make third-party support and services available on the platform as well, including the tools, examples, and help customers may need to get on board.
"We're a long play. Hortonworks as a company, and all of us as individuals, believe that for Hadoop to be everything it can be, we have to build a broadly accepted, horizontal platform for processing and storing data," says Eric Baldeschwieler in an interview with RWW. "And for that to succeed, there needs to be a vibrant ecosystem of services, software, and hardware to integrate into that standard platform."
A new cloud, featuring an elephant, a bee, and a pig
HCatalog will serve as the portal into Hadoop, if you will. It brings customers onto the platform by letting them store data in a more conventional, table-like representation, based on the work of the Apache Hive project. More importantly, as Baldeschwieler tells us, this implementation will let that data be shared among Hive (which offers a SQL-like query language), Pig (which provides a high-level dataflow language for data analysis), and MapReduce (which enables parallel processing across very large clusters).
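Underneath Hive's queries and Pig's dataflows sits the MapReduce model Baldeschwieler refers to. As a rough illustration only (toy Python, not actual Hadoop APIs), the model boils down to a map step that emits key-value pairs, a shuffle step that groups values by key, and a reduce step that collapses each group:

```python
# Toy sketch of the MapReduce model -- on a real cluster, Hadoop runs
# each phase in parallel across many machines; here it runs in-process.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as Hadoop's shuffle/sort stage does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count, the "hello world" of MapReduce.
lines = ["hadoop stores big data", "hive queries big data"]
mapped = map_phase(lines, lambda line: ((word, 1) for word in line.split()))
counts = reduce_phase(shuffle(mapped), lambda key, values: sum(values))
print(counts)  # e.g. "big" and "data" each appear twice
```

HCatalog's pitch, in these terms, is that the records feeding the map step come from a shared, table-like representation rather than raw files, so Hive, Pig, and hand-written MapReduce jobs all see the same data.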
The CEO tells us that as a data representation layer, HCatalog will be extensible in ways we've not seen before with Hadoop. "You can store those tables in Hadoop today, but in the future, we anticipate working with partners to integrate third-party data sources into those MapReduce engines, and be used against a local Hadoop cluster or against other parallel data sources in your organization, and other databases or object stores can be integrated."
In time, the HCatalog layer will support multiple programming languages and paradigms, the CEO tells us. This inclusion, the company hopes, will spark interest in an ecosystem of systems integrators and service providers, 16 of which have already signed up as charter members of HDP's partner program.
Standardizing Hadoop deployment
The HDP rollout will mark the debut of Ambari, an Apache Incubator project that will serve as a Hadoop installation and management system. Its purpose is to expedite Hadoop installations and upgrades, especially across multiple clusters.
"Long term, the ambition is also to address problems with administration and monitoring," Baldeschwieler says, "so that Ambari will be a completely open source, Apache [project]... with open APIs that can integrate into partner systems that customers have already selected to use to manage their data centers."
All these advancements made with respect to HDP, remarks the CEO, will apply to Hadoop across the board; nothing will be exclusive to one vendor's implementation.
"At this point, Hortonworks does not even have an independent software repository," he reminds us. "We write all of our code directly in Apache. That's a key differentiator for us. We really believe that Hadoop should be complete and should be free. So Hortonworks Data Platform will be completely open source, Apache software."
At last, security and functionality from one source
HDP's implementation of Hadoop will be based on version 0.20.205, which Baldeschwieler says will be the first release to support both security and HBase (Apache's random-access Hadoop database) simultaneously. Previously, security features had been available only as bolt-ons to Apache HBase, he admits. Both capabilities have been in the Hadoop community for some time, he explains, but because security was developed by the Yahoo team, it never shipped through Apache's stable release channel. That changes with HDP, with help from others in the Hadoop community, including Facebook.
From now on, the CEO adds, future Hadoop releases will come entirely from Apache on a quarterly basis. He concedes that there has been some fragmentation and confusion until now over which contributor was responsible for releasing which component.
At about this time 12 months ago, the people who would become Hortonworks' management team began discussions with venture capital sources about spinning off from Yahoo and forming a company. Now Eric Baldeschwieler is not only a CEO, but the head of the company behind the most aggressive and fastest-growing data platform in today's market.
"Previously, there had been a couple of Hadoop companies, but it was very clear from their business models and their org structures that they weren't focused on building Hadoop," he tells RWW. "They were relying on us to do that. But last year when I met Rob Bearden [from Benchmark Capital, now Hortonworks COO], it was clear that here was the ability to marry a business team that understood open source software with our technical team, to build this company focused not just on packaging and distributing Hadoop, but advancing it."
The HDP technology preview will open to the general public later this year, with general availability of the release edition of the platform in early 2012.