Not long ago at all, Oracle laid claim to building the systems that managed a majority of the world's data. This year, the group making the same claim is a spinoff from Yahoo.
The onset of Internet-size databases and cloud architectures brought about architectural quandaries about the nature of relational databases that no one had ever thought they'd have to ponder just a few years earlier. Making tremendous strides in this department in a very short period of time -- literally last June -- is Hortonworks, the newly independent company that produces the Apache-licensed open source database system Hadoop, and the latest partner of Microsoft. This week, ReadWriteWeb talks with Hadoop co-creator and Hortonworks CEO Eric Baldeschwieler.
Our topics: How Hadoop will integrate with Windows Server; whether the structure of data is better suited for Hadoop than other systems, as some NoSQL proponents claim; and the surprising revelation that Baldeschwieler does not perceive Hadoop as an integral part of the NoSQL camp. This despite the fact that Hadoop and NoSQL are often said in the same breath.
Scott Fulton, ReadWriteWeb: Do you perceive at this point Hadoop as a role within Windows Server, or deserving of a role, in the same way that Domain Name Server is a role and Internet Information Server is a role [within Server Manager]?
Eric Baldeschwieler: It's different, in that Hadoop is assembled out of a number of computers. It's kind of the opposite, where I think Windows Server is a perfectly fine component for building a Hadoop cluster. A Hadoop cluster needs to be built on a set of machines that have an operating system. I don't think that Hadoop becomes a service of Windows; rather, you build Hadoop services out of collections of computers, and they could run Windows or they could run whatever other operating system the organization is most comfortable with.
The exciting thing about this partnership [with Microsoft] to us is that it makes Hadoop accessible to a large number of organizations that have selected Windows as their preferred operating system.
RWW: So rather than as a function or a unit of Windows, you prefer to see it as a function of the companies that have put together a collection of all their computers, all of which happen to run Windows?
EB: All of which may happen, yes. I would think of it in similar terms to a cloud offering. I think in some sense Hadoop is a data cloud. Once you've assembled the Hadoop service, you shouldn't care too much what operating system it runs on. Now, you may choose to use applications that have been customized for Windows with Hadoop... There are opportunities to do differentiated things, but Hadoop itself is a very agnostic OS. The services that we're going to build for Hadoop will continue to be operating system-agnostic.
Is cloud data different from "regular" data?
RWW: Is there something intrinsic to the nature of structured data, the class we typically see with RDBMS like SQL Server, Oracle, DB2, that makes it unworkable in a cloud configuration, where it's spread out over multiple processors and devices, and which makes Hadoop suitable as the alternative?
EB: There's nothing about the data per se. What I would say is, Hadoop is much, much simpler than competitive offerings. It's really a very simple layer that's been built from the bottom up with two goals: One is massive scale-out. None of these traditional systems were built with a design point of reaching thousands, or tens of thousands, of nodes. Competing databases, for example, will go up to a dozen or two dozen nodes in a huge installation. So Hadoop was built from the bottom up to be really broadly scalable.
Another differentiation is, Hadoop was built from the bottom up to be built on very low-cost commodity hardware. It's built with the assumption that the hardware it's running on will break, and that it must continue to work even if the hardware breaks. Those two goals of the Hadoop design -- being able to run on thousands of nodes, and being able to run on commodity hardware -- are just very different design points than those traditional systems. Traditional systems have basically exported that part of the problem down to very complicated hardware solutions. There are many, many problems which can be solved much more simply if you're not thinking about scale-out... It's like looking at two animals that evolved from a very, very different ecosystem.
RWW: There are folks who draw a dotted line around the concept of NoSQL, and put the Hadoop elephant squarely in the middle of that. They tell me the basic principle of NoSQL is that big data should never have been designed for a structured query language. It should not be designed for unions and joins and all the mathematical extrapolation; big data was meant to be simpler, key/value pairs, and that architecture of key/value should be embraced as something that all data should eventually follow. Would you agree with that, even partly?
EB: Not as stated. To begin with, Hadoop supports several data processing languages that have exactly the primitives you listed. Hive supports a subset of SQL directly; and Pig supports all of the classic table operators as well -- joins, unions, all the sets, and relational operators. So we're not at all adverse to doing relational algebra and that kind of processing. Hadoop is also really a processing engine.
I think, in general, when you describe the NoSQL movement, I think more of real-time stores such as MongoDB, Riak, or Redis -- there's dozens of these guys. Hadoop is a slightly different beast. The problem they're trying to solve is how to respond to queries in milliseconds with a growing core of data. Hadoop is different again because it's not about millisecond response; Hadoop is really focused on batch processing. Hadoop can really lower the price point by using commodity hardware and by making assumptions that your work will have a certain amount of latency to it, it's very data scanning-intensive instead of seeking-intensive. SQL databases, and even NoSQL stores, are architected so that you can pull up a particular row of data very, very quickly. Hadoop is architected so that you can scan a petabyte of data very, very quickly. There's different design points.
With all these things, I think many of these systems defy simple categorization. As Hadoop evolves, it's able to do a better job with lower-latency work. As databases evolve, they're able to scale to larger systems; and as NoSQL stores evolve, they're able to handle more complex queries.
The place where the other systems exceed what Hadoop can do is in what I would call high-bandwidth random access. If you really need to pull a set of unrelated key/value pairs out of your data as fast as possible, other systems are going to be better. But if you want to scan the last year's data, and understand or find new patterns, or run a machine-learning algorithm to infer the best way to personalize pages for users -- lots and lots of data, where the requirement is to process all the data, that's where Hadoop really excels.
Complementary, not transitional
RWW: Microsoft's folks tell me they're working on a system, that they'll be putting in beta shortly, to help users transition databases for use with Hadoop. I was going to ask, why wouldn't all databases qualify for this transition? But if I understand what you're saying correctly, there's still something to be said for the type of analysis where you're looking for the one thing in a million, the needle in the haystack, instead of a sequential read?
EB: Yea, there are workloads that traditional databases are going to do more cost-effectively than Hadoop. There's no doubt about that. You're not going to see people replacing transactional databases with Hadoop, when what they're trying to build is transactional systems. What does make sense in nearly all cases that I'm aware of is to take a copy of the data you have out of your transactional systems, and also store them on Hadoop. The incremental cost of doing that is very low, and it lets you use that data for the things that Hadoop is good for.
For example, what we're seeing a lot of at Yahoo is, data lives in transactional systems for weeks and months -- maybe 30, 60, 90 days. Because that's how long it's needed to serve production needs. But then there's data which should be kept for one, two or more years inside Hadoop. That lets you do investigation at a really granular level year-over-year, multi-year. It really lets you understand the data in a different way, to have it all available. Whereas the online [data] simply isn't all available because it's not cost-effective to keep petabytes of data in your transactional system. And even if it is, one of the things we see a lot of in production use cases is, the kind of work that people want to do with Hadoop can just crush one of those traditional systems.
So you see people partitioning the work; for example, there's a production system that was both serving user requests that required a transactional database, and that's also doing reporting -- aggregate activity of the users over [several] days, for example. What happened was, that reporting function was just bogging down the database. It was no longer able to meet its transactional requirements in a timely manner, as the load grew. So what they did was move the aggregational work into Hadoop, and left the transactional work in the transactional database. Now they're getting their reports more quickly and they're fielding many more transactions per second off the existing traditional system.
RWW: So there are complementary use cases that can not only be explained but diagrammed for people -- how they can partition their systems so, when they need that granular observation over one to two years, they have a way --
EB: Absolutely. The same applies to business analysis systems. Those also don't scale [very well]. So what we see is, people will use Hadoop to generate a new perception of the data that's useful in these systems, [which] can then be used by analysts who are trained on those tools already to explore that data. That's something that Yahoo does a lot of; they'll explore what they like to call "segments" of the data using traditional tools, but they can only host a tiny fraction of all their data in the data marts and queues. So they'll use Hadoop to produce that view, and load it into a data mart... It's not a case of Hadoop replacing it, but using Hadoop where it complements the other system well.