Those new to open source won’t remember just how much of the early code amounted to little more than crappy-but-free clones of popular proprietary products. Boy, how times have changed.
Open source, once a clumsy (but free!) imitator of proprietary innovation, is now taking the lead on industry innovation, with Big Data being the most obvious example. While this is a hugely positive industry shift, it also introduces complexities. Namely, with so much exceptional open source software contending to power your next Big Data project, how do you choose which to use?
Opening Up Innovation
Black Duck Software recently named its annual “Open Source Rookies of the Year,” pulling data from thousands of projects on project activity, commit pace, project team attributes, and other factors. Spanning cloud and virtualization, mobile, social media and more, the winners reflect the ever-increasing scope of code that is successfully developed in the open, rather than behind closed doors.
Nowhere is this trend more evident than in Big Data.
As Cloudera co-founder Mike Olson declares, “No dominant platform-level software infrastructure has emerged in the last ten years in closed-source, proprietary form.” That’s a stunning assessment, but it’s absolutely true. Open source may have come to life as an imitator, but it’s innovating at a frenetic pace in Big Data land.
Which may be a problem.
Spoiled By Open Source Riches
Big Data projects are now being released so quickly that developers struggle to keep up. If you’re just getting your feet wet with Hadoop, for example, you now also need to consider Spark, Samza or a variety of other oddly-named but increasingly important Big Data tools.
Importantly, these tools are largely being born within enterprises like LinkedIn that have serious Big Data needs that no commercial software can solve. Even the National Weather Service has jumped in, open sourcing the code that powers its global forecast system.
While most companies won’t need such niche code, they may want the sorts of things released by the big Web companies. Take, for instance, LinkedIn’s release of Apache Samza:
The LinkedIn-developed framework is designed to process complex real-time workloads that require special handling after ingestion. It embeds a local key-value store in every stream that makes it possible to store the kind of contextual information needed to carry out advanced operations such as merging datasets locally instead of having to query a remote system every time they’re needed.
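To make the local-state idea in that description concrete, here is a minimal sketch in Python. It is not Samza's API (Samza itself runs on the JVM); the class, stream names and message shape are all hypothetical, and a plain dictionary stands in for the embedded key-value store. One stream fills the local store, and a second stream is enriched from it without a remote query.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical message shape: (stream_name, key, value). Not Samza's API.
Message = Tuple[str, str, str]
Joined = Tuple[str, str, str]


@dataclass
class LocalJoinTask:
    """Toy task that keeps its state in an in-memory dict."""

    # Stand-in for the local key-value store held next to the processing code.
    store: Dict[str, str] = field(default_factory=dict)
    output: List[Joined] = field(default_factory=list)

    def process(self, message: Message) -> None:
        stream, key, value = message
        if stream == "profiles":
            # Table-like stream: remember the latest value seen for each key.
            self.store[key] = value
        elif stream == "page_views":
            # Event stream: enrich from local state instead of a remote lookup.
            profile = self.store.get(key, "unknown")
            self.output.append((key, value, profile))


def run(messages: Iterable[Message]) -> List[Joined]:
    """Feed messages through a single task and return the joined records."""
    task = LocalJoinTask()
    for message in messages:
        task.process(message)
    return task.output


if __name__ == "__main__":
    events = [
        ("profiles", "alice", "engineer"),
        ("page_views", "alice", "/jobs"),
        ("page_views", "bob", "/home"),
    ]
    for row in run(events):
        print(row)
```

In this toy version the join costs one dictionary lookup per event. A real deployment would also have to handle persistence, partitioning and recovery, all of which the sketch leaves out.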
This leads to fantastic performance. It also leads to the question: what should a developer use to tackle her organization’s data load?
On the database side, there are hundreds of options, ranging from NoSQL databases like MongoDB and Cassandra to relational mainstays like Oracle and MySQL. Should a developer choose the most popular database, picking from a list like DB-Engines’ ranking? That’s one approach, but you could easily end up with a big mismatch between the workload and the tool managing it.
If this seems like a trivial problem, it’s not. At all. I spent years working for Big Data infrastructure providers, and now work for a company trying to make sense of the deluge of open source Big Data tools. It’s hard to keep up, and very difficult to know which to use.
Closing Off Choices
One reason that Amazon Web Services (AWS) has become the go-to public cloud is that the company has managed to simultaneously offer a broad array of open source solutions to run (supported and unsupported) on its cloud, and a suite of proprietary services for everything from email to data warehousing.
Developers, anxious to “get stuff done,” can turn to AWS and know that they’ll have both a variety of options and the safety of a paved path.
Microsoft Azure has followed suit. Not content to roll out a Hadoop-based analytics service, for example, Microsoft is now close to releasing Cosmos, its parallel processing and storage service. Or take the company’s support for MongoDB, an open source document database, which appeals to those who want the popular NoSQL database. At the same time, Microsoft has rolled out its own document database as a service, for those who want a document database but may prefer Microsoft’s packaging of it.
Microsoft, in short, wants to provide choice to its customers, but curated and nicely packaged.
This looks like the future of open source infrastructure: free to download, but perhaps more useful rolled into a cloud service that removes complexity (and choice). It may not be what the open source crowd would prefer, but it may end up being the ideal way to turn open source Big Data innovation into solutions mainstream enterprises can actually use.
Photo by George Thomas