Big Data Depends On Big Community, Not Big Money

There’s hot. And then there’s smoking hot.

Earlier this year, I wrote that Apache Spark was hot like Apache Hadoop was hot five years ago. As vendors gather at Spark Summit, a data-science conference taking place in San Francisco this week, and developers crowd birds-of-a-feather sessions, the tea leaves suggest that this MapReduce replacement for big-data computation may be outpacing its antecedents.

In the next five years, it’s almost certain that Apache Spark will be at the foundation of how most organizations manage datasets and power computation in large-scale production environments.

I spoke to Dean Wampler, Typesafe’s architect for big data products and services, on Spark’s progress. I asked him why open source frameworks in general are keeping the traditional big-systems vendors on the sidelines in the new big-data stack.

ReadWrite: What’s the latest evidence that Spark is on the rise?

Dean Wampler: It’s the number of Spark integrations that we’re seeing from a wide range of enterprise vendors, as well as the number of contributors to the Apache Spark project itself. The market follows the demand, and we saw this in Hadoop itself—where all kinds of traditional vendors of database tools and visualization tools were suddenly falling all over themselves to figure out different ways to integrate with Hadoop.

I think thriving ecosystem support is the strongest signal that a new framework has crossed Geoffrey Moore’s famous chasm and reached “killer app” status.

RW: What are the use cases driving the enthusiasm for Spark?

DW: I’ve argued for some time that the word “big” in big data put the focus on the quantity of data, but most enterprises are really wrangling data more on the terabyte level than the petabyte level.

For them, the real advantage is the ability to manipulate and integrate different data sources in a wide variety of formats. Overnight batch processing of large datasets was the start, but it only touches a subset of the market requirements for processing data.

Now, speed is a strong driver for an even broader range of use cases. For most enterprises that generally means reducing the time between receiving data and when it can be processed into information. This can include traditional techniques like joining different datasets, and creating visualizations for users, but in a more “real-time” setting. Spark really thrives in working with data from a wide variety of sources and in a wide variety of formats, from small to large sizes, and with great efficiency at all size scales.

Furthermore, developers get a much more efficient and flexible programming model compared to MapReduce, and non-developers even get SQL queries.
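To give a rough sense of the programming-model point above, here is a minimal sketch in plain Python. It is not Spark's API and it runs on a single machine; the function name `word_counts` and the sample `lines` are hypothetical, chosen only for illustration. It shows the style being described: a job written as a short chain of transformations over a collection rather than as separate hand-written map and reduce classes.

```python
from collections import Counter
from itertools import chain


def word_counts(lines):
    """Count words with a map -> flatten -> reduce style pipeline."""
    # "map" step: split each line into lowercase words
    words_per_line = map(lambda line: line.lower().split(), lines)
    # "flatten" step: merge the per-line lists into one stream of words
    words = chain.from_iterable(words_per_line)
    # "reduce" step: tally how often each word occurs
    return Counter(words)


# Hypothetical sample input
lines = ["Spark is fast", "Spark is flexible"]
counts = word_counts(lines)
```

A framework like Spark applies the same kind of pipeline to datasets spread across a cluster, which is where the efficiency claims in the interview come in; this sketch only illustrates the shape of the code.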

RW: Why do you think the industry discussions about the new big data stack have relatively little influence from the big systems vendors (IBM, HP, Oracle)?

DW: In general, I think what’s really special about this movement is that the power of the community—the number of committers, the rate of features and bug fixes—has greatly exceeded what would be possible for any one vendor to introduce by way of a proprietary platform.

In other words, you get the traditional benefits of a vibrant community focused on a popular open source software project.

You tend to think of open source disrupting existing markets, but this streaming data/fast data movement was really born outside of commercial, closed-source data tools. First you had large Internet companies like Google solving the fast data problem at a scale never seen before, and now you have open-source projects meeting the same needs for a larger community.

For example, Apache Mesos emerged as a distributed-systems kernel to improve cluster utilization in a lab environment at UC Berkeley’s AMPLab. Its success at abstracting low-level tasks for managing infrastructure inspired the creation of Spark, which was initially built to run on top of Mesos, as the story goes, in less than a week in a cabin in Colorado. These were relatively young computer scientists, and their work would become the Berkeley Data Analytics Stack (BDAS) for big data processing.

I think the big systems players really are on the sidelines here, compared to the prominent role they played with bringing Linux to market. This wave of innovation around these Apache frameworks in big data is going to happen with or without the participation of those large vendors, and it carries a lot of implications for their future business models.

RW: How important is the question of where and how you run Spark?

DW: As you get into the realm of fast data, you start to see a lot of new challenges—or opportunities for performance improvement, depending on how you look at it.

The notion of data locality has become extremely important in big data—meaning, the proximity between jobs and data stores. You also want the frameworks you rely on to complement each other and collaborate in using resources efficiently and effectively.

It’s been common for human operators to manage jobs and frameworks manually across servers, but this simply does not scale, especially in a more dynamic world of fast data, where a single job might need to scale up and down on demand.

Similar to the way that Apache Spark has been gathering a critical mass of ecosystem players, we see Mesos gathering steam as the ring to orchestrate all these frameworks and utilize cluster resources most efficiently. That’s one big reason Typesafe has been working closely with Mesosphere.
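The data-locality idea mentioned above can be illustrated with a toy placement rule. This is a sketch under stated assumptions, not how Mesos or Spark actually schedule work: the function `pick_node`, the block and node names, and the "first free node" fallback are all hypothetical. It shows only the core preference, which is to run a task on a node that already holds its data and to fall back to any free node otherwise.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Return (node, is_local) for a task that reads block_id.

    Prefers a free node that already stores the block; otherwise
    falls back to the first free node, or (None, False) if none is free.
    """
    holders = block_locations.get(block_id, set())
    local = [node for node in free_nodes if node in holders]
    if local:
        return local[0], True
    if free_nodes:
        return free_nodes[0], False
    return None, False


# Hypothetical cluster state: which nodes store which data blocks
block_locations = {
    "block-1": {"node-a", "node-b"},
    "block-2": {"node-c"},
}
free_nodes = ["node-b", "node-d"]

# block-1 is stored on node-b, which is free, so the task runs locally
placement_1 = pick_node("block-1", block_locations, free_nodes)
# block-2 lives only on node-c, which is busy, so the task runs remotely
placement_2 = pick_node("block-2", block_locations, free_nodes)
```

Doing this by hand for every job is the manual operation described above as not scaling; a cluster manager automates the same decision across many frameworks at once.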

RW: Spark is written in Scala, and Typesafe’s team were the original Scala language creators. What’s the connection between Scala and big data?

DW: It’s been my experience that toolkits like Spark—and predecessors like Scalding that Twitter wrote—really opened people’s eyes to how concise big data applications can be when you write in a functional language. Scala is a functional JVM language and tools like Spark make a strong case for Scala. I’ve met many people who otherwise didn’t see the point in using Scala, but suddenly got excited about it when they saw the Spark API for Scala.

The great thing about Spark is it also has a really great Java API, if you can’t switch languages. If you’re from the data science world and you’re used to Python and R, there’s a great API for Python, and a newly released R API.

A classic way we’ve had to work in the past is data scientists would model a problem in their favorite languages like R and Python, and then they would have to hand the model to a team of developers to port it to Java, so it could run in MapReduce. That was not only tedious, error-prone, and expensive in engineering hours, but it created a time delay in being able to deploy it.

Companies today can’t afford being late to market. Besides, it was an awful way to work. Now we’re starting to get closer to the model that people can write in the language they prefer.

Lead image courtesy of Shutterstock