The Big-Data Tool Spark May Be Hotter Than Hadoop, But It Still Has Issues

Hadoop is hot. But its kissing cousin Spark is even hotter.

Indeed, Spark is hot like Apache Hadoop was half a decade ago. Spawned at UC Berkeley’s AMPLab, Spark is a fast data processing engine that works in the Hadoop ecosystem, replacing MapReduce. It is designed to handle both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and iterative algorithms of the kind commonly found in machine learning and graph processing.

San Francisco-based Typesafe, sponsors of a popular survey on Java developers I wrote about last year and the commercial backers of Scala, Play Framework, and Akka, recently conducted a survey of developers about Spark. More than 2,000 (2,136 to be exact) developers responded. Of the findings, three conclusions jump out:

1. Spark awareness and adoption are seeing hockey-stick-like growth. Google Trends confirms this. The survey shows that 71% of respondents have at least evaluation or research experience with Spark, and 35% are now using it or plan to use it.
2. Faster data processing and event streaming are the focus for enterprises. By far the most desirable features are Spark’s vastly improved processing performance over MapReduce (over 78% mention this) and the ability to process event streams (over 66% mention this), which MapReduce cannot do.
3. Perceived barriers to adoption are not major blockers. When asked what’s holding them back from the Spark revolution, respondents mentioned their own lack of experience with Spark and the need for more detailed documentation, especially for more advanced application scenarios and performance tuning. They also mentioned Spark’s perceived immaturity in general and its integration with other middleware, like message queues and databases. Lack of commercial support, which is still spotty even from the Hadoop vendors, was also a concern. Finally, some respondents mentioned that their organizations aren’t in need of big data solutions at this time.

I spoke with Typesafe’s architect for Big Data Products and Services, Dean Wampler (@deanwampler), about the rise of Spark. Wampler recently recorded a talk on why he thinks Spark/Scala are rapidly replacing MapReduce/Java as the most popular Big Data compute engine in the enterprise.

Striking The Spark

ReadWrite: For those venturing into Spark, what are the most common hurdles?

Wampler: It’s mostly around things like acquiring expertise, having good documentation with deep, non-trivial examples. Many people aren’t sure how to manage, monitor, and tune their jobs and clusters. Commercial support for Spark is still limited, especially for non-YARN deployments. However, even among the Hadoop vendors, support is still spotty.

Spark still needs to mature in many ways, especially the newer modules, such as Spark SQL and Spark Streaming. Older tools, like Hadoop and MapReduce, have had a longer runway and hence more time to be hardened and expertise to be documented. All these issues are being addressed and they should be resolved relatively soon.

RW: I hear people ask “where are you running Spark?” all the time, suggesting a pretty broad range of resource management strategies, e.g., standalone clusters, YARN, Mesos. Do you believe industry will tend to run Big Data clusters in isolation, or do you see the industry eventually moving to running Big Data clusters alongside other applications in production?

DW: I think most organizations will still use fewer, larger clusters, just so their operations teams have fewer clusters to watch. Mesos and YARN really make this approach attractive. Conversely, Spark makes it easier to set up small, dedicated clusters for specific problems. Say you’re ingesting the Twitter firehose. You might want a dedicated cluster tuned optimally for that streaming challenge. Maybe it forwards “curated” data to another cluster, say a big one used for data warehousing.

Keeping The Spark Alive

RW: Is the operations side of Spark different than the operations side of MapReduce?

DW: For batch jobs, it’s about the same. Streaming jobs, however, raise new challenges.

For a typical batch job, whether it’s written in Spark or MapReduce, you submit a job to run, it gets its resources from YARN or Mesos, and once it finishes, the resources are released. However, in Spark streaming, the jobs run continuously, so you might need more robust recovery if the job dies, so stream data isn’t lost.
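As a rough illustration of the recovery concern Wampler describes, here is a minimal sketch in plain Python of a long-running consumer that checkpoints its progress so a restart can resume without losing or double-counting data. It does not use Spark's API; the function names (`process_stream`, `demo`), the JSON checkpoint file, and the `fail_at` crash switch are all hypothetical and exist only for this example.

```python
import json
import os
import tempfile


def process_stream(events, checkpoint_path, fail_at=None):
    """Sum integer events, saving progress so a restart can resume."""
    # Restore the last saved state, if a checkpoint exists.
    state = {"offset": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["offset"], len(events)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")  # hypothetical failure
        state["total"] += events[i]
        state["offset"] = i + 1
        # Write to a temp file and rename it, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, checkpoint_path)
    return state["total"]


def demo():
    """Run once with a simulated crash, then restart from the checkpoint."""
    events = [1, 2, 3, 4, 5]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        try:
            process_stream(events, path, fail_at=3)
        except RuntimeError:
            pass  # the job "died" after handling three events
        return process_stream(events, path)
```

Because the running total and the offset are saved together, the restarted run picks up at the fourth event and still arrives at the full sum. A real streaming system has to make the same guarantee across many machines, which is where the operational difficulty comes from.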

Another problem is resource allocation. For a batch job, it’s probably okay to give it a set of resources and have those resources locked up for the job’s life cycle. (Note, however, some dynamic management is already done by YARN and Mesos.) Long-running jobs really need more dynamic resource management, so you don’t have idle resources during relatively quiescent periods, or overwhelmed resources during peak times.

Hence, you really want the ability to grow and shrink resource allocations, where scaling up and down is automated. This is not a trivial problem to solve and you can’t rely on human intervention either.
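To make the idea of automated scaling concrete, here is a small sketch of one possible policy: size the worker pool to the current backlog, clamped between a floor and a ceiling. This is an assumption-laden toy, not how Spark, YARN, or Mesos actually allocate resources; `target_workers`, `simulate`, and every parameter name are hypothetical.

```python
import math


def target_workers(backlog, per_worker_capacity, min_workers=1, max_workers=10):
    """Return how many workers to run for the current backlog."""
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    # Workers needed to clear the backlog in one interval, rounded up.
    needed = math.ceil(backlog / per_worker_capacity)
    # Clamp so quiet periods keep a floor and peaks respect a ceiling.
    return max(min_workers, min(max_workers, needed))


def simulate(backlogs, per_worker_capacity=100):
    """Apply the policy to a series of observed backlog sizes."""
    return [target_workers(b, per_worker_capacity) for b in backlogs]
```

For example, `simulate([0, 250, 5000, 120])` yields `[1, 3, 10, 2]`: the pool shrinks to the floor when idle and is capped at the ceiling during the spike. A production policy would also need smoothing so that it does not thrash between sizes on every fluctuation.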

RW: Let’s talk about the Scala / Spark connection. Does Spark require knowledge of Scala? Are most people using Spark also well versed in Scala? And is it more the case that Scala users are those who tend to favor Spark, or is Spark creating a “pull” effect into Scala?

DW: Spark is written in Scala and it is pulling people towards Scala. Typically they’re coming from a Big Data ecosystem already, and they are used to working with Java, if they are developers, or languages like Python and R, if they are data scientists.

Fortunately for everyone, Spark supports several languages: Scala, Java, and Python, with R support on the way. So people don’t necessarily have to switch to Scala.

There has been a lag in the API coverage for the other languages, but the Spark team has almost closed the gap. The rule of thumb is that you’ll get the best runtime performance if you use Scala or Java, and you’ll get the most concise code if you use Scala or Python. So, Spark is actually drawing people to Scala, but it doesn’t require you to be a Scala expert.
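For a sense of the conciseness Wampler mentions, here is the classic word-count task written in a functional style in plain Python. This is only an illustration of the style; it uses the standard library rather than Spark's API, and the function name `word_count` is hypothetical.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    # Split every line into lowercase words and flatten into one stream.
    words = chain.from_iterable(line.lower().split() for line in lines)
    # Tally each distinct word.
    return Counter(words)
```

Calling `word_count(["Spark is hot", "Hadoop is hot"])` counts "is" and "hot" twice each and "spark" and "hadoop" once each. The same task in classic MapReduce-style Java typically needs separate mapper and reducer classes plus job setup code.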

I like the fact that Spark uses the more mainstream features of Scala. It doesn’t require mastery of more advanced constructs.
