Guest author Peter Schlampp is the vice president of products at Platfora, a “Big Data” analytics platform provider.
Apache Spark has become a core technology for big data analytics in a surprisingly short period of time. This may lead cautious types to wonder if it will fade out just as quickly, which happens all too often in technology. On the contrary, I believe Spark is just getting started.
Over the past couple of years, as Hadoop exploded and “big data” became dominant, several things have become clear: First, the Hadoop Distributed File System (HDFS) is the right storage platform for all that data. Second, YARN (for resource allocation and management) is the framework of choice for those big data environments.
Third, and perhaps most importantly, there is no single processing framework that will solve every problem. Although MapReduce is an amazing technology, it doesn’t address every situation.
Spark, however, addresses many of the key issues in big data environments, which has helped fuel its phenomenal rise. It’s one reason my company, Platfora, has bet big on it. Our “Big Data Discovery” platform uses Apache Spark as an underlying technology to process and analyze big data, despite its young age. Here’s why.
We May Be Nearing The Age Of Spark
Organizations that rely on Hadoop need a variety of analytical infrastructures and processes to find the answers to their critical questions. They need data preparation, descriptive analysis, search, and more advanced capabilities like machine learning and even graph processing.
Companies need a toolset that meets them where they are, allowing them to leverage the skill sets and other resources they already have. Until now, a single processing framework that fits all those criteria has not been available.
This, however, is the fundamental advantage of Spark, whose benefits cut across six critical areas for companies that deal in the business of big data.
Advanced Analytics
Many large and innovative companies are looking to expand their advanced analytics capability. And yet, at a recent big data analytics event in New York, only 20% of the participants reported that they’re currently deploying advanced analytics across their organizations.
The other 80% said that their hands are full just preparing data and providing basic analytics. The few data scientists they have spend most of their time implementing and managing descriptive analytics.
Spark offers a framework for advanced analytics out of the box. It includes a tool for accelerated queries, a machine learning library, a graph processing engine, and a streaming analytics engine. Instead of trying to implement these analytics via MapReduce—which can be nearly impossible, even with hard-to-find data scientists—Spark provides pre-built libraries, which are easier and faster to use.
This frees the data scientists to take on tasks beyond just data preparation and quality control. With Spark, they can even ensure correct interpretation of the analysis results.
Simplification
One of the earliest criticisms of Hadoop wasn’t just that it was hard to use, but that it was even harder to find people who could use it. Although Hadoop has gotten simpler and more powerful with every subsequent iteration, this complaint has persisted to this day.
Instead of requiring users to understand a variety of complexities, such as Java and MapReduce programming patterns, Spark was built to be accessible to anyone with knowledge of databases and some scripting skills (in Python or Scala).
For businesses, it is much easier to find people who can understand both your data and the tools to process it. Vendors like us, meanwhile, can develop on top of Spark and bring new innovation to businesses faster.
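To make the contrast above a little more concrete, here is a minimal sketch in plain Python. It does not use the Hadoop or Spark APIs; every function name is hypothetical and the two sample lines are made up. It only illustrates the difference between wiring together explicit map, shuffle and reduce phases and writing the same word count as a short script.

```python
from collections import Counter, defaultdict

# Made-up sample input, for illustration only.
LINES = [
    "spark makes big data simple",
    "big data needs simple tools",
]


# MapReduce-style: three explicit phases the developer wires together.
def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    return [(word, 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    # Group the emitted values by key, as a framework would between phases.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    # Collapse each group of counts into a single total per word.
    return {word: sum(counts) for word, counts in groups.items()}


def word_count_mapreduce_style(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Scripting-style: the same result as one short pipeline.
def word_count_scripting_style(lines):
    return dict(Counter(word for line in lines for word in line.split()))
```

Both functions return the same counts for the sample input. The second is closer to what someone with database and scripting experience would write without first learning the phase-by-phase pattern.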
Multiple Languages
SQL doesn’t address all the challenges of big data analytics, at least not on its own. We need more flexibility in getting at the answers, more options for organizing and retrieving data and moving it quickly into an analytics framework.
Spark leaves the SQL-only mindset behind, opening the data up to the quickest and most elegant way of moving into analysis, whatever it might be.
Faster Results
As the pace of business continues to accelerate, so does the need for real-time results.
Spark provides parallel in-memory processing that returns results many times faster than approaches that depend on disk access. Near-instant results eliminate delays that can significantly slow business processes and incremental analytics.
As vendors begin to build applications on Spark, dramatic improvements to the analyst workflow will follow. Accelerating the turnaround time for answers means that analysts can work iteratively, homing in on more precise, and more complete, answers. Spark lets analysts do what they are supposed to do: find better answers faster.
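The iterative workflow described above is where keeping data in memory matters most. The following is a rough sketch in plain Python, not Spark code: the file name, the sample records and all function names are assumptions made for illustration. It shows an analyst asking several questions of the same data, once by re-reading the file for every question and once by loading it a single time and answering from memory.

```python
import csv
import os
import tempfile


def write_sample(path):
    # Hypothetical sales records, written to disk once.
    rows = [("east", 120), ("west", 80), ("east", 40), ("north", 60)]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def load(path):
    # Read and parse the whole file; this is the step that touches disk.
    with open(path, newline="") as f:
        return [(region, int(amount)) for region, amount in csv.reader(f)]


def total_for(rows, region):
    # Sum the amounts for one region.
    return sum(amount for r, amount in rows if r == region)


def disk_bound_totals(path, regions):
    # Every question re-reads and re-parses the file.
    return {region: total_for(load(path), region) for region in regions}


def in_memory_totals(path, regions):
    # Load once, keep the parsed rows in memory, answer every question from them.
    cached = load(path)
    return {region: total_for(cached, region) for region in regions}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sales.csv")
        write_sample(path)
        regions = ["east", "west", "north"]
        return disk_bound_totals(path, regions), in_memory_totals(path, regions)
```

Both approaches give identical answers. The difference is that the second reads the file once no matter how many questions follow, which is the property that makes repeated, exploratory queries cheap.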
No Discrimination Or Preference For Hadoop Vendors
All of the major Hadoop distributions now support Spark, and with good reason: It’s vendor-neutral, which means it doesn’t tie the user to any specific provider.
Due to Spark’s open-source nature, businesses are free to create a Spark-based analytics infrastructure without worrying about what happens if they change Hadoop vendors later. If they make a switch, they can bring their analytics with them.
High-Growth Adoption
Apache Spark achieved momentum in a very short time. Late in 2014, it tied for first place in the Daytona GraySort 100TB benchmark, a world record in sorting.
Whenever a service, product or technology quickly grabs attention, there’s usually a rush to tear it down—whether to deflate the hype, reveal the bugs or otherwise debunk its promise.
But according to a recent survey by Typesafe, awareness of Spark is only growing. Covering a sample of more than 2,100 developers, the report showed that 71 percent of respondents have had some experience with the framework. Today, Spark has reached more than 500 organizations of all sizes, which are committing thousands of developers and extensive resources to the project.
Spark hasn’t yet solidified its position as one of the fundamental technologies for big data analytics environments, but it’s well on its way. In other words, this is just the beginning.
Lead photo by Chris Young