This New Open Source Project Is 100X Faster than Spark SQL In Petabyte-Scale Production

Baidu, like Google, is much more than a search giant. Sure, Baidu, with a $50 billion market cap, is the most popular search engine in China. But it’s also one of the most innovative technology companies on the planet. 

Also like Google, Baidu is exploring autonomous vehicles and has major research projects underway in machine learning, deep translation, picture recognition, and neural networks. These represent enormous data-crunching challenges. Few companies manage as much information in their data centers.

In its quest to dominate the future of data, Baidu has attracted some of the world’s leading big data and cloud computing experts to help it manage this explosive growth and build out an infrastructure to meet the demands of its hundreds of millions of customers and new business initiatives. Baidu understands peak traffic hammering on I/O and stressing the data tier. 

Which is what makes it so interesting that Baidu turned to a young open source project out of UC Berkeley’s AMPLab called Alluxio (formerly named Tachyon) to boost performance.

Co-created by one of the founding committers behind Apache Spark — also born at AMPLab — Alluxio is suddenly getting a lot of attention from big data computing pioneers that range from the global bank Barclays to Alibaba and engineers and researchers at Intel and IBM. Today Alluxio released version 1.0, bringing new capabilities to this software that acts like a programmable interface between big data applications and the underlying storage systems, delivering blazing memory-centric performance. 

Shaoshan Liu

I spoke to Baidu Senior Architect Shaoshan Liu about his experiences running Alluxio in production to find out more.

ReadWriteWhat problem were you trying to solve when you turned to Alluxio?

Shaoshan Liu: How to manage the scale of our data, and quickly extract meaningful information, has always been a challenge. We wanted to dramatically improve throughput performance for some critical queries.

Due to the sheer volume of data, each query was taking tens of minutes, or even hours, just to finish — leaving product managers waiting hours before they could enter the next query. Even more frustrating was that modifying a query would require running the whole process all over again. About a year ago, we realized the need for an ad-hoc query engine. To get started, we came up with a high-level of specification: the query engine would need to manage petabytes of data and finish 95% of queries within 30 seconds.

We switched to Spark SQL as our query engine. Many use cases have demonstrated its superiority over Hadoop MapReduce in terms of latency. We were excited and expected Spark SQL to drop the average query time to within a few minutes. But it did not quite get us all the way. While Spark SQL did help us achieve a 4-fold increase in the speed of our average query, each query still took around 10 minutes to complete.

Digging deeper, we discovered our problem. Since the data was distributed over multiple data centers, there was a high probability that a query would hit a remote data center in order to pull data over to the compute center: this is what caused the biggest delay when a user ran a query. It was a network problem. 

But the answer was not as simple as bringing the compute nodes to the data center.

RW: What was the breakthrough?

SL: We needed a memory-centric layer that could provide high performance and reliability, and manage a petabyte scale of data. We developed a query system that used Spark SQL as its compute engine, and Alluxio as the memory-centric storage layer, and we stress-tested for a month. For our test, we used a standard query within Baidu, which pulled 6TB of data from a remote data center, and then we ran additional analysis on top of the data.

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. And if all of the data was stored in Alluxio local nodes, it took about five seconds, flat — a 30-fold increase in speed. Based on these results, and the system’s reliability, we built a full system around Alluxio and Spark SQL.

RW: How has this new stack performed in production?

SL: With the system deployed, we measured its performance using a typical Baidu query. Using the original Hive system, it took more than 1,000 seconds to finish a typical query. With the Spark SQL-only system, it took 300 seconds. But using our new Alluxio and Spark SQL system, it took about 10 seconds. We achieved a 100-fold increase in speed and met the interactive query requirements we set out for the project.

In the past year, the system has been deployed in a cluster with more than 200 nodes, providing more than two petabytes of space managed by Alluxio, using an advanced feature (tiered storage) in Alluxio. This feature allows us to take advantage of the storage hierarchy, e.g. memory as the top tier, SSD as the second tier, and HDD as the last tier; with all of these storage mediums combined, we are able to provide two petabytes of storage space.

Besides performance improvement, what is more important to us is reliability. In the past year, Alluxio has been running stably within our data infrastructure and we have rarely seen problems with it. This gave us a lot of confidence. 

Indeed, we are preparing for larger scale deployment of Alluxio. To start, we verified the scalability of Alluxio by deploying a cluster with 1,000 Alluxio workers. In the past month, this cluster has been running stably, providing over 50 TB of RAM space. As far as we know, this is the largest Alluxio cluster in the world.

Facebook Comments