Apache has dished out another serving of Cassandra, the open source NoSQL database popular for handling big data. This time around, Cassandra gets improvements to its query language, plus tuning features that will help companies trying to boost performance with a mixture of magnetic media and solid state drives (SSDs). The changes speak to a maturing NoSQL database that’s well-suited for big data deployments, and its continued development helps maintain open source dominance in the big data/NoSQL market.
Cassandra 1.1 hits just a bit more than six months after Apache released Cassandra 1.0, in October 2011. The major features in 1.1 point to Cassandra’s focus on very large data sets.
Jonathan Ellis, vice president of the project and CTO of DataStax, pointed to several features that make 1.1 more than just a minor update. One of the most interesting is Cassandra’s support for intelligently mixing magnetic and SSD media.
Ellis says that a Cassandra deployment may have some tables that are updated more frequently than others, so it makes sense to put some tables on magnetic media (which is much slower) and other tables on SSD. Prior to the 1.1 release, Cassandra had no way of distinguishing between the two. This meant that if you mixed media, you could have very uneven results. The alternative, going all SSD or all spinning disks, was either very expensive (SSD) or much slower (magnetic media).
Cassandra deployments can hit hundreds of terabytes of data. The largest (known) production cluster, according to Apache, exceeds 300TB of data spread out across 400 machines. Investing in 300TB of SSD can be very pricey and doesn’t make much sense if only some of the data needs to be on SSD.
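The placement trade-off described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Cassandra itself decides anything: the table names, sizes, update rates and the SSD budget are all hypothetical, and the greedy "busiest tables first" rule is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    size_gb: float
    updates_per_sec: float


def place_tables(tables, ssd_capacity_gb):
    """Greedy placement: the busiest tables go to SSD while they still fit;
    everything else lands on magnetic disks."""
    placement = {}
    remaining = ssd_capacity_gb
    for t in sorted(tables, key=lambda t: t.updates_per_sec, reverse=True):
        if t.size_gb <= remaining:
            placement[t.name] = "ssd"
            remaining -= t.size_gb
        else:
            placement[t.name] = "magnetic"
    return placement


# Hypothetical workload: two small, hot tables and one large, cold one.
tables = [
    Table("sessions", size_gb=200, updates_per_sec=5000),
    Table("audit_log", size_gb=4000, updates_per_sec=20),
    Table("user_profiles", size_gb=300, updates_per_sec=800),
]
placement = place_tables(tables, ssd_capacity_gb=600)
print(placement)
```

With these made-up numbers the two hot tables fit on a modest SSD budget, while the large, rarely updated table stays on cheaper spinning disks.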
Another biggie in this release, says Ellis, is better self-tuning for performance: Cassandra’s self-tuning support has been extended to its caching layer.
Speak My Language
The Cassandra Query Language (CQL) has also been updated. Ellis says that one of the major improvements to CQL is the addition of composite primary keys, a feature that lets developers define a table’s primary key from more than one column. He adds that this helps to create better views of data and appeals mostly to organizations that are already using Cassandra.
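For readers unfamiliar with the term, a composite primary key is a single key built from several columns. The sketch below shows the general idea using Python’s built-in sqlite3 module as a stand-in; it is not CQL and does not talk to Cassandra, and the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only to illustrate the concept.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE timeline (
        user_id   TEXT,
        posted_at INTEGER,
        body      TEXT,
        PRIMARY KEY (user_id, posted_at)  -- one key, two columns
    )
    """
)

rows = [("alice", 1, "first"), ("alice", 2, "second"), ("bob", 1, "hello")]
conn.executemany("INSERT INTO timeline VALUES (?, ?, ?)", rows)

# The same user_id may appear many times, as long as posted_at differs.
# Repeating the full (user_id, posted_at) pair violates the key.
try:
    conn.execute("INSERT INTO timeline VALUES (?, ?, ?)", ("alice", 1, "dup"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# All rows for one user, ordered by the second key column.
alice_posts = [
    body
    for (body,) in conn.execute(
        "SELECT body FROM timeline WHERE user_id = ? ORDER BY posted_at",
        ("alice",),
    )
]
print(duplicate_rejected, alice_posts)
```

Neither column identifies a row on its own; only the pair does, which is what makes per-user, time-ordered views of the data easy to express.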
As CQL matures, it has adopted quite a bit from SQL. However, Ellis says that CQL won’t be a clone of SQL in the long run, as some features in SQL simply don’t make sense for a distributed database like Cassandra. The most obvious feature, says Ellis, is joins. “We don’t do joins. It’s a bad idea across multiple machines in a cluster. Some people think that the takeaway is that you do joins in the application instead of the database, which is the wrong idea. Whether you do it in the app or the database, it’s not a good idea in a distributed world.”
“In other words,” says Ellis, “we’re not looking to make Cassandra an OLTP [online transaction processing] hybrid. We’re keeping focused on parts that support a real-time workload. For analytics, we point to Hadoop and Hive support.”
I’ll Take the High Road, MongoDB Can Take the Low Road
How’s Cassandra doing in the bid for mindshare versus MongoDB? If you check out GitHub or Stack Exchange, which often provide an indication of which technologies are the most popular, you’ll see that MongoDB seems to have more developer interest. For example, a GitHub search for “Cassandra” turns up 535 repositories, while a search for MongoDB shows nearly 2,500. Not the most scientific survey, but there’s very little data so far on NoSQL deployments – and because these projects are open source, deployments are impossible to count accurately.
Ellis says that this makes sense, as Cassandra is geared much more toward the high end, while MongoDB is well-suited for “grassroots developers.”
“They’re [MongoDB] going after millions of deployments; Cassandra is going after thousands of deployments. We’re going after a market where your data doesn’t fit on one machine… our users are Adobe, Netflix, HBO and Twitter. Companies with lots of data.”
What’s really interesting about Cassandra, MongoDB and other so-called NoSQL databases is how open source projects have effectively sewn up the space. All of the relevant projects are open source, though they may have proprietary variants shipped by vendors that support them.
For many years, open source was seen as trailing behind proprietary projects. In the big data/NoSQL space, this has been turned on its head. Cassandra is a really good example of how openness is leading the development of next-generation infrastructure technology.