Here's the problem: Data has already gotten too big for its britches. Corporate mergers and takeovers are on the rise, businesses in both the private and public sectors face greater pressure to consolidate resources, and to boot, federal regulations now mandate privacy restrictions and security policies. Especially in the healthcare industry, the first "big data" technologies to emerge from the former Yahoo project that became Hadoop have been a godsend.

Hadoop breaks simple data stores free from the bounds of single volumes, enabling them to be distributed in shards across multiple storage devices. Normally a database system hasn't had to deal with encryption. If you encrypt the volume it's stored on, that should be good enough - at least, that's what the U.S. Dept. of Commerce's NIST agency said in 2007 (PDF available here). But that was before the big data problem was even identified, and years before the first Yahoo teams went to work on it.

Is Encrypting the Wire "Good Enough?"

Up to now, the rule has been this: If a volume housing a database can be encrypted, and the encryption of that volume is handled at the operating system level, that should be good enough. This is sometimes called wire-level encryption, and it's handled at the file system level. If you can encrypt an NTFS volume, for instance, then anything that presents the unencrypted contents of that volume transparently to the database manager should be, well, good enough. Anyone who steals the hard drive ends up with nothing he can read.

According to guidance published by the American Medical Association (PDF available here), if you use a "good enough" disk encryption system that follows NIST's standards, then when there is a security breach, you are exempt from having to notify your patients. "While HIPAA-covered entities and their business associates are not required to follow this guidance," the AMA's report reads, "if your practice does follow the specified technologies and methodologies, you will avoid having to comply with the extensive notification requirements otherwise required by the HITECH Act in the event of a security breach."

So there's a little incentive for healthcare services to encrypt their data stores. This is where the government's best-laid plans run smack into the great wall of progress. Applied to Hadoop, the same rule would read like this: If a volume that includes clusters of Hadoop data can be encrypted, and the encryption of that volume is handled at the operating system level, that should be good enough. But as the documentation for CDH3, Cloudera's latest commercial implementation of Apache Hadoop, clearly indicates (PDF available here), the security for the system is presumed to be provided at the access level, where an individual is granted or denied access to the system.

"The security features in CDH3 meet the needs of most Hadoop customers because typically the cluster is accessible only to trusted personnel," the documentation reads. "In particular, Hadoop's current threat model assumes that users cannot: 1. Have root access to cluster machines; 2. Have root access to shared client machines; 3. Read or modify packets on the network of the cluster... It should be noted that CDH3 does not support data encryption. RPC data may be encrypted on the wire, but actual user data is not encrypted and there is no built-in support for on-disk encryption. For most current Hadoop users, this lack of data encryption is acceptable because of the assumptions stated above. However, if customers someday need data encryption, that functionality can be added later and the current security features are an important prerequisite for a complete security solution."

If a user of an encrypted volume can read the volume, then evidently he has access, whether by grant or by force. And there's the real problem, because NIST specifications were written at a time when the operating system took care of the whole accessibility problem. Hadoop's standard security model is to accept that the user was granted access because, well, he's using the data, isn't he? If this gets to be a bother, then whenever that "someday" rolls around, we should be able to address the issue.

Enter Gazzang

That someday has already happened. The latest venture from Larry Warnock, the former executive of CMS pioneer Vignette Systems, is called Gazzang. For about four years, Gazzang has been producing ezNcrypt, an encryption solution that's now applied to MySQL databases. The company has had a version of ezNcrypt for Hadoop for a little while, but that version has lacked the kind of management tools that compel administrators to maintain the encryption keys on separate volumes. Today, Gazzang announced the release of a cloud-based encryption platform that provides customers with the encryption, the policy-making tools for securing and ensuring access, and the key management tools as a service.

"The cloud-based platform transparently encrypts and secures data 'on the fly' whether in the cloud or on premises, ensuring there is minimal performance lag in the encryption or decryption process," reads a data sheet published by Gazzang this morning (PDF available here). "The platform also includes advanced key management and access controls that help organizations meet compliance regulations and allow users to store their cryptographic keys separate from the encrypted data."

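To make the idea of keeping keys apart from the data a bit more concrete, here is a minimal Python sketch. It is not Gazzang's implementation or API: KeyStore, encrypt_record, and decrypt_record are hypothetical names invented for this illustration, and the cipher is a toy built from the standard library's HMAC-SHA256 so the example runs without extra packages. A real deployment would use a vetted cipher, such as AES from an audited library, and a real key management service.

```python
import hashlib
import hmac
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 in counter mode (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class KeyStore:
    """Hypothetical key manager, assumed to live apart from the data store."""

    def __init__(self) -> None:
        self._keys = {}    # key_id -> secret key bytes
        self._grants = {}  # key_id -> set of users allowed to use the key

    def create_key(self, key_id: str, allowed_users) -> None:
        self._keys[key_id] = secrets.token_bytes(32)
        self._grants[key_id] = set(allowed_users)

    def get_key(self, key_id: str, user: str) -> bytes:
        # The access policy is checked here, not by the file system.
        if user not in self._grants.get(key_id, set()):
            raise PermissionError(f"{user} may not use key {key_id}")
        return self._keys[key_id]


def _subkeys(key: bytes):
    # Derive separate encryption and authentication keys from the master key.
    enc_key = hmac.new(key, b"enc", hashlib.sha256).digest()
    mac_key = hmac.new(key, b"mac", hashlib.sha256).digest()
    return enc_key, mac_key


def encrypt_record(store: KeyStore, key_id: str, user: str, plaintext: bytes) -> bytes:
    enc_key, mac_key = _subkeys(store.get_key(key_id, user))
    nonce = secrets.token_bytes(16)
    ciphertext = xor_bytes(plaintext, keystream(enc_key, nonce, len(plaintext)))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + tag + ciphertext  # only this blob is written to the data store


def decrypt_record(store: KeyStore, key_id: str, user: str, blob: bytes) -> bytes:
    enc_key, mac_key = _subkeys(store.get_key(key_id, user))
    nonce, tag, ciphertext = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("record was modified or the wrong key was used")
    return xor_bytes(ciphertext, keystream(enc_key, nonce, len(ciphertext)))
```

In this sketch, someone who copies the stored blob off the disk, or who can read the volume but has no grant in the key store, gets nothing readable. That is the gap volume-level encryption alone leaves open.
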
Citing a Forrester report that tagged data susceptible to falling through the encryption gap as "toxic data," Warnock said this in a blog post this morning: "Organizations that fail to protect and encrypt this data leave themselves exposed to attacks and possibly even fines. Companies like Stratfor, Sony, and Epsilon - who failed to encrypt toxic data - all took severe hits to their brand and combined lost millions of dollars in potential revenue. But worse still, is these companies all lost the trust of their customers. How do you put a price on that? People will shy away from organizations that aren't trusted stewards of their information. This includes not only the data itself but the histories of their data, application and Web usage. Retroactively trying to protect this data is far more difficult than securing it at the outset. Organizations must consider this before it is too late."