NSA Concedes Hadoop Beats Its Pricey Alternatives

The U.S. has spent hundreds of billions of dollars fighting terrorists. Ironically, however, the best technology available to fight terror likely isn't a line item on the National Security Agency's (NSA) $8 to $10 billion budget. More probably it's Hadoop, which is open source and 100% free.

This is a big deal, because it suggests that technology has finally become democratized. When an interested 10-year-old programmer has access to the same heavy-duty Big Data technology as the budget-rich NSA, we have arrived. And we have open source to thank for it.

Not that the NSA necessarily is using stock Hadoop. As ReadWrite's Brian Proffitt points out, "the agency quite likely has code in its servers that makes Hadoop, Accumulo and a lot of open source NoSQL technology do tricks that commercial users can only dream of."

Maybe.

NSA Concedes: Open Source Is Better

But as The Wall Street Journal points out, the NSA actually turned to Hadoop precisely because it couldn't out-innovate the open-source community. So while the agency may change the Hadoop code to better suit its needs, doing so establishes a fork that diverges from the mainline community code, making it harder for government agencies like the NSA to leverage the apparently superior efforts of the open-source community.

That said, the U.S. government hasn't been content to sit back and wait for the open-source community to build out Hadoop. In-Q-Tel, the Central Intelligence Agency's (CIA) investment arm, is an investor in Hadoop vendor Cloudera. (Disclosure: In-Q-Tel is also an investor in my company, 10gen.)

Private Industry Leads The Way

But the U.S. government needn't worry. With or without Federal investment, Hadoop development is extraordinarily well-funded by private industry in ways that would warm a spy's heart.

Google, after all, inspired Hadoop with a research paper years ago that gave the world a peek into MapReduce. Yahoo! may have taken that work and run with it to actually release Hadoop as an open-source project, but Google's research then and now regularly pushes the industry forward.

Facebook, which claimed in 2011 to have the world's largest Hadoop cluster at 30 petabytes, uses Hadoop to store data and Hive to analyze billions of pieces of content daily, looking for ways to present users with the most relevant content. This process of making inferences about interests and behavior is likely at least as sophisticated as what the NSA does. If Facebook's security is good enough to convince the NSA to hire its chief security officer, it's not unreasonable to assume the social network has something to teach the spy agency about gleaning information about personal relationships with Hadoop, too.

Then there's Apple. Hadoop is the brain behind Apple's Siri, performing the heavy lifting behind Siri's voice-activated artificial intelligence. It's doubtful that the NSA's work on natural language processing and associated information parsing is any more advanced than that of Apple's own gaggle of Hadoop engineers.

The list goes on, from Yahoo!, which arguably runs the world's largest Hadoop cluster, to EMC's Greenplum (now Pivotal), which runs the world's largest publicly available Hadoop cluster. In fact, the NSA actually used EMC's cluster to test and optimize its own NoSQL database, Accumulo. Or if the government wants to leave Silicon Valley and instead talk to one of its big suppliers, General Electric, it might learn a bit about applying Hadoop to the "Internet of Things."

We Are All "In Cahoots" With The NSA

In short, while Google's chief legal officer, David Drummond, insists that Google "is not in cahoots with the NSA," the reality is that everyone in the Hadoop and related open-source communities is. Not by choice or nefarious design, but simply because the open-source community now regularly writes better software than billions of dollars in government money can buy, and agencies like the NSA recognize this.

The next phase is for such government agencies to start participating actively in the communities from which they derive so much benefit. Accumulo is a start, but if the government is serious about pushing the state of the art with Hadoop and other Big Data technologies, it needs to contribute code, not just cash.