Leaks that suggest the NSA is vacuuming up personal information from the phone records and online-service data of U.S. citizens have some people concerned about the prospect of an Orwellian surveillance state that can track our every move.
But hold on a second. More recent revelations suggest that intelligence agencies are using Hadoop and other well-established Big Data business tools to run their analyses. And that suggests there are technical limits on how much the NSA can track, and when.
So let’s run through a brief technical thought experiment. Just how much can the government learn about you using Hadoop?
The Limits Of Hadoop
Based on sources close to the NSA, the Wall Street Journal has reported that the agency is very likely making use of the data gathering and analysis platform Hadoop as well as non-relational database technology — those NoSQL databases you’ve heard tell of.
If this is indeed the case, then right off the bat, you should know this: The NSA is most decidedly not analyzing the information of everyone all at once in real time.
Recall that Hadoop is not really a database, but rather a distributed file storage system, where a lot of data can be stored on inexpensive servers and then analyzed as needed.
Making sense of data within Hadoop requires use of an analytical engine known as MapReduce. MapReduce is very good at what it does and fast for the volume of data it handles, but it is also limited by being a batch processor, meaning that each analytical job has to work through its stored input and run to completion before it returns any results.
That takes time. Also, MapReduce isn’t exactly the easiest tool to master. So if you’re running a search, you’ll need to meticulously craft your code to construct the best search pattern across a lot of potentially varied data.
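To make the batch idea concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce steps. It is not Hadoop's actual API. The phone numbers, the `RECORDS` list and every function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, duration in seconds).
RECORDS = [
    ("555-0101", "555-0199", 60),
    ("555-0101", "555-0142", 30),
    ("555-0123", "555-0199", 45),
    ("555-0101", "555-0199", 15),
]

def map_phase(record):
    """Emit one (key, value) pair per record: the caller and a count of 1."""
    caller, _callee, _duration = record
    yield caller, 1

def shuffle(pairs):
    """Group all values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run the whole batch; no answer exists until every record is mapped."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

calls_per_caller = run_job(RECORDS)
print(calls_per_caller)
```

Note that `run_job` hands back nothing until the full pass is finished. On four records that is instant; on a real cluster's worth of data, the same shape of job means waiting.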
Other analytical methods can use tools such as NoSQL databases to reach into Hadoop, pull out data, and run search queries. These operations can be fast, but usually that's because they're relying on subsets of the entire data set. That means some form of data sampling is going on, which can quickly futz up the accuracy of your search results if you're not careful.
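Here is a small, hypothetical illustration of how sampling can go wrong, assuming a simple "take every Nth record" scheme. The data set, the labels and the function names are invented for the example; real systems sample in more sophisticated ways.

```python
def full_scan_count(records, target):
    """Count every occurrence of target by reading the whole data set."""
    return sum(1 for record in records if record == target)

def sampled_estimate(records, target, step):
    """Estimate the count from every step-th record, scaled back up."""
    sample = records[::step]
    hits = sum(1 for record in sample if record == target)
    return hits * step

# 1,000 hypothetical records, five of which are the rare pattern of interest.
records = ["common"] * 1000
for index in (3, 207, 411, 655, 999):
    records[index] = "rare"

exact = full_scan_count(records, "rare")
estimate = sampled_estimate(records, "rare", step=10)
print(exact, estimate)
```

The sample reads only a tenth of the data, so it is quicker, but in this contrived case it happens to miss every one of the rare records and reports none at all. The rarer the thing you are hunting for, the more a shortcut like this can mislead you.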
Given all that, it’s highly improbable that the NSA can be doing real-time tracking with Hadoop. Not that anyone should find much comfort in that.
Looking For Patterns
Instead of real-time analysis, what the government is doing is something more along the lines of establishing patterns of behavior that will ideally lead to clues about illicit activity — in much the same way private corporations use such data to figure you out through your buying habits.
The corporations watching us don’t know, for instance, that I just bought that candy bar in the convenience store. But over time they may learn that I like a certain brand of chocolate. That sort of information is valuable to them because they could conceivably sell me more, or alternative brands, or some other product that other people who like that chocolate tend to buy.
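As a rough sketch of that kind of after-the-fact pattern building, the snippet below tallies a purchase log and reports each customer's most-bought brand. The log, the brand names and the `min_purchases` threshold are all assumptions made up for this example, not anything a particular retailer is known to use.

```python
from collections import Counter

# Hypothetical purchase log: (customer_id, brand).
PURCHASES = [
    ("c1", "BrandA"), ("c1", "BrandB"), ("c1", "BrandA"),
    ("c2", "BrandB"), ("c1", "BrandA"), ("c2", "BrandB"),
]

def preferred_brands(purchases, min_purchases=2):
    """Return each customer's most-bought brand, if bought often enough."""
    per_customer = {}
    for customer, brand in purchases:
        per_customer.setdefault(customer, Counter())[brand] += 1

    preferences = {}
    for customer, counts in per_customer.items():
        brand, times_bought = counts.most_common(1)[0]
        # A single purchase is not a pattern; require some repetition.
        if times_bought >= min_purchases:
            preferences[customer] = brand
    return preferences

profile = preferred_brands(PURCHASES)
print(profile)
```

Nothing here knows about the purchase happening right now. The preference only emerges once enough history has piled up to count.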
At the end of the day, the NSA, like those companies, has a vested interest in getting things right. It wants to make sure its target is indeed the right one. From a cold-hearted point of view, this makes sense: get the wrong person, and the true criminal — or terrorist — gets away. (That’s a little more serious than a company that tries to sell me the wrong brand of deodorant.)
Hadoop is a good tool to get started on this kind of pattern matching and intelligence gathering. After the NSA builds evidence and observations on terrorists and other criminal organizations, then it’s very likely to use other assets — human or technological — to begin real-time tracking of the suspects.
This may be small comfort to anyone who worries about getting caught up in an NSA dragnet, who fears that its surveillance methods might be abused by agency malcontents, or who thinks surveillance abuse might one day be adopted as a matter of policy.
Souped-Up Hadoop
And let’s not kid ourselves about all of these limitations. It’s a fair bet that there are data scientists at the NSA who really know their way around Hadoop.
Actually, that’s no bet at all. In 2011, the NSA contributed a distributed NoSQL database called Accumulo to the Apache Software Foundation. Accumulo is a key/value database that works with Hadoop, described as a “robust, scalable, high performance data storage and retrieval system.”
So the limitations of Hadoop I’ve just laid out may not actually hold back the NSA, because the agency quite likely has code on its servers that makes Hadoop, Accumulo and a lot of open source NoSQL technology do tricks that commercial users can only dream of.
And, because the Apache License that covers Hadoop is a permissive open source license, any changes the NSA makes to Hadoop source code, or to that of other applications under similar licenses, don’t have to flow back to the main project.
The technology for collecting and sifting through the data of our lives is not omniscient, and it is only as good as the people who run it. It’s good at what it does, and it is getting better. Good as the NSA is, the technological limitations of its Big Data surveillance probably prevent its spies from seeing what you are doing right this very minute. For now. Maybe.