While politicians, pundits, military officials, and journalists assess and debate the fallout from Wikileaks’ release of the “Afghan War Diary” – the legality and ethics of Wikileaks, its impact on the war effort, the rise of the “world’s first stateless news organization” – a number of developers are diving right into the roughly 91,000 classified documents to see what they can do with the data.

And it’s a substantial chunk of data. The documents, dated from 2004 to 2010, are available in HTML, CSV, and SQL formats, along with several KML files. But even in the HTML format, reading through the Afghan War Diary is no easy task. This is no Stephen Ambrose-style presentation of history. It’s raw data that can be queried by type, category, region, affiliation, date, and severity.
Analyzing the Wikileaks Data Dump
Der Spiegel, the Guardian, and the New York Times received the data a month before Wikileaks took it public, and their researchers and journalists have sifted through the information to present their “news” narratives. The Guardian also offers its readers some interactive online tools to help them understand the documents. But now that the information is publicly accessible, the research and analysis of the data is distributed. On his blog Zero Intelligence Agents, for example, NYU Politics Department grad student Drew Conway has begun a statistical analysis of the data. His scripts join similar projects that developers are building and sharing.
Building the Wikileaks CouchApp
One such project is the Wikileaks CouchApp, created by CouchDB community member Benoit Chesneau. The app was built using a number of open source tools, including CouchDB 1.0, GeoCouch, jQuery, Simile Timeline, and OpenLayers, and is integrated with Google Maps. These tools allow the Wikileaks documents, imported into CouchDB from the CSV file, to be categorized and queried by geospatial and temporal data. Scrolling through the Wikileaks CouchApp’s timeline lets you browse the reports by date, and the app plots them on a map below. Clicking on a map point displays a popup, where you can read some information about the report or click through to read it in its entirety.
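The import step described above can be sketched roughly as follows. This is not Chesneau’s actual import code, only a minimal Python illustration of turning CSV rows into JSON documents that carry a GeoJSON point and a date, wrapped in the body that CouchDB’s `_bulk_docs` endpoint accepts. The column names and the two sample rows are invented placeholders, not the real layout or contents of the Wikileaks CSV.

```python
import csv
import io
import json

# Invented placeholder rows with hypothetical column names; the real CSV
# layout may differ and would need its own mapping.
SAMPLE_CSV = """report_key,date,type,category,region,latitude,longitude
A1,2006-03-01 10:00:00,Friendly Action,Patrol,RC EAST,34.5,69.2
B2,2007-07-15 22:30:00,Enemy Action,Direct Fire,RC SOUTH,31.6,65.7
"""


def row_to_doc(row):
    """Turn one CSV row into a JSON-ready document with a GeoJSON point."""
    return {
        "_id": row["report_key"],
        "date": row["date"],
        "type": row["type"],
        "category": row["category"],
        "region": row["region"],
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {
            "type": "Point",
            "coordinates": [float(row["longitude"]), float(row["latitude"])],
        },
    }


def bulk_docs_payload(csv_text):
    """Build the {"docs": [...]} body expected by POST /<db>/_bulk_docs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {"docs": [row_to_doc(row) for row in reader]}


payload = bulk_docs_payload(SAMPLE_CSV)
body = json.dumps(payload)  # would be sent as the HTTP request body
```

Storing the location as a point and the date as a plain field is one way to make the same document usable by both a spatial index and a date-keyed view; other layouts would work as well.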
Why CouchDB?
CouchDB is a post-relational document database. Unlike relational databases with their strict schemas, CouchDB is more flexible, storing data in a semi-structured fashion and using a JavaScript-based view model for generating report results. This flexibility allows users to make queries on demand, rather than being, in the words of CouchDB creator Damien Katz, restricted to “however somebody else cooked the database up.” You can do more with the data in CouchDB, Katz argues, because you can write your own queries, including full-text search queries.
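To give a feel for that view model, here is a rough sketch in Python. A CouchDB view lives in a design document as a JavaScript map function (stored as a string), optionally paired with a reduce such as the built-in `_count`. The design document name `reports`, the view name `by_type`, and the `type` field are assumptions made for illustration, and `emulate_view` is only a simplified local stand-in for what such a view computes, not CouchDB’s own engine.

```python
import json

# Design document holding one view. The map function is JavaScript source
# kept as a string. The names "reports" and "by_type" and the "type" field
# are illustrative assumptions.
design_doc = {
    "_id": "_design/reports",
    "views": {
        "by_type": {
            "map": "function (doc) { if (doc.type) { emit(doc.type, 1); } }",
            "reduce": "_count",
        }
    },
}


def emulate_view(docs):
    """Simplified local stand-in for the view above: emit (type, 1) for each
    document that has a type, sort the rows by key, then count per key."""
    rows = sorted((doc["type"], 1) for doc in docs if doc.get("type"))
    counts = {}
    for key, value in rows:
        counts[key] = counts.get(key, 0) + value
    return counts


# Made-up documents, only to exercise the function.
sample_docs = [
    {"_id": "A1", "type": "Friendly Action"},
    {"_id": "B2", "type": "Enemy Action"},
    {"_id": "C3", "type": "Enemy Action"},
    {"_id": "D4"},  # no type field, so the map function emits nothing
]
counts = emulate_view(sample_docs)
design_json = json.dumps(design_doc)  # body for saving the design document
```

Because documents need no fixed schema, a record lacking the field is simply skipped by the map function, and a new view can be added later without touching the stored data.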
But it’s not just the flexibility of CouchDB that makes it an interesting choice for a Wikileaks database. CouchDB is a peer-based distributed database system. In other words, any number of CouchDB hosts – both servers and offline clients – can have independent replica copies of the same database. These copies can be fully interactive with the ability to query, add, edit, and delete, and changes to the database can be replicated across the mirror copies in near real-time.
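Creating such a mirror is a matter of asking a CouchDB server to replicate from one database to another. The sketch below builds, but does not send, a request to CouchDB’s `_replicate` endpoint. The host names and database names are hypothetical placeholders, and the target database is assumed to exist already.

```python
import json
import urllib.request


def replication_request(couch_url, source, target, continuous=True):
    """Build a POST to <couch_url>/_replicate that copies source into target.

    With continuous=True the server keeps the copy in sync as changes arrive.
    """
    body = json.dumps(
        {"source": source, "target": target, "continuous": continuous}
    ).encode("utf-8")
    return urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical addresses: a remote public copy pulled into a local mirror.
req = replication_request(
    "http://localhost:5984",
    source="http://couch.example.org/wikileaks",
    target="wikileaks_mirror",
)
# urllib.request.urlopen(req) would actually start the replication; it is
# left out so the sketch runs without a CouchDB server.
```

Each mirror created this way is a full database in its own right, so it can be queried and edited locally and then replicated onward to other hosts.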
For businesses using CouchDB, the ability to reliably synchronize databases between multiple machines can provide better redundancy and aid load balancing. And in the case of the Afghan War Diary CouchDB app, it means that mirrored copies could make it far harder to shut Wikileaks down. Currently the app is hosted on the CouchDB server, and while copies have been replicated, neither Katz nor Chesneau knows of any other publicly available copies.
Katz calls CouchDB the “information dissemination platform of the future.” Touting its scalability and flexibility, as well as its rigorous security features, Katz thinks the entire Wikileaks site, not just this app, should move to CouchDB. As the US military demands Wikileaks “return all documents” and some call for the organization’s Swedish ISP to shut the site down, it remains to be seen what sorts of technical steps Wikileaks will take.
Tech Tools for a Data-Driven Future
As with any large dataset, the Wikileaks documents provide rich raw material for building analytical and visualization tools. But the Wikileaks data – its content, the means by which it was secured and disseminated – remains highly controversial. Noting the risks involved with “possessing” the Wikileaks documents, PhD candidate Drew Conway still chose to move forward with his analysis of the data, arguing that “with the proper analytical tools, this data may reveal insights to the predicates of conflict in ways that previous aggregate-level data could not.”
That desire to analyze, visualize, and disseminate information seems to be the motive behind several of the new Wikileaks tools, including the CouchApp. But that desire – as well as the need – should encourage the development of new tech tools, crucial if we are to make sense of “the coming data explosion.”