Data collected from customers is routinely anonymized and then sold or otherwise disseminated for research purposes. But does anonymization work? One particularly high-profile case was Netflix’s release of its customer data as part of its machine learning algorithm contest. According to Forbes’ firewall blog, researchers were able to de-anonymize some of this data by cross-referencing it with sources such as IMDB comments. Netflix wound up canceling its later contest.
But is this sort of re-identification practical, and does it make anonymizing data a pointless endeavor? No, says a recent report by Ontario Information & Privacy Commissioner Ann Cavoukian, Ph.D., and Khaled El Emam, Ph.D., of the University of Ottawa. The report acknowledges the potential to re-identify data in some cases, but emphasizes that de-identification is an important means of safeguarding privacy while enabling valuable research in areas such as medicine.
The report says that although 100% guarantees of anonymity do not exist, that shouldn’t deter organizations from doing what they can to reduce risks via de-identification.
The report also argues that re-identification is a very difficult task and that, despite a couple of high-profile cases (such as the Netflix data and tk), anonymized data is rarely de-anonymized successfully. Here’s an excerpt:
For example, a recent study undertaken for the U.S. Department of Health and Human Services’ Office of the National Coordinator for Health Information Technology (“ONC”) sheds some light on the likelihood of a successful attack on properly deidentified data. The ONC assembled a team of statistical experts to assess whether data properly de-identified under HIPAA could be combined with readily available outside data, to re-identify patients. The study was performed under realistic conditions and the re-identifications were verified to be accurate – something that other studies of this nature generally lack. The team began with a set of approximately 15,000 patient records that had been de-identified in accordance with HIPAA. Next, they sought to match the de-identified records with identifiable records in a commercial data repository. They conducted extensive searches through commercial data sources (e.g. InfoUSA) to determine whether any of the records in the identified commercial data would align with the records in the de-identified data set. The team was able to accurately re-identify only two of the 15,000 individuals, for a match rate of 0.013 per cent. This is an extremely low re-identification risk!
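To make the excerpt's numbers concrete, here is a minimal Python sketch of the kind of linkage it describes: matching de-identified records against identifiable ones on shared attributes, then computing a match rate. This is an illustration only, not the ONC team's actual method. The field names (birth_year, zip3, sex), the sample records, and the helper functions are all hypothetical and made up for this example.

```python
from collections import defaultdict


def unique_matches(deidentified, identified, keys):
    """Link records that agree on `keys` and are unique on both sides.

    A toy linkage rule (an assumption for this sketch): a de-identified
    record counts as re-identified only when exactly one de-identified
    record and exactly one identified record share the same key values.
    """
    def index(records):
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[k] for k in keys)].append(rec)
        return groups

    deid_groups = index(deidentified)
    id_groups = index(identified)

    links = []
    for key, deid_recs in deid_groups.items():
        id_recs = id_groups.get(key, [])
        if len(deid_recs) == 1 and len(id_recs) == 1:
            links.append((deid_recs[0], id_recs[0]))
    return links


def match_rate_percent(matches, total):
    """Share of records re-identified, as a percentage."""
    return 100.0 * matches / total


# Made-up records; no real people or real data sets.
deidentified = [
    {"birth_year": 1970, "zip3": "902", "sex": "F", "diagnosis": "A"},
    {"birth_year": 1985, "zip3": "100", "sex": "M", "diagnosis": "B"},
    {"birth_year": 1985, "zip3": "100", "sex": "M", "diagnosis": "C"},
]
identified = [
    {"name": "Person 1", "birth_year": 1970, "zip3": "902", "sex": "F"},
    {"name": "Person 2", "birth_year": 1985, "zip3": "100", "sex": "M"},
]

links = unique_matches(deidentified, identified, ["birth_year", "zip3", "sex"])
print(len(links))  # 1: only the 1970 record is unique on both sides

# The figure quoted in the excerpt: 2 of 15,000 records re-identified.
print(round(match_rate_percent(2, 15000), 3))  # 0.013
```

In this toy data, the two 1985 records share the same attributes, so neither can be tied to a single named person. That ambiguity is what de-identification aims to create on purpose by removing or coarsening fields.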
The report also cites some of El Emam’s prior research, which found that it was possible to re-identify less than 0.5% of the individuals in a large database of medical records, and notes that identifying each one (often aided by social network data available online) took hours of work by skilled technicians.
Others have argued that re-identifying data is actually quite easy. In 2007, Bruce Schneier rounded up several studies by researchers who were able to identify large percentages of people from data sets. His conclusion was that anonymized data sets should be subjected to adversarial attacks before being released to the public.
As both scientific organizations and private enterprises seek to learn more by mining big data sets, protecting individual privacy is becoming a hotter issue. This debate is one to watch.
Photo by Circo de Invierno