Argument ‘Misleading’

Risks of Data Re-Identification Overstated, Report Says

The “myth” that data sets cannot be de-identified has been perpetuated by individuals misrepresenting and overstating the claims of prominent research showing instances of re-identification, said a joint report released Monday by Ontario Privacy Commissioner Ann Cavoukian and the Information Technology and Innovation Foundation (ITIF) (http://bit.ly/1kYvlPr). But in interviews Monday, other researchers took exception to how the report discussed their own work, saying it misrepresented findings and ignored facts. This is the first in a two-part story on the report.

"We're hoping we're really going to be able to change the conversation,” Daniel Castro, head of ITIF’s Center for Data Innovation, told us. The pervasive belief that data can’t be de-identified hinders a desire to improve data de-identification techniques through National Institute of Standards and Technology (NIST) standards and congressional legislation, Castro said.

But the paper builds that argument on facts and research presented without context, several researchers said. “Misleading at best and completely irresponsible at worst,” said Arvind Narayanan, a computer science assistant professor at Princeton, of the paper’s characterization of his work, which was cited in the report.

The report surveyed commonly cited re-identification research -- FTC Chief Technologist Latanya Sweeney’s work with Census data, recent work on mobility data (http://bit.ly/194eDYY), a 2008 report on re-identifying Netflix users (http://bit.ly/SOeFym) and a 2012 global data mining competition (http://bit.ly/1i5rktd). There has been “a tendency on the part of commentators on that literature to overstate the findings,” the report said, citing numerous media stories.

Take Sweeney’s work -- all conducted before she came to the FTC. She notably used 1990 U.S. Census data to show that three data points -- gender, date of birth and ZIP code -- could, given a complete database, uniquely identify 87 percent of U.S. citizens (http://bit.ly/1yai8bn). The report discussed this but said “it is important to note” that when Stanford researchers repeated the exercise using 2010 Census data, they found those three data points could uniquely identify 63 percent of the U.S. population. “Different de-identification methods result in vastly different outcomes, especially when more recent efforts are compared to earlier efforts at de-identification,” the report said.
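The uniqueness figures at issue come from a simple calculation: group a population table by the quasi-identifiers and count the share of people who are alone in their group. The following is a minimal sketch of that calculation only; the synthetic “census” table, its column names and its output are assumptions for illustration, not the data or code behind the studies cited.

import numpy as np
import pandas as pd

# Hypothetical sketch: synthetic population table, not actual Census data.
rng = np.random.default_rng(0)
n = 100_000
census = pd.DataFrame({
    "gender": rng.choice(["F", "M"], n),
    "birth_date": pd.Timestamp("1950-01-01")
                  + pd.to_timedelta(rng.integers(0, 365 * 60, n), unit="D"),
    "zip": rng.integers(10000, 100000, n).astype(str),
})

def unique_fraction(df, quasi_identifiers):
    # Share of rows whose combination of quasi-identifier values occurs exactly once.
    sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return (sizes == 1).mean()

print(unique_fraction(census, ["gender", "birth_date", "zip"]))

With real population data, the same calculation produces the kind of percentages Sweeney’s paper and the report debate; the numbers from this made-up table are only illustrative.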

"Which of these models is best, or even if another model is more accurate, is not the public takeaway,” Sweeney told us. “The policy implication is that regardless, the number of people being unique at these percentages poses real concerns for sharing those values widely.” Both Sweeney’s research and the report noted reducing the number of data points used -- replacing date of birth with only the month and year, for example -- drastically reduces the uniquely identifiable percentage to under 5 percent. “This was the tack taken” when the Health Insurance Portability and Accountability Act privacy rules were being created, Sweeney said.

These and other privacy standards are not given proper credit in public discourse, the report said. “Existing de-identification standards err on the side of privacy when it comes to balancing the utility of the data and the risk of re-identification.” And the government is not pushing hard enough to strengthen these standards, Castro said, pointing to the White House big data report’s implication that data can’t truly be de-identified (WID May 2 p1). NIST and the Commerce Department should produce recommended de-identification guidelines, “just like they do with cryptography,” to create “best practices” that are “based on the current state of knowledge,” said Castro.

"The key issue ignored by the ITIF report is that ‘de-identified’ data, once publicly released, stays public forever,” said Vitaly Shmatikov, an associate computer science professor at the University of Texas at Austin and co-author of the Netflix study. “Meanwhile, powerful new re-identification algorithms and rich new sources of information about individuals continue to emerge, putting the data at risk tomorrow, next year and forever into the future.”