AOL has recently been embarrassed after it released data on the searches performed by 658,000 subscribers. The subscribers’ names had been replaced by numbers, but this was not enough to stop personal information from leaking. The AOL folks just didn’t understand that protecting data by de-identification is hard.
They are not alone. An NHS document obtained under the Freedom of Information Act describes how officials are building a “Secondary Uses Service” which will contain large amounts of personal health information harvested from hospital and other records. It’s proposed that ever-larger numbers of people will have access to this information as it is progressively de-identified. It seems that officials are just beginning to realise how difficult it will be to protect patient privacy — especially as your de-identified medical record will generally have your postcode. There are only a few houses at each postcode; knowing that, plus a patient’s age, usually tells you whose record it is. The NHS proposes to set up an “Information Governance Board” to think about the problem. Meanwhile, system development steams ahead.
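The postcode-plus-age attack is easy to demonstrate. Here is a minimal sketch, with entirely invented names, postcodes and ages: a “de-identified” record that still carries those two fields can be linked to a person using any auxiliary list of residents, such as the electoral roll.

```python
# Hypothetical illustration of re-identification via quasi-identifiers.
# All data below is invented; no real people or records are involved.

deidentified_records = [
    {"patient_id": 4821, "postcode": "CB3 0FD", "age": 54, "diagnosis": "asthma"},
    {"patient_id": 9310, "postcode": "CB3 0FD", "age": 29, "diagnosis": "diabetes"},
]

# A postcode covers only a handful of households, so an attacker's
# auxiliary list of residents at that postcode is short.
electoral_roll = [
    {"name": "A. Smith", "postcode": "CB3 0FD", "age": 54},
    {"name": "B. Jones", "postcode": "CB3 0FD", "age": 29},
]

def reidentify(records, roll):
    """Link records to names wherever (postcode, age) matches exactly one person."""
    matches = {}
    for rec in records:
        candidates = [p["name"] for p in roll
                      if p["postcode"] == rec["postcode"] and p["age"] == rec["age"]]
        if len(candidates) == 1:  # unique quasi-identifier: record is identified
            matches[rec["patient_id"]] = candidates[0]
    return matches

print(reidentify(deidentified_records, electoral_roll))
```

Both records are re-identified, despite names having been removed: the pair (postcode, age) is unique within the postcode, which is exactly the problem the NHS faces.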
Clearly, the uses and limitations of anonymisation ought to be more widely understood. There’s more on the subject at the American Statistical Association website, on my web page and in chapter 8 of my book.
AOL’s motivation in releasing the data was apparently to encourage researchers to work with real data so as to improve the user experience for searching. Encouraging research is fine; their main error was in not properly controlling how the data was used.
Many researchers are interested in processing traffic data from TCP/IP networks. They hope to better understand the real impact of congestion algorithms, tweaks to network stacks and so on. This data is usually anonymised before being handed over, but some of it can be deanonymised because particular machines will have particular patterns of traffic that can be picked out. UKERNA are tackling this by only releasing traffic data to researchers who have signed an appropriate agreement (“Provision of Traffic Data for Research Use” [not yet published]), one of the terms of which is that the researcher contracts to “not make any attempt to decompile, interfere, manipulate or otherwise take any action in respect of the Data which may have the result of revealing either message content or Personal Data contained in that Data, unless expressly permitted in the Data Specification;”.
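To see why anonymised traces still leak identity, consider this toy sketch (all host names, labels and flows are invented): even after IP addresses are replaced with opaque labels, a machine's traffic pattern, here simply the set of destination ports it talks to, can act as a fingerprint matching it back to a known host.

```python
# Hypothetical sketch of deanonymising an anonymised traffic trace by
# traffic pattern. Invented data throughout; real attacks use richer
# features such as packet sizes, timing and volume.
from collections import defaultdict

# An attacker's prior knowledge: distinctive port profiles of known machines.
known_profiles = {
    "mail-server": frozenset({25, 110, 143}),
    "web-server": frozenset({80, 443}),
}

# Anonymised trace: (opaque host label, destination port) per flow.
anonymised_flows = [
    ("host-A", 80), ("host-A", 443),
    ("host-B", 25), ("host-B", 110), ("host-B", 143),
]

def deanonymise(flows, profiles):
    """Map opaque labels back to known hosts by matching port profiles."""
    ports = defaultdict(set)
    for label, port in flows:
        ports[label].add(port)
    return {label: name
            for label, seen in ports.items()
            for name, profile in profiles.items()
            if frozenset(seen) == profile}

print(deanonymise(anonymised_flows, known_profiles))
```

Here both “anonymous” hosts are unmasked, which is why UKERNA’s agreement forbids any attempt at this kind of analysis rather than relying on the anonymisation alone.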
Assuming, of course, that researchers are honest and law-abiding, this approach permits academic research without infringing people’s privacy. AOL should look around and see how other people have dealt with these issues, and not conclude that releasing data for research is inherently impossible.
Boing Boing has a couple of posts pointing out how this AOL data can be deanonymised:
The Search Engine Confessions of AOL User 23187425
AOL’s user query database has been splunk’d
re:NHS – Secondary Use Service,
One has to ask why they don’t recruit you Ross! or someone under your tuition/supervision.
I certainly made my NHS IG colleagues aware of your work (Chapter 8, Medical records, inference etc), I think a comment from a previous IG director was what do you know Ross! (cough cough), I felt my resignation letter began there!
Anyway it’s politics, delivery is king! for CFH, not what’s right. That’s my experience of 2 years of working there.