This afternoon, the Information Commissioner will unveil a code of practice for data anonymisation. His office is under pressure; as I described back in August, Big Pharma wants all our medical records and has persuaded the Prime Minister it should have access so long as our names and addresses are removed. The theory is that a scientist doing research into cardiology (for example) could have access to the anonymised records of all heart patients.
The ICO’s blog suggests that he will consider data to be anonymous and thus no longer private if they cannot be reidentified by reference to any other data already in the public domain. But this is trickier than you might think. For example, Tim Gowers just revealed on his excellent blog that he had an ablation procedure for atrial fibrillation a couple of weeks ago. So if our researcher can search for all males aged 45-54 who had such a procedure on November 6th 2012, he can pull Tim’s record, including everything that Tim intended to keep private. Even with a central cardiology register, it’s hard to think of a practical mechanism that could block Tim’s record as soon as he made that blog post. And now that researchers are starting to carry round millions of people’s records on their laptops, protecting privacy is getting really hard.
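To make the attack concrete, here is a minimal sketch in Python of the kind of query involved. The record layout and field names are invented for illustration; the point is simply that a few quasi-identifiers gleaned from a public blog post can be enough to single out one “anonymised” record.

```python
# Hypothetical "anonymised" cardiology extract, as might sit on a
# researcher's laptop. Names and addresses have been removed, but the
# fields below are typical quasi-identifiers. (Field names are made up.)
records = [
    {"sex": "M", "age_band": "45-54", "procedure": "ablation",
     "date": "2012-11-06", "history": "everything the patient meant to keep private"},
    {"sex": "F", "age_band": "65-74", "procedure": "ablation",
     "date": "2012-11-06", "history": "..."},
]

def matches(r):
    # Quasi-identifiers learned from the blog post, not from the dataset.
    return (r["sex"] == "M" and r["age_band"] == "45-54"
            and r["procedure"] == "ablation" and r["date"] == "2012-11-06")

candidates = [r for r in records if matches(r)]
if len(candidates) == 1:
    # A unique match means the record is re-identified, together with
    # everything else it contains.
    print("Re-identified:", candidates[0]["history"])
```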
In his role as data protection regulator, the Commissioner has been eager to disregard the risk of re-identification using information that is not in the public domain. Yet Maurice Frankel of the Campaign for Freedom of Information has pointed out to me that he regularly applies a very different rule in Freedom of Information cases, including one involving the University of Cambridge. There, he refused a freedom of information request about university dismissals on the grounds that “friends, former colleagues, or acquaintances of a dismissed person may, through their contact with that person, know something of the circumstances of that person’s departure” (see para 30).
So I will be curious to see this afternoon whether the Commissioner places greater value on the consistency of his legal rulings, or their convenience to the powerful.
“The ICO’s blog suggests that he will consider data to be anonymous and thus no longer private if they cannot be reidentified by reference to any other data already in the public domain…”
Aha! Already a (deliberate?) loophole! What about other data that subsequently becomes available to a given pharma corporation (and not necessarily to the general public)?
Aggregation is far too slippery a slope to set foot on.
Well, although it is far from clear to what extent the data can be anonymised while still keeping its value for researchers, the main problem is that one can never be confident that Tim’s record cannot be re-identified, let alone for the Information Commissioner to pass a code which pretends that it cannot.
The code is now online (though if you’re not a Microsoft user you might have to rename the .ashx download as a .pdf to read it).
The ICO has failed to take on board many of the points made by FIPR and others during the consultation. The average individual’s privacy set – the set of people I’d rather not know all the embarrassing facts about me – is precisely the set of my family and friends, and they all probably know some non-public context. In Tim’s case above, this would include the fact that he had an ablation procedure on November 6th 2012. So if you publish “anonymised” records, and exactly one of them records such a procedure, then all his friends (and everyone who reads his blog) can identify his record and associate him with everything else in it. Yet the consideration of personal knowledge as a re-identification vector is deprecated at p 25. Worse, at p 26 the ICO shifts the burden of proof: “There must be a plausible and reasonable basis for non-recorded personal knowledge to be considered to present a significant re-identification risk.” This of course begs the question of what is (or will in future be) recorded, and where.
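For what it’s worth, checking a release for exactly this failure mode is straightforward: count how many records share each combination of quasi-identifiers and flag any combination that occurs only once (the k-anonymity idea, with k = 1). A rough sketch, again with hypothetical field names:

```python
from collections import Counter

# Hypothetical quasi-identifier fields in an "anonymised" extract.
QUASI_IDENTIFIERS = ("sex", "age_band", "procedure", "date")

def unique_records(records):
    # Count how many records share each combination of quasi-identifiers.
    counts = Counter(tuple(r[f] for f in QUASI_IDENTIFIERS) for r in records)
    # A count of 1 means anyone who already knows those facts about a person
    # (a friend, a colleague, or a blog reader) can pick out that record.
    return [r for r in records
            if counts[tuple(r[f] for f in QUASI_IDENTIFIERS)] == 1]
```

Any record returned by unique_records() is exposed to precisely the kind of personal-knowledge attack that the code of practice plays down.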
And then there’s technology: the assumption (p 25) that our individual genotypes are private may not be true in 20 years’ time, if the Wellcome Trust has its wicked way with us.
Perhaps the biggest failing concerns transparency. Despite the row over CPRD, the ICO will only go as far as to say on p 40 that an organisation should “explain why you anonymise individuals’ personal data and describe in general terms the techniques that will be used to do this.” It does tell organisations to be open about risks and about their reasoning processes, warning that “excessive secrecy is likely to generate public distrust and suspicion” (p 41); but this very phrasing accepts the Whitehall assumption that secrecy about anonymisation methods is a good thing. Of course it isn’t; it’s designed to minimise the risk that officials will be criticised, rather than the privacy risk to data subjects.
It’s also sad to see the old canard about outsourcing the risk to a “trusted third party”. Twenty years on from the start of the crypto wars, this should instantly ring an alarm bell. When government wants to do something dodgy with our information, it outsources the job to a company, so that when things go wrong it’s not ministers’ fault.
Much of the rest is as we might have expected. The ICO’s foreword (pdf p 4) signals the UK government’s concern that the new data protection regulation might remove the loophole Britain has used up till now (our failure to implement recital 26 and our disregard for the Article 29 working party’s definition of personal data); the UK has seen to it that articles 81 and 83 of the draft regulation give a comparable loophole for research and for health administration. In one sense the code of practice is theatre, as the NHS will continue to share vast amounts of identifiable information without patient consent; in another sense it’s serious, as once the Secretary of State for Justice approves it, it will become significantly harder to sue a company that followed it and still violated your privacy. The code cannot remove your rights under the ECHR and the Human Rights Act, but it ignores this and even claims (p 35) that “if a disclosure is compliant with the DPA it is likely to be compliant with the HRA”, without mentioning the common cases (such as healthcare) where this is emphatically not so.
Many minor nits could be picked. For example, “inference control” generally refers to the whole field of statistical security, not just to one technique, as suggested on p 52. Annex 1 is quite unrealistic too. In case study 11 they display a lack of access to expert advice by talking about using one-way hash functions yet deprecating the use of older hash functions; while MD5 has findable collisions, it is still one-way, and an expert would have told them this. There is no mention anywhere of active attacks. And there are proofreading failures too: for example, on p 13 “This means that the DPA…” is duplicated, and there is confusion about whether “data” is a singular or plural noun.
Overall assessment: some improvement over the last draft but it’s still unsatisfactory.