Extending transparency, and happy birthday to the archive

I was delighted by two essays by Anton Howes on The Replication Crisis in History Open History. We computerists have long had an open culture: we make our publications open, as well as sharing the software we write and the data we analyse. My work on security economics and security psychology has taught me that this culture is not yet as well-developed in the social sciences. Yet we do what we can. Although we can’t have official conference proceedings for the Workshop on the Economics of Information Security – as then the economists would not be able to publish their papers in journals afterwards – we found a workable compromise by linking preprints from the website and from a liveblog. Economists and psychologists with whom we work have found their citation counts and h-indices boosted by our publicity mechanisms; they have incentives to learn.

A second benefit of transparency is reproducibility, the focus of Anton’s essay. Scholars are exposed to many temptations, which vary by subject matter, but are more tempting when it’s hard for others to check your work. Mathematical proofs should be clear and elegant but are all too often opaque or misleading; software should be open-sourced for others to play with; and we do what we can to share the data we collect for research on cybercrime and abuse.

Anton describes how more and more history books are found to have weak foundations, where historians quote things out of context, ignore contrary evidence, and elaborate myths and false facts into misleading stories that persist for decades. How can history correct itself more quickly? The answer, he argues, is Open History: making as many sources publicly available as possible, just like we computerists do.

As it happens, I scanned a number of old music manuscripts years ago to help other traditional music enthusiasts, but how can this be done at scale? One way forward comes from my college’s Archives Centre, which holds the personal papers of Sir Winston Churchill as well as other politicians and a number of eminent scientists. There the algorithm is that when someone requests a document, it’s also scanned and put online; so anything Alice looked at, Bob can look at too. This has raised some interesting technical problems around indexing and long-term archiving which I believe we have under control now, and I’m pleased to say that the Archives Centre is now celebrating its 50th anniversary.

It would also be helpful if old history books were as available online as they are in our library. Given that the purpose of copyright law is to maximise the amount of material that’s eventually available to all, I believe we should change the law to make continued copyright conditional on open access after an initial commercial period. Otherwise our historians’ output vanishes from the time that their books come off sale, to the time copyright expires maybe a century later.

My own Security Engineering book may show the way. With both the first edition in 2001 and the second edition in 2008, I put six chapters online for free at once, then released the others four years after publication. For the third edition, I negotiated an agreement with the publishers to put the chapters online for review as I wrote them. So the book came out by instalments, like Dickens’ novels, from April 2019 to September 2020. On the first of November 2020, all except seven sample chapters disappeared from this page for a period of 42 months; I’m afraid Wiley insisted on that. But after that, the whole book will be free online forever.

This also makes commercial sense. For both the 2001 and 2008 editions, paid-for sales of paper copies increased significantly after the whole book went online. People found my book online, liked what they saw, and then bought a paper copy rather than just downloading it all and printing out a thousand-odd pages. Open access after an exclusive period works for authors, for publishers and for history. It should be the norm.

How to Spread Disinformation with Unicode

There are many different ways to represent the same text in Unicode. We’ve previously exploited this encoding-visualization gap to craft imperceptible adversarial examples against text-based machine learning systems and invisible vulnerabilities in source code.

In our latest paper, we demonstrate another attack that exploits the same technique to target Google Search, Bing’s GPT-4-powered chatbot, and other text-based information retrieval systems.

Consider a snake-oil salesman trying to promote a bogus drug on social media. Sensible users would do a search on the alleged remedy before ordering it, and sites containing false information would normally be drowned out by genuine medical sources in modern search engine rankings. 

But what if our huckster uses a rare Unicode encoding to replace one character in the drug’s name on social media? If a user pastes this string into a search engine, it will throw up web pages with the same encoding. What’s more, these pages are very unlikely to appear in innocent queries.

The upshot is that an adversary who can manipulate a user into copying and pasting a string into a search engine can control the results seen by that user. They can hide such poisoned pages from regulators and others who are unaware of the magic encoding. These techniques can empower propagandists to convince victims that search engines validate their disinformation.

The Pre-play Attack in Real Life

Recently I was contacted by a Falklands veteran who was a victim of what appears to have been a classic pre-play attack; his story is told here.

Almost ten years ago, after we wrote a paper on the pre-play attack, we were contacted by a Scottish sailor who’d bought a drink in a bar in Las Ramblas in Barcelona for €33, and found the following morning that he’d been charged €33,000 instead. The bar had submitted ten transactions an hour apart for €3,300 each, and when we got the transaction logs it turned out that these transactions had been submitted through three different banks. What’s more, although the transactions came from the same terminal ID, they had different terminal characteristics. When the sailor’s lawyer pointed this out to Lloyds Bank, they grudgingly accepted that it had been technical fraud and refunded the money.

In the years since then, I’ve used this as a teaching example both in tutorial talks and in university lectures. A payment card user has no trustworthy user interface, so the PIN entry device can present any transaction, or series of transactions, for authentication, and the customer is none the wiser. The mere fact that a customer’s card authenticated a transaction does not imply that the customer mandated that payment.

Payment by phone should eventually fix this, but meantime the frauds continue. They’re particularly common in nightlife establishments, both here and overseas. In the first big British case, the Spearmint Rhino in Bournemouth had special conditions attached to its license for some time after a series of frauds; a second case affected a similar establishment in Soho; there have been others. Overseas, we’ve seen cases affecting UK cardholders in Poland and the Baltic states. The technical modus operandi can involve a tampered terminal, a man-in-the-middle device or an overlay SIM card.

By now, such attacks are very well-known and there really isn’t any excuse for banks pretending that they don’t exist. Yet, in this case, neither the first responder at Barclays nor the case handler at the Financial Ombudsman Service seemed to understand such frauds at all. Multiple transactions against one cardholder, coming via different merchant accounts, and with delay, should have raised multiple red flags. But the banks have gone back to sleep, repeating the old line that the card was used and the customer PIN was entered, so it must all be the customer’s fault. This is the line they took twenty years ago when chip and pin was first introduced, and indeed thirty years ago when we were suffering ATM fraud at scale from mag-strip copying. The banks have learned nothing, except perhaps that they can often get away with lying about the security of their systems. And the ombudsman continues to claim that it’s independent.

Will GPT models choke on their own exhaust?

Until about now, most of the text online was written by humans. But this text has been used to train GPT3(.5) and GPT4, and these have popped up as writing assistants in our editing tools. So more and more of the text will be written by large language models (LLMs). Where does it all lead? What will happen to GPT-{n} once LLMs contribute most of the language found online?

And it’s not just text. If you train a music model on Mozart, you can expect output that’s a bit like Mozart but without the sparkle – let’s call it ‘Salieri’. And if Salieri now trains the next generation, and so on, what will the fifth or sixth generation sound like?

In our latest paper, we show that using model-generated content in training causes irreversible defects. The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.

Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.

After we published this paper, we noticed that Ted Chiang had already commented on the effect in February, noting that ChatGPT is like a blurry jpeg of all the text on the Internet, and that copies of copies get worse. In our paper we work through the math, explain the effect in detail, and show that it is universal.

This does not mean that LLMs have no uses. As one example, we originally called the effect model dementia, but decided to rename it after objections from a colleague whose father had suffered dementia. We couldn’t think of a replacement until we asked Bard, which suggested five titles, of which we went for The Curse of Recursion.

So there we have it. LLMs are like fire – a useful tool, but one that pollutes the environment. How will we cope with it?

2023 Workshop on the Economics of Information Security

WEIS 2023, the 22nd Workshop on the Economics of Information Security, will be held in Geneva from July 5-7, with a theme of Digital Sovereignty. We now have a list of sixteen accepted papers; there will also be three invited speakers, ten posters, and ten challenges for a Digital Sovereignty Hack on July 7-8.

The deadline for early registration is June 10th, and we have discount hotel bookings reserved until then. As Geneva gets busy in summer, we suggest you reserve your room now!

Interop: One Protocol to Rule Them All?

Everyone’s worried that the UK Online Safety Bill and the EU Child Sex Abuse Regulation will put an end to end-to-end encryption. But might a law already passed by the EU have the same effect?

The Digital Markets Act ruled that users on different platforms should be able to exchange messages with each other. This opens up a real Pandora’s box. How will the networks manage keys, authenticate users, and moderate content? How much metadata will have to be shared, and how?

In our latest paper, One Protocol to Rule Them All? On Securing Interoperable Messaging, we explore the security tensions, the conflicts of interest, the usability traps, and the likely consequences for individual and institutional behaviour.

Interoperability will vastly increase the attack surface at every level in the stack – from the cryptography up through usability to commercial incentives and the opportunities for government interference.

Twenty-five years ago, we warned that key escrow mechanisms would endanger cryptography by increasing complexity, even if the escrow keys themselves can be kept perfectly secure. Interoperability is complexity on steroids.

Bugs still considered harmful

A number of governments are trying to mandate surveillance software in devices that support end-to-end encrypted chat; the EU’s CSA Regulation and the UK’s Online Safety bill being two prominent current examples. Colleagues and I wrote Bugs in Our Pockets in 2021 to point out what was likely to go wrong; GCHQ responded with arguments about child protection, which I countered in my paper Chat Control or Child Protection.

As lawmakers continue to discuss the policy, the latest round in the technical argument comes from the Rephrain project, which was tasked with evaluating five prototypes built with money from GCHQ and the Home Office. Their report may be worth a read.

One contender looks for known-bad photos and videos with software on both client and server, and is the only team with access to CSAM for training or testing (it has the IWF as a partner). However it has inadequate controls both against scope creep, and against false positives and malicious accusations.

Another is an E2EE communications tool with added profanity filter and image scanning, linked to age verification, with no safeguards except human moderation at the reporting server.

The other three contenders are nudity detectors with various combinations of age verification or detection, and of reporting to parents or service providers.

None of these prototypes comes close to meeting reasonable requirements for efficacy and privacy. So the project can be seen as empirical support for the argument we made in “Bugs”, namely that doing surveillance while respecting privacy is really hard.

Security economics course

Back in 2015 I helped record a course in security economics in a project driven by colleagues from Delft. This was launched as an EDX MOOC as well as becoming part of the Delft syllabus, and it has been used in many other courses worldwide. In Brussels, in December, a Ukrainian officer told me they use it in their cyber defence boot camp.

There’s been a lot of progress in security economics over the past seven years; see for example the liveblogs of the workshop on the economics of information security here. So it’s time to update the course, and we’ll be working on that between now and May.

If there are any topics you think we should cover, or any bugs you’d like to report, please get in touch!