Category Archives: Academic papers

Evil Searching

Tyler Moore and I have been looking into how phishing attackers locate insecure websites on which to host their fake webpages, and our paper is being presented this week at the Financial Cryptography conference in Barbados. We found that compromised machines accounted for 75.8% of all the attacks, “free” web hosting for a further 17.4%, and the remainder was the work of various specialist gangs. Those gangs should not be ignored, though: they’re sending most of the phishing spam and (probably) scooping most of the money!

Sometimes the same machine gets compromised more than once. This could be the same person setting up multiple phishing sites on a machine that they can attack at will… However, we often observe that the new site is in a completely different directory — strongly suggesting that a different attacker has broken into the same machine, but in a different way. We looked at all the recompromises where there was a delay of at least a week before the second attack and found that in 83% of cases a different directory was used… and using this definition of a “recompromise” we found that around 10% of machines were recompromised within 4 weeks, rising to 20% after six months. Given that there are plenty of vulnerable machines out there, this suggests there is something slightly different about the machines that get attacked again and again.
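
To make that heuristic concrete, here is a minimal sketch of the recompromise test described above: treat a second report on the same host as a fresh compromise only if it arrives at least a week later and the phishing page sits in a different directory. The function name and record format are mine, for illustration only, and are not the code used in the paper.

```python
from datetime import timedelta
from urllib.parse import urlparse

def looks_like_recompromise(first_url, first_seen, second_url, second_seen,
                            min_gap=timedelta(days=7)):
    """Heuristic from the text: a later report on the same host counts as a
    fresh compromise if it appears at least a week after the first one AND
    the phishing page lives in a different directory."""
    if second_seen - first_seen < min_gap:
        return False
    first_dir = "/".join(urlparse(first_url).path.split("/")[:-1])
    second_dir = "/".join(urlparse(second_url).path.split("/")[:-1])
    return first_dir != second_dir
```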

For 2486 sites we also had summary website logging data from The Webalizer, because those sites had left their daily visitor statistics world-readable. One of the bits of data The Webalizer reports is which search terms were used to locate the website: these can be recovered because, when someone clicks through from a results page, the browser passes the search engine’s URL, query and all, in the “Referer” header, which records what was typed into engines such as Google.
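
As a small aside on the mechanics (a sketch only; the helper function and the example URL below are invented for illustration, not taken from our data), pulling the search phrase back out of such a Referer URL is straightforward:

```python
from urllib.parse import urlparse, parse_qs

def search_phrase(referer):
    """Return the search phrase embedded in a search-engine Referer URL,
    or None if no recognisable query parameter is present."""
    query = parse_qs(urlparse(referer).query)
    # Google puts the query in the "q" parameter; other engines use
    # different parameter names, so try a few common ones.
    for key in ("q", "query", "p"):
        if key in query:
            return query[key][0]
    return None

# Hypothetical click-through from a results page:
print(search_phrase("http://www.google.com/search?q=some+vulnerable+version"))
```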

We found that some of these searches were “evil” in that they were looking for specific versions of software that contained security vulnerabilities (“If you’re running version 1.024 then I can break in”); or they were looking for existing phishing websites (“if you can break in, then so can I”); or they were seeking the PHP “shells” that phishing attackers often install to help them upload files onto the website (“if you haven’t password protected your shell, then I can upload files as well”).

In all, we found “evil searches” on 204 machines that hosted phishing websites, and in the vast majority of cases these searches corresponded in time to when the website was broken into. Furthermore, in 25 cases the website was compromised twice and we were monitoring the daily log summaries after the first break-in: here 4 of the evil searches occurred before the second break-in, 20 on the day of the second break-in, and just one afterwards. Of course, where people didn’t “click through” from Google search results, perhaps because they were using an automated tool, we won’t have a record of their searches — but nevertheless, even at the 18% incidence we can be sure of, searching is clearly an important mechanism.

The recompromise rates for sites where we found evil searches were a lot higher: 20% recompromised after 4 weeks, and nearly 50% after six months. There are lots of complicating factors here, not least that sites with world-readable Webalizer data might simply be inherently less secure. However, overall we believe the data clearly indicates that phishing attackers are using search to find machines to attack, and that if one attacker can find a site, then others are likely to find it independently.

There’s a lot more in the paper itself (which is well worth reading before commenting on this article, since it goes into much more detail than is possible here)… In particular, we show that publishing URLs in PhishTank slightly decreases the recompromise rate (getting the sites fixed is a bigger effect than the bad guys locating sites that someone else has compromised); and we also have a detailed discussion of various mitigation strategies that might be employed, now that we have firmly established that “evil searching” is an important way of locating machines to compromise.

Security issues in ubiquitous computing

I have written the security chapter for a multi-author volume on ubiquitous computing that will be published by Springer later this year. For me it was an opportunity to pull together some of the material I have been collecting for a possible second edition of my 2002 book on Security for Ubiquitous Computing—but of course a 30-page chapter can be nothing more than a brief introduction.

Anyway, here is a “release candidate” copy of the chapter, which will ship to the book editors in a couple of weeks. Comments are welcome, either on the chapter itself or, based on this preview, on what you’d like me to discuss in my own full-length book when I yield to the repeated pleas of John Wiley & Sons and sit down to write a new edition.

Forensic genomics

I recently presented a paper on Forensic genomics: kin privacy, driftnets and other open questions (co-authored with Lucia Bianchi, Pietro Liò and Douwe Korff) at WPES 2008, the ACM Workshop on Privacy in the Electronic Society, held in conjunction with ACM CCS, the ACM Conference on Computer and Communications Security. Pietro and I also gave a related talk here at the Computer Laboratory in Cambridge.

While genetics is concerned with the observation of specific sections of DNA, genomics is about studying the entire genome of an organism, something that has only become practically possible in recent years. In forensic genetics, which is the technology behind the large national DNA databases being built in several countries, notably including the UK and the USA (Wallace’s outstanding article lucidly exposes many significant issues), investigators compare scene-of-crime samples with database samples by checking whether they match at only a very small number of specific locations in the genome (e.g. 13 loci under the CODIS rules). In our paper we explore what might change when forensic analysis moves from genetics to genomics over the next few decades. This is a problem that can only be meaningfully approached from a multi-disciplinary viewpoint, and indeed our combined backgrounds cover computer security, bioinformatics and law.
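
To illustrate what “matching at a small number of locations” amounts to in the genetics era, here is a simplified sketch (the profile encoding is my own; real casework also weighs allele frequencies and copes with partial or degraded samples): a profile can be thought of as a map from each locus to the pair of allele repeat counts observed there, and comparison is then trivial.

```python
# The 13 core loci used by CODIS at the time of writing.
CODIS_LOCI = ["CSF1PO", "D3S1358", "D5S818", "D7S820", "D8S1179", "D13S317",
              "D16S539", "D18S51", "D21S11", "FGA", "TH01", "TPOX", "vWA"]

def full_match(scene, candidate, loci=CODIS_LOCI):
    """True if the two profiles agree at every listed locus."""
    return all(frozenset(scene[l]) == frozenset(candidate[l]) for l in loci)

def loci_sharing_an_allele(scene, candidate, loci=CODIS_LOCI):
    """Count loci at which at least one allele is shared; a high count that
    falls short of a full match is the sort of partial hit that can point
    towards a relative of the database subject (the kin-privacy issue)."""
    return sum(1 for l in loci
               if frozenset(scene[l]) & frozenset(candidate[l]))
```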

CODIS markers (image from Wikimedia Commons, in turn from NIST).

Sequencing the first human genome (2003) cost 2.7 billion dollars and took 13 years. The US’s National Human Genome Research Institute has offered over $20M worth of grants towards the goal of driving the cost of whole-genome sequencing down to a thousand dollars. This will enable personalized genomic medicine (e.g. predicting the genetic risk of contracting specific diseases) but will also open up a number of ethical and privacy-related problems: eugenic abortions, genomic pre-screening as a precondition for healthcare (or even just dating…), (mis)use of genomic data for purposes other than those for which it was collected, and so forth. In various jurisdictions there is legislation (such as the recent GINA in the US) that attempts to protect citizens from some of the possible abuses; but how strongly is it enforced? And is it enough? In the forensic context, is the DNA analysis procedure as infallible as we are led to believe? There are many subtleties associated with the interpretation of statistical results; when even professional statisticians disagree, how are the poor jurors expected to reach a fair verdict? Another subtle issue is kin privacy: if the scene-of-crime sample, compared with everyone in the database, partially matches Alice, this may be used as a hint to investigate all her relatives, who aren’t even in the database; indeed, some 1980s murders were recently solved in this way. “This raises compelling policy questions about the balance between collective security and individual privacy” [Bieber, Brenner, Lazer, 2006]. Should a democracy allow such a “driftnet” approach of suspecting and investigating all the innocents in order to catch the guilty?

This is a paper of questions rather than one of solutions. We believe an informed public debate is needed before the expected transition from genetics to genomics takes place. We want to stimulate discussion and therefore we invite you to read the paper, make up your mind and support what you believe are the right answers.

How can we co-operate to tackle phishing?

Richard Clayton and I recently presented evidence of the adverse impact of take-down companies not sharing phishing feeds. Many phishing websites are missed by the take-down company which has the contract for removal; unsurprisingly, these websites are not removed very fast. Consequently, more consumers’ identities are stolen.

In the paper, we propose a simple solution: take-down companies should share their raw, unverified feeds of phishing URLs with their competitors. Each company can examine the raw feed, pick out the websites impersonating their clients, and focus on removing these sites.
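
To show how little machinery the proposal needs, here is a sketch under the assumption that each company keeps a simple list of the brand names it protects (the feed entries and brand list below are invented, and a real deployment would verify candidate sites by examining their content rather than just their URLs): each recipient merely filters the combined raw feed for its own clients.

```python
def relevant_urls(raw_feed, client_brands):
    """Pick out of a shared, unverified feed the URLs that appear to
    impersonate one of this company's clients."""
    brands = [brand.lower() for brand in client_brands]
    return [url for url in raw_feed
            if any(brand in url.lower() for brand in brands)]

# Hypothetical use: a take-down company protecting "examplebank" scans a
# competitor's raw feed for sites it is contracted to remove.
shared_feed = [
    "http://198.51.100.7/~user/examplebank/login.html",
    "http://host.invalid/othersite/verify.php",
]
print(relevant_urls(shared_feed, ["examplebank"]))
```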

Since we presented our findings at the Anti-Phishing Working Group eCrime Researchers Summit, we have received considerable feedback from take-down companies. Those attending the APWG meeting understood that sharing would help speed up response times, but expressed reservations about sharing their feeds unless they were duly compensated. Eric Olsen of Cyveillance (another company offering take-down services) has written a comprehensive rebuttal of our recommendations. He argues that competition between take-down companies drives investment in efforts to detect more websites. Mandated sharing of phishing URL feeds, in his view, would undermine these detection efforts and cause take-down companies such as Cyveillance to exit the business.

I do have some sympathy for the objections raised by the take-down companies. As we state in the paper, free-riding (where one company relies on another to invest in detection so they don’t have to) is a concern for any sharing regime. Academic research studying other areas of information security (e.g., here and here), however, has shown that free-riding is unlikely to be so rampant as to drive all the best take-down companies out of offering service, as Mr. Olsen suggests.

While we can quibble over the extent of the threat from free-riding, it should not detract from the conclusions we draw about the need for greater sharing. In our view, it would be unwise and irresponsible to accept the status quo of keeping phishing URL feeds completely private. After all, competition without sharing has approximately doubled the lifetimes of phishing websites! The solution, then, is to devise a sharing mechanism that gives take-down companies the incentive to keep detecting more phishing URLs.

Non-cooperation in the fight against phishing

Tyler Moore and I are presenting another one of our academic phishing papers today at the Anti-Phishing Working Group’s Third eCrime Researchers Summit here in Atlanta, Georgia. The paper “The consequence of non-cooperation in the fight against phishing” (pre-proceedings version here) goes some way to explaining anomalies we found in our previous analysis of phishing website lifetimes. The “take-down” companies reckon to get phishing websites removed within a few hours, whereas our measurements show that the average lifetimes are a few days.

These “take-down” companies are generally specialist offshoots of more general “brand protection” companies, and are hired by banks to handle removal of fake phishing websites.

When we examined our data more carefully we found that we were receiving “feeds” of phishing website URLs from several different sources — and the “take-down” companies that were passing the data to us were not passing the data to each other.

So it often occurs that take-down company A knows about a phishing website targeting a particular bank, but take-down company B is ignorant of its existence. If it is company B that has the contract for removing sites for that bank then, since they don’t know the website exists, they take no action and the site stays up.

Since we were receiving data feeds from both company A and company B, we knew the site existed and we measured its lifetime — which turned out to be much extended. In fact, it’s something of a mystery why such a site is removed at all! Our best guess is that reports made directly to ISPs trigger removal.

The paper contains all the details, and gives all the figures to show that website lifetimes are extended by about 5 days when the take-down company is completely unaware of the site. On other occasions the company learns about the site some time after it is first detected by someone else, which extends the lifetime by an average of 2 days.

Since extended lifetimes equate to more unsuspecting visitors handing over their credentials and having their bank accounts cleaned out, these delays can also be expressed in monetary terms. Using the rough and ready model we developed last year, we estimate that an extra $326 million per annum is currently being put at risk by the lack of data sharing. This figure is from our analysis of just two companies’ feeds, and there are several more such companies in this business.

Not surprisingly, our paper suggests that the take-down companies should share their data, so that when they learn about websites attacking banks they don’t have contracts with, they pass the details on to another company which can start to get the site removed.

We analyse the incentives to make this change (and the incentives the companies have not to do so) and contrast the current arrangements with the anti-virus/malware industry — where sample suspect code has been shared since the early 1990s.

In particular, we note that it is the banks who would benefit most from data sharing — and since they are paying the bills, we think that they may well be in a position to force through changes in policy. To best protect the public, we must hope that this happens soon.

Privacy Enhancing Technologies Symposium (PETS 2009)

I am on the program committee for the 9th Privacy Enhancing Technologies Symposium (PETS 2009), to be held in Seattle, WA, USA, 5–7 August 2009. PETS is the leading venue for research on privacy and anonymity, offering an enjoyable environment and stimulating discussion. If you are working in this field, I can strongly recommend submitting a paper.

This year, we are particularly looking for submissions on topics other than anonymous communications, so if work from your field can be applied, or is otherwise related, to the topic of privacy, I’d encourage you to consider PETS as a potential venue.

The submission deadline for the main session is 2 March 2009. As with last year, we will also have a “HotPETS” event, for new and exciting work in the field which is still in a formative state. Submissions for HotPETS should be received by 8 May 2009.

Further information can be found in the call for papers.

An A to Z of confusion

A few days ago I blogged about my paper on email spam volumes — comparing “aardvarks” (email local parts [left of the @] beginning with “A”) with “zebras” (those starting with a “Z”).

I observed that provided one considered “real” aardvarks and zebras — addresses that received good email amongst the spam — then aardvarks got 35% spam and zebras a mere 20%.

This has been widely picked up, first in the Guardian, and later in many other papers as well (even in Danish). However, many of these articles have got hold of the wrong end of the stick. So besides mentioning A and Z, it looks as if I should have published this figure from the paper as well…

Figure 3 from the academic paper

… the point being that the effect I am describing has little to do with Z being at the end of the alphabet, and A at the front, but seems to be connected to the relative rarity of zebras.

As you can see from the figure, marmosets and pelicans get around 42% spam (M and P being popular letters for people’s names) and quaggas 21% (there are very few Quentins, just as there are very few Zacks).

There are some outliers in the figure: for example “3” relates to spammers failing to parse HTML properly and ending up with “3c” (the hex code for a “<” character) at the start of names. However, it isn’t immediately apparent why “unicorns” get quite so much spam; it may just be a quirk of the way that I have assessed “realness”. Doubtless some future research will be able to explain this more fully.

Zebras and Aardvarks

We all know that different people get different amounts of email “spam”. Some of these differences result from how careful people have been in hiding their address from the spammers — putting it en clair on a webpage will definitely improve your chances of receiving unsolicited email.

However, it turns out there are other effects as well. In a paper I presented last week at the Fifth Conference on Email and Anti-Spam (CEAS 2008), I showed that the first letter of the local part of the email address also plays a part.

Incoming email to Demon Internet where the email address local part (the bit left of the @) begins with “A” (think of these as aardvarks) is almost exactly 50% spam and 50% non-spam. However, where the local part begins with “Z” (zebras) then it is about 75% spam.

However, if one only considers “real” aardvarks and zebras, viz. those addresses that were legitimate enough to receive some non-spam email, then the picture changes. Treating an email address as “real” if it receives, on average, at least one non-spam email every second day, real aardvarks receive 35% spam, but real zebras receive only 20% spam.
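
For anyone who wants to repeat the measurement on their own mail logs, the per-letter calculation is simple. Here is a minimal sketch: the record format, one (local_part, is_spam) pair per delivery, is an assumption for illustration, and the min_ham threshold is a crude stand-in for the “one non-spam email every second day” test used above.

```python
from collections import defaultdict

def spam_percentage_by_initial(records, min_ham=1):
    """records: iterable of (local_part, is_spam) pairs, one per delivery.
    Returns the spam percentage for each initial letter, counting only
    "real" addresses (those receiving at least min_ham non-spam emails)."""
    per_address = defaultdict(lambda: [0, 0])           # address -> [spam, ham]
    for local_part, is_spam in records:
        per_address[local_part.lower()][0 if is_spam else 1] += 1

    per_letter = defaultdict(lambda: [0, 0])            # initial -> [spam, ham]
    for address, (spam, ham) in per_address.items():
        if address and ham >= min_ham:                  # "real" addresses only
            per_letter[address[0]][0] += spam
            per_letter[address[0]][1] += ham

    return {letter: 100.0 * spam / (spam + ham)
            for letter, (spam, ham) in sorted(per_letter.items())}
```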

The most likely reason for these results is the prevalence of “dictionary” or “Rumpelstiltskin” attacks (where spammers guess addresses). If there are not many other zebras, then guessing zebra names is a less attractive strategy for the spammers.

Aardvarks should consider changing species — or asking their favourite email filter designer to think about how this unexpected empirical result can be leveraged into blocking more of their unwanted email.

[[[ ** Note that these percentages are way down from general spam rates because Demon rejects out of hand email from sites listed in the PBL (which are not expected to send email) and greylists email from sites in the ZEN list. This reduces overall volumes considerably — so YMMV! ]]]

PET Award 2008

At last year’s Privacy Enhancing Technologies Symposium (PETS), I presented the paper “Sampled Traffic Analysis by Internet-Exchange-Level Adversaries”, co-authored with Piotr Zieliński. In it, we discussed the risk of traffic analysis at Internet exchanges (IXes). We then showed that, given even a small fraction of the data passing through an IX, it was still possible to track a substantial proportion of anonymous communications. Our results are summarized in a previous blog post and full details are in the paper.

Our paper has now been announced as a runner-up for the Privacy Enhancing Technologies Award. The prize is presented annually for research which makes an outstanding contribution to the field. Microsoft, the sponsor of the award, have further details and summaries of the papers in their press release.

Congratulations to the winners, Arvind Narayanan and Vitaly Shmatikov, for “Robust De-Anonymization of Large Sparse Datasets”; and the other runners-up, Mira Belenkiy, Melissa Chase, C. Chris Erway, John Jannotti, Alptekin Küpçü, Anna Lysyanskaya and Erich Rachlin, for “Making P2P Accountable without Losing Privacy”.

Metrics for security and performance in low-latency anonymity systems

In Tor, and in other similar anonymity systems, clients choose a random sequence of computers (nodes) to route their connections through. The intention is that, unless someone is watching the whole network at the same time, the tracks of each user’s communication will become hidden amongst those of others. Exactly how a client chooses nodes varies from system to system, and is important for security.

If someone is simultaneously watching a user’s traffic as it enters and leaves the network, it is possible to de-anonymise the communication. As anyone can contribute nodes, this could occur if the first and last nodes for a connection are controlled by the same person. Tor takes some steps to avoid this possibility, e.g. no two computers on the same /16 network may be chosen for the same connection. However, someone with access to several networks could circumvent this measure.

Not only is route selection critical for security, but it is also a significant performance factor. Tor nodes vary dramatically in their capacity, mainly due to their network connections. If all nodes were chosen with equal likelihood, the slower ones would cripple the network. This is why Tor weights the selection probability of a node in proportion to its contribution to the network’s bandwidth.
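
To make the two constraints just described concrete, here is a heavily simplified sketch, not Tor’s actual path-selection code: real Tor distinguishes guard, middle and exit positions, uses consensus bandwidth weights, and applies family rules beyond the /16 check, and the relay list format here is invented for illustration.

```python
import random

def same_slash16(ip_a, ip_b):
    """True if two IPv4 addresses lie in the same /16 network."""
    return ip_a.split(".")[:2] == ip_b.split(".")[:2]

def pick_route(nodes, length=3):
    """nodes: list of (ipv4_address, bandwidth) pairs.  Build a route of the
    given length, choosing each node with probability proportional to its
    bandwidth and rejecting any node sharing a /16 with one already chosen."""
    weights = [bandwidth for _, bandwidth in nodes]
    route = []
    while len(route) < length:
        ip, _ = random.choices(nodes, weights=weights)[0]
        if any(same_slash16(ip, chosen) for chosen in route):
            continue          # would violate the /16 constraint; pick again
        route.append(ip)
    return route

# Hypothetical relay list: (address, bandwidth in arbitrary units).
relays = [("192.0.2.1", 100), ("198.51.100.2", 800), ("203.0.113.3", 50),
          ("198.18.5.9", 400), ("192.0.2.77", 250)]
print(pick_route(relays))
```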

Because of the dual importance of route selection, there are a number of proposals which offer an alternative to Tor’s bandwidth-weighted algorithm. Later this week at PETS I’ll be presenting my paper, co-authored with Robert N.M. Watson, “Metrics for security and performance in low-latency anonymity systems”. In this paper, we examine several route selection algorithms and evaluate their security and performance.

Intuitively, a route selection algorithm which weights all nodes equally appears the most secure, because an attacker can’t make their nodes count any more than the others. This intuition has been formalized by two measures: the Gini coefficient and entropy. In fact the reality is more complex: uniform node selection resists attackers with lots of bandwidth, whereas bandwidth-weighting is better against attackers with lots of nodes (e.g. botnets).
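
For reference, both measures are easy to compute over a route selection algorithm’s node-selection distribution. The sketch below is my own illustration, with p simply a list of per-node selection probabilities summing to one, using the standard formulas for Shannon entropy and the Gini coefficient.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of the selection distribution; it is maximal
    when every node is equally likely to be chosen."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def gini(p):
    """Gini coefficient of the selection distribution: 0 for uniform
    selection, approaching 1 as selection concentrates on a few nodes."""
    q = sorted(p)
    n = len(q)
    weighted_sum = sum(i * x for i, x in enumerate(q, start=1))
    return 2 * weighted_sum / (n * sum(q)) - (n + 1) / n

# Example: a uniform distribution over four nodes versus a skewed one.
print(entropy([0.25] * 4), gini([0.25] * 4))      # 2.0 bits, Gini 0.0
print(entropy([0.7, 0.1, 0.1, 0.1]), gini([0.7, 0.1, 0.1, 0.1]))
```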

Our paper explores the probability of path compromise of different route selection algorithms, when under attack by a range of different adversaries. We find that none of the proposals are optimal against all adversaries, and so summarizing effective security in terms of a single figure is not feasible. We also model the performance of the schemes and show that bandwidth-weighting offers both low latency and high resistance to attack by bandwidth-constrained adversaries.

Update (2008-07-25):
The slides (PDF 2.1M) for my presentation are now online.