Password cracking has returned to the news, with a thorough Ars Technica article on the increasing potency of cracking tools and the third Crack Me If You Can contest at this year’s DEFCON. Taking a critical view, I’ll argue that it’s not clear exactly how much password cracking is improving and that the cracking community could do a much better job of measuring progress.
Password cracking can be evaluated on two nearly independent axes: power (the ability to check a large number of guesses quickly and cheaply using optimized software, GPUs, FPGAs, and so on) and efficiency (the ability to generate large lists of candidate passwords accurately ranked by real-world likelihood using sophisticated models). It’s relatively simple to measure cracking power in units of hashes evaluated per second or hashes per second per unit cost. There are details to account for, like the complexity of the hash being evaluated, but this problem is generally similar to cryptographic brute force against unknown (random) keys and power is generally increasing exponentially in tune with Moore’s law. The move to hardware-based cracking has enabled well-documented orders-of-magnitude speedups.
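For concreteness, here is a minimal sketch (in Python, using hashlib and a single thread) of what measuring power as hashes evaluated per second looks like. Real crackers use optimised C, GPU or FPGA code and run orders of magnitude faster, so this only illustrates the unit of measurement, not a realistic rate.

```python
# Toy measurement of cracking "power": raw MD5 hashes per second.
# Illustrative only; real tools are vastly faster and handle salts,
# slow hashes, and many hash formats.
import hashlib
import time

def hashes_per_second(candidates, seconds=1.0):
    """Hash candidate passwords with MD5 for roughly `seconds` and report the rate."""
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < seconds:
        for pw in candidates:
            hashlib.md5(pw.encode()).hexdigest()
            count += 1
    return count / (time.perf_counter() - start)

if __name__ == "__main__":
    sample = ["password%d" % i for i in range(1000)]
    print("about %.0f MD5 hashes per second in single-threaded Python"
          % hashes_per_second(sample))
```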
Cracking efficiency, by contrast, is rarely measured well. Useful data points, some of which I curated in my PhD thesis, consist of the number of guesses made against a given set of password hashes and the proportion of hashes which were cracked as a result. Ideally many such points should be reported, allowing us to plot a curve showing the marginal returns as additional guessing effort is expended. Unfortunately results are often stated in terms of the total number of hashes cracked (here are some examples). Sometimes the runtime of a cracking tool is reported, which is an improvement but conflates efficiency with power.
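To illustrate the kind of data I'd like to see reported, here is a rough sketch of evaluating an ordered guess list against a set of hashes at several guess budgets, which yields exactly the points needed to plot a guessing curve. It assumes unsalted MD5 hashes and an in-memory guess list, both simplifications; the variable names and example budgets are mine, not anyone's actual methodology.

```python
# Sketch: fraction of a hash set cracked after N guesses, for several N.
import hashlib

def crack_curve(ordered_guesses, target_hashes, checkpoints):
    """Return (guesses made, fraction cracked) at each checkpoint."""
    remaining = set(target_hashes)
    total = len(remaining)
    pending = sorted(checkpoints)   # guess budgets at which to record a point
    points = []
    for i, guess in enumerate(ordered_guesses, start=1):
        remaining.discard(hashlib.md5(guess.encode()).hexdigest())
        if pending and i == pending[0]:
            pending.pop(0)
            points.append((i, 1.0 - len(remaining) / total))
        if not pending:
            break
    return points

# Hypothetical usage against a toy set of leaked hashes:
# leaked = {hashlib.md5(p.encode()).hexdigest() for p in real_passwords}
# for n, frac in crack_curve(wordlist, leaked, [10**3, 10**4, 10**5]):
#     print("%d guesses -> %.1f%% cracked" % (n, 100 * frac))
```

Reporting several such points, rather than a single "X% cracked" figure, makes the marginal returns of additional guessing visible.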
When we do have efficiency data, it’s often not contextualized against simple baseline attacks. As an example, consider KoreLogic’s impressive announcement of cracking 93% of a real password leak in 24 seconds (the hashes leaked from Military Singles, an online dating site). One baseline is just trying everything from the RockYou leak of about 14 million (unique) passwords, which cracked 63.1% of the hashes (in under a minute, for what that’s worth). Trying an expanded list of about 100 million passwords I’ve collected from other leaks broke 73% of the accounts. I don’t mean to downplay KoreLogic’s results, but to point out that a reasonable wordlist gets you most of the way there. It would be interesting to compare how much KoreLogic’s methods could crack with the same budget of 14 or 100 million guesses, though I expect the difference would be small. For the (unreported) number of additional guesses KoreLogic used to get to 93%, the efficiency should be compared with brute force: if they evaluated a trillion guesses using some proprietary generation method, how much better was this than just using a good wordlist followed by a trillion guesses of random 6–8 character passwords? I haven’t seen this kind of analysis provided since Weir et al.’s 2009 Oakland paper.
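To make that baseline concrete, here is a sketch of the guess list I'd compare any proprietary generator against: a leaked wordlist followed by random 6–8 character passwords, truncated to the same budget as the method under test. The wordlist, alphabet and budget below are illustrative assumptions, not anything KoreLogic used.

```python
# Sketch of a baseline guess generator: wordlist first, random filler after.
import itertools
import random
import string

def baseline_guesses(wordlist, budget, seed=0):
    """Yield up to `budget` guesses: the wordlist, then random 6-8 char strings."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits  # assumed character set

    def random_tail():
        while True:
            length = rng.randint(6, 8)
            yield "".join(rng.choice(alphabet) for _ in range(length))

    return itertools.islice(itertools.chain(wordlist, random_tail()), budget)

# Hypothetical usage: evaluate baseline_guesses(rockyou, 10**9) and the
# proprietary guess list with the same measurement (e.g. a guessing curve)
# and compare the cracked fractions at equal budgets.
```

The point is not that this baseline is clever, but that any claimed advance in guessing efficiency should be shown to beat it at the same number of guesses.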
Public contests like Crack Me If You Can are a great way to motivate research, but have thus far failed to develop any repeatable efficiency benchmarks. One issue is the contest’s use of algorithmically-generated passwords, meaning the cracking tools were measured not on how efficiently they simulate real-world password choices but on how well they simulate the organiser’s model. A second issue is that it was simultaneously a contest of power and efficiency, meaning we don’t know if the winners (the developers of HashCat) were guessing more efficiently than their main rivals (using John The Ripper) or simply deployed more guessing power. I’d propose holding one contest purely of power, in which teams compete to break pseudorandom passwords (possibly with a limited budget of electricity or money), and a second contest of efficiency, where teams submit an ordered list of guesses and their efficiency is plotted by evaluating against a never-before-released set of human-chosen passwords. The second part of the contest would be difficult to synchronise with DEFCON, since it relies on new test data being available. It doesn’t require a huge amount of data, though: a list of 10,000 or so real hashes could be “donated” to the contest from defunct system accounts, perhaps. Alternatively, an organised contest could be held on short notice whenever a new leak becomes available.
In many ways, the increase in available password data has spurred a renaissance in password cracking. Many new people are getting involved and exciting things are happening, such as HashCat emerging as a freely-available tool on par with John The Ripper. The password cracking community must devote some attention, though, to developing a scientific way to measure progress for the benefit of security engineers.
This post is part 1 of a 2 part series on password cracking. Tomorrow I’ll discuss the relevance of password cracking to web passwords. On a personal note, I’ve completed my PhD dissertation on guessing statistics and will soon be joining Google to take on new research challenges. This post reflects my opinions only and not those of my future employer.
Thanks to Moxie Marlinspike, Richard Clayton and Sören Preibusch for reviewing a draft of this post.
I believe the JtR team (I am a member) was much stronger with regards to efficiency, as it scored a lot more against the “hard” hashes. Those hashes need attacks that are very targeted, as they are both very slow and salted. On the other hand, the raw power of team Hashcat meant they were untouchable on the unsalted hashes.
As for the contest idea, you could use a contest based on “rounds”, where it is possible to submit a decreasing number of cleartexts during each round, and teams get the list of passwords that were cracked. Rounds would be separated by a predefined amount of time.
This would be nice for the spectators as teams could not hold back their cracked passwords until the last second, but the danger of colluding teams is acute.