Five years ago, I compiled a dataset of password histograms representing roughly 70 million Yahoo! users. It was the largest password dataset ever compiled for research purposes. The data was a key component of my PhD dissertation the next year and motivated new statistical methods for which I received the 2013 NSA Cybersecurity Award.
I had always hoped to share the data publicly. It consists only of password histograms, not passwords themselves, so it seemed reasonably safe to publish. But without a formal privacy model, Yahoo! didn’t agree. Given the history of deanonymization work, caution is certainly in order. Today, thanks to new differential privacy methods described in a paper published at NDSS 2016 with colleagues Jeremiah Blocki and Anupam Datta, a sanitized version of the data is publicly available.
Lest there be any confusion, I’d like to stress that this dataset does not include any individual passwords. What it does contain is 52 different histograms of how popular passwords were among different subsets of users at Yahoo! (as of 2011). For example, you can learn how popular the most popular password chosen by French-speaking users was, or how popular the 17th most popular password was among users between the ages of 35 and 44. This data was already a random subsample of Yahoo! users, and we’ve since applied a sanitization algorithm to achieve differential privacy: the counts have all been very slightly modified to ensure that no individual user’s password significantly influenced the results. The details, of course, are in the paper.
Due to the aggregated nature of the original dataset, the necessary perturbations were small. As shown in Table III of the NDSS paper, it’s possible to closely reproduce the results from my original 2012 paper using the anonymized dataset (you can even use my source code).
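To give a rough sense of what can be computed from a frequency list alone, here is a minimal Python sketch of three common guessing statistics: min-entropy, the success rate of an attacker limited to a fixed number of guesses, and the number of guesses needed to compromise a given fraction of accounts. The function name `guessing_metrics`, its parameters, and the toy counts are all made up for illustration; this is not my original source code, and it assumes nothing about the published file format beyond having a list of counts.

```python
import math

def guessing_metrics(counts, beta=10, alpha=0.5):
    """Simple guessing statistics from a list of password counts.

    counts: how many users chose each distinct password (any order).
    beta:   number of guesses an online attacker is allowed per account.
    alpha:  fraction of accounts an offline attacker wants to compromise.
    (All three names are hypothetical, chosen for this sketch.)
    """
    freqs = sorted(counts, reverse=True)  # most popular password first
    total = sum(freqs)

    # Min-entropy: difficulty of guessing the single most popular password.
    min_entropy = -math.log2(freqs[0] / total)

    # Success rate: share of accounts hit by guessing the top `beta` passwords.
    success_rate = sum(freqs[:beta]) / total

    # Work factor: fewest guesses per account, in popularity order,
    # needed to compromise at least a fraction `alpha` of accounts.
    work_factor = len(freqs)
    covered = 0
    for rank, count in enumerate(freqs, start=1):
        covered += count
        if covered >= alpha * total:
            work_factor = rank
            break

    return min_entropy, success_rate, work_factor

# Toy histogram of 100 users, invented for illustration (not real data).
toy_counts = [50, 25, 15, 5, 5]
print(guessing_metrics(toy_counts, beta=2, alpha=0.8))  # (1.0, 0.75, 3)
```

Because every one of these statistics depends only on the sorted counts, none of them requires knowing what the passwords actually were, which is why a histogram-only release can still support this kind of analysis.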
I hope this dataset will be useful for future research on passwords and authentication. For example, it can be used as a baseline for comparison with new password datasets collected in the future, or to compute new metrics for guessing difficulty. If you’re using the dataset, please get in touch!