Over a year ago, we blogged about a bug at Gawker which replaced all non-ASCII characters in passwords with ‘?’ prior to checking. Along with Rubin Xu and others I’ve investigated issues surrounding passwords, languages, and character encoding throughout the past year. This should be easy: websites using UTF-8 can accept any password and hash it into a standard format regardless of the writing system being used. Instead, as we report in a new paper which I presented last week at the Web 2.0 Security and Privacy workshop in San Francisco, passwords still localise poorly, both because websites are buggy and because users have been trained to type ASCII passwords only. This has broad implications for passwords’ role as a “universal” authentication mechanism.
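As a rough illustration of the "should be easy" path, here is a minimal sketch in Python. The function name `hash_password`, the use of PBKDF2, the iteration count, and the NFC normalisation step are all assumptions made for this example and do not describe any site we surveyed.

```python
import hashlib
import os
import unicodedata
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Hash a password from any writing system (hypothetical helper)."""
    if salt is None:
        salt = os.urandom(16)
    # Assumption: normalise first so that composed and decomposed forms
    # of the same visible text produce the same bytes.
    normalized = unicodedata.normalize("NFC", password)
    # Encode explicitly as UTF-8 so the hash input does not depend on
    # the platform's default encoding.
    digest = hashlib.pbkdf2_hmac("sha256", normalized.encode("utf-8"), salt, 100_000)
    return salt, digest


salt, composed = hash_password("p\u00e1jaro")           # "á" as one code point
_, decomposed = hash_password("pa\u0301jaro", salt)     # "a" + combining accent
print(composed == decomposed)
```

Nothing in this path cares whether the password is ASCII, Chinese, or Hebrew; the bugs described in this post creep in before or around this step.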
After finding the Gawker bug we did an informal survey of about 20 popular websites looking for character encoding bugs in passwords. Roughly speaking, about a third of the websites we tried appear to handle long UTF-8 passwords seamlessly, about a third disallow non-ASCII characters in passwords as a matter of policy, and we found bugs in the other third. Many of the bugs had no security impact, and others merely circumvented password policies. For example, Walmart and IMDB both count bytes submitted instead of characters. With non-ASCII characters replaced with numeric character references and then percent encoding, this can cause single UTF-8 characters to expand up to 15 bytes. With Walmart’s password policy limiting passwords to just 11 bytes, this means that a password with just two characters (like 密码) can be rejected for being too long. Other bugs are more serious: besides the Gawker bug, we discovered a lingering problem in many implementations of DES-crypt() which truncates passwords after any character with a 0x80 byte in its UTF-8 representation, including the character À (here’s an advisory for FreeBSD).
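The two effects above can be sketched in a few lines of Python. The helper names are hypothetical, and `truncate_at_0x80` only models the observable effect of the DES-crypt() problem (it assumes the cut happens at the 0x80 byte itself); it is not the code of any real crypt() implementation.

```python
from urllib.parse import quote

MAX_BYTES = 11  # the byte limit quoted above for Walmart


def as_submitted(password: str) -> str:
    """Replace non-ASCII with numeric character references, then percent-encode."""
    with_ncrs = password.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return quote(with_ncrs, safe="")


def passes_byte_limit(password: str, limit: int = MAX_BYTES) -> bool:
    """The buggy check: counts submitted bytes instead of characters."""
    return len(as_submitted(password)) <= limit


def truncate_at_0x80(password: str) -> bytes:
    """Model of the truncation: keep only the bytes before the first 0x80."""
    raw = password.encode("utf-8")
    cut = raw.find(b"\x80")
    return raw if cut == -1 else raw[:cut]


password = "密码"
print(len(password))                # 2 characters
print(as_submitted(password))       # %26%2323494%3B%26%2330721%3B
print(len(as_submitted(password)))  # 28 bytes
print(passes_byte_limit(password))  # False

# "À" is C3 80 in UTF-8, so everything after it is ignored:
print(truncate_at_0x80("passÀword") == truncate_at_0x80("passÀ-other"))  # True
```

Under this model, any two passwords that agree up to the first À are interchangeable, which is why the truncation is a security problem rather than a mere policy quirk.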
Of more fundamental interest, we found evidence that user behavior is significantly impacted by character encoding issues. In my study of password statistics at Yahoo!, I found that common password dictionaries work effectively against all language groups. Examining leaked data from websites used primarily by Chinese and Hebrew speakers, we found that this is in part because users almost exclusively use ASCII passwords even when allowed to do otherwise. Most Chinese speakers rely on graphical Pinyin input methods, which are disabled for password fields to prevent shoulder-surfing; unsurprisingly, Chinese characters are virtually non-existent in passwords. Hebrew speakers usually have a dual-mapped keyboard so Hebrew and Latin are equally easy to enter, but in a leaked data set where 90% of usernames contained Hebrew characters we found only 2.5% of passwords did. We even observed Hebrew speakers switching their keyboard mapping to the Latin alphabet and then typing Hebrew words (producing gibberish in ASCII). Users of non-ASCII variants of the Latin alphabet appear less trained to convert to ASCII: looking at Spanish passwords within the leaked RockYou set we found roughly half retained the non-ASCII character ‘ñ’, though nearly all users dropped stress accents which require escape keys to type (e.g. typing “pajaro” instead of “pájaro”).
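To make the measurement and the Hebrew "gibberish" effect concrete, here is a small Python sketch. The function names are hypothetical, the sample list is made up for illustration, and the key table assumes the standard Hebrew layout on a US QWERTY keyboard; it is not the tooling used in the paper.

```python
# Assumed standard Hebrew layout, row by row, paired with the Latin
# character printed on the same physical key of a US QWERTY keyboard.
HEBREW_KEYS = "קראטוןםפשדגכעיחלךףזסבהנמצתץ"
LATIN_KEYS = "ertyuiopasdfghjkl;zxcvbnm,."
HEBREW_TO_LATIN = dict(zip(HEBREW_KEYS, LATIN_KEYS))


def non_ascii_fraction(passwords):
    """Fraction of passwords containing at least one non-ASCII character."""
    if not passwords:
        return 0.0
    return sum(1 for p in passwords if not p.isascii()) / len(passwords)


def hebrew_on_latin_layout(word: str) -> str:
    """What appears when a Hebrew word is typed with the Latin mapping active."""
    return "".join(HEBREW_TO_LATIN.get(ch, ch) for ch in word)


# Made-up sample, not real leaked data.
sample = ["password", "123456", "שלום", "akuo"]
print(non_ascii_fraction(sample))
print(hebrew_on_latin_layout("שלום"))
```

The last entry in the sample is what the third entry becomes after the mapping switch: a string that counts as ASCII in the statistics but is still a Hebrew word to the person who typed it.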
More interestingly, we found that Chinese speakers (and Hebrew speakers to a lesser extent) were far more likely to use digits in their passwords or rely on a geometric keyboard pattern. This leads to a measurable security difference: the most common passwords in our leaked Chinese data sets were also far more common than the most common passwords in leaked English language data sets (our Hebrew data set was too small to compute these statistics reliably). The irony is that linguistic diversity should help password security by making guessing more difficult. Instead, for the roughly half of the planet whose native writing system isn’t the Latin alphabet, passwords appear less secure and more difficult to use, as these users must remember something in ASCII to ensure compatibility. It’s an interesting challenge to come up with a better solution for these users.
As a user of non-ASCII variants of the Latin alphabet, I can tell you that users often avoid non-ASCII characters because they have had bad experiences with authentication systems that mess up passwords containing them. This often means being unable to access the system, or even being unable to change the password. Sites that accept non-ASCII characters may then mess things up after something changes, e.g. a software component replacement. So many users learn that it is much safer to avoid non-ASCII characters.
Also, you may find many computer-savvy users who avoid many punctuation characters. This is because they may use both US and local keyboard layouts (e.g. system administrators who move from one system to another), where many symbols may be missing or difficult to compose. Since passwords are usually not displayed, it may be difficult to understand why a password is rejected (yes, they could test the characters in the username field…).
Other things to take care of are keyboard layouts. E.g. German computer scientists often avoid z and y in their passwords. The reason is that German and English keyboard layouts are mostly identical, with only these two letters switched in position. As typing in a password field is masked by *, you don’t know whether you are working on a computer with a German or an English keyboard layout.
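A tiny Python sketch of the z/y problem described above. It models only the swap of those two letters (the layouts also differ in punctuation, which is ignored here), and the function names are made up for the example.

```python
# Only the z/y swap is modelled; punctuation differences are ignored.
ZY_SWAP = str.maketrans("zyZY", "yzYZ")


def typed_on_other_layout(password: str) -> str:
    """What gets entered when the same keys are pressed on the other layout."""
    return password.translate(ZY_SWAP)


def layout_safe(password: str) -> bool:
    """True if the password comes out the same on both layouts."""
    return typed_on_other_layout(password) == password


print(typed_on_other_layout("crazy"))  # crayz
print(layout_safe("crazy"))            # False
print(layout_safe("hunter2"))          # True
```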
(Note: a lot of languages have horrible keyboard layouts for programming, with characters like [ ; { in difficult-to-type positions.)
… which is made even worse by the habit some OSes have *cough MacOSX cough* of switching back to the system default layout for some (not all) password dialogs, without indicating which one is active for the password field.
Add passwords for web services which you typically access from several different devices and OSes (incl. your smartphone), and you will learn pretty quickly that anything exceeding ASCII is a major PITA, and to be avoided in your next password.
A strategy I sometimes use is to type the password into the username/login field, then cut and paste into the password field.
Lately, though, I ran into a site which seems to require each field to be entered character-by-character, which would make cut-and-paste a bit tedious. (But not impossible–I guess). Cheers!