One of the steps used by the attacker who compromised Light Blue Touchpaper a few weeks ago was to create an account (which he promoted to administrator; more on that in a future post). I quickly disabled the account, but while doing forensics, I thought it would be interesting to find out the account password. WordPress stores raw MD5 hashes in the user database (despite my recommendation to use salting). As with any respectable hash function, it is believed to be computationally infeasible to discover the input of MD5 from an output. Instead, someone would have to try out all possible inputs until the correct output is discovered.
So, I wrote a trivial Python script which hashed all dictionary words, but that didn’t find the target (I also tried adding numbers to the end). Then, I switched to a Russian dictionary (because the comments in the shell code installed were in Russian) but that didn’t work either. I could have found or written a better password cracker, which varies the case of letters, and does common substitutions (e.g. o → 0, a → 4) but that would have taken more time than I wanted to spend. I could also improve efficiency with a rainbow table, but this needs a large database which I didn’t have.
Instead, I asked Google. I found, for example, a genealogy page listing people with the surname “Anthony”, and an advert for a house, signing off “Please Call for showing. Thank you, Anthony”. And indeed, the MD5 hash of “Anthony” was the database entry for the attacker. I had discovered his password.
In both the webpages, the target hash was in a URL. This makes a lot of sense — I’ve even written code which does the same. When I needed to store a file, indexed by a key, a simple option is to make the filename the key’s MD5 hash. This avoids the need to escape any potentially dangerous user input and is very resistant to accidental collisions. If there are too many entries to store in a single directory, by creating directories for each prefix, there will be an even distribution of files. MD5 is quite fast, and while it’s unlikely to be the best option in all cases, it is an easy solution which works pretty well.
Because of this technique, Google is acting as a hash pre-image finder, and more importantly finding hashes of things that people have hashed before. Google is doing what it does best — storing large databases and searching them. I doubt, however, that they envisaged this use though. 🙂