In a seminar today, we will unveil Rendezvous, a search engine for code. Built by Wei-Ming Khoo, it will analyse an unknown binary, parse it into functions, index them, and compare them with a library of code harvested from open-source projects.
As time goes on, the programs we need to reverse engineer get ever larger, so we need better tools. Yet most code nowadays is not written from scratch, but cut and pasted. Programmers are not an order of magnitude more efficient than a generation ago; it’s just that we have more and better libraries to draw on nowadays, and a growing shared heritage of open software. So our idea is to reframe the decompilation problem as a search problem, and harness search-engine technology to the task.
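To make the "decompilation as search" idea concrete, here is a minimal sketch of an inverted index over functions. It is not the Rendezvous implementation. It assumes, purely for illustration, that each function has already been disassembled into a list of instruction mnemonics, and it indexes overlapping mnemonic n-grams the way a text engine indexes words. The class name `CodeIndex`, the function labels and the toy instruction sequences are all hypothetical.

```python
from collections import defaultdict


def ngrams(mnemonics, n=3):
    """Return the set of overlapping n-grams in a mnemonic sequence."""
    return {tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)}


class CodeIndex:
    """Toy inverted index: n-gram -> names of functions containing it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> set of function names
        self.sizes = {}                   # function name -> distinct n-gram count

    def add(self, name, mnemonics):
        """Index one known function (e.g. from an open-source library)."""
        grams = ngrams(mnemonics, self.n)
        self.sizes[name] = len(grams)
        for gram in grams:
            self.postings[gram].add(name)

    def query(self, mnemonics):
        """Rank indexed functions by Jaccard similarity to an unknown one."""
        grams = ngrams(mnemonics, self.n)
        shared = defaultdict(int)
        for gram in grams:
            for name in self.postings.get(gram, ()):
                shared[name] += 1
        scored = []
        for name, common in shared.items():
            union = len(grams) + self.sizes[name] - common
            scored.append((common / union, name))
        return sorted(scored, reverse=True)


# Hypothetical corpus: two "library" functions as mnemonic sequences.
index = CodeIndex(n=3)
index.add("libfoo:copy_buf",
          ["push", "mov", "cmp", "jz", "mov", "inc", "jmp", "ret"])
index.add("libbar:checksum",
          ["push", "xor", "mov", "add", "inc", "cmp", "jnz", "ret"])

# Function recovered from an unknown binary: copy_buf plus one extra "pop".
unknown = ["push", "mov", "cmp", "jz", "mov", "inc", "jmp", "pop", "ret"]
results = index.query(unknown)
for score, name in results:
    print(f"{score:.3f}  {name}")
```

The inverted index means a query only touches functions that share at least one n-gram with the unknown code, which is what lets this kind of lookup scale like text search rather than like pairwise comparison. A real system would need features that survive different compilers and optimisation levels, which this sketch ignores.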
As with a text search engine, Rendezvous uses a number of different techniques to index a target binary, some of which are described in this paper, along with the main engineering problems. As well as reverse engineering suspicious binaries, code search engines could be used for many other purposes such as monitoring GPL compliance, plagiarism detection, and quality control. On the dark side, code search can be used to find new instances of disclosed vulnerabilities. Every responsible software vendor or security auditor should build one. If you’re curious, here is the demo.
Identifying “common code” is a sensible idea for a number of reasons over and above the ones you have mentioned.
One such is a code cutter’s private library of cut-n-paste code. From the early days of MFC it was obvious you had to write quite complex handler code to get any useful mileage out of it.
Now because MS did such a lousy documentation job, many of the more productive code cutters had their own “special code” that, due to machismo and other reasons (job security), they would not give to others or allow them to see or use.
The result is that such code “fingerprinted” them.
The same still applies so you could find your program turned into a forensic tool as well…
@Clive The ZeuS rootkit, whose source was leaked in May 2011, used a non-standard string library. For example, strcmp has 4 arguments instead of the standard 2.
https://github.com/Visgean/Zeus/blob/translation/source/common/str.cpp#L875
@Wei @Clive If you’re interested in this issue, you may wish to check out the body of research in software engineering on “code clones” and “clone detection”.