Mention the phrase “binary reverse engineering” or “binary analysis” and it often conjures up an image of software pirates or hacking groups. However, there are practical reasons for doing analysis on machine code. For instance, machines don’t run source code, they run machine code – how do we know it’s running correctly? Malware doesn’t usually come with source code (but they are known to leak on occasion); How do we protect our software from discovered vulnerabilities if we’re unable to re-compile the program from the original source code? For three paper Thursday this week, my contribution is to highlight three representative security applications of binary analysis, namely software testing, malware analysis and software protection.
Towards Automatic Discovery of Deviations in Binary Implementations with Applications to Error Detection and Fingerprint Generation. David Brumley et al, Usenix Security Symposium 2007
If the same input given to two different implementations of the same protocol specification gives rise to a deviation in the output state, then an implementation error has been found.
This paper is an interesting read in my opinion because it took a non-obvious twist to the software verification problem. Given a Hoare triple {P}Ci{Q}, where P is the pre-condition, Ci, i=1,2 are the two implementations, and post-condition Q, we want to compute the weakest pre-condition fi = wp(Ci, Q). Function fi is a boolean formula over the input space of Ci such that if fi(x) = true, then Q is true. Then a deviation is likely to occur when either !f1(x) && f2(x) or f1(x) && !f2(x) is satisfiable. Computing the weakest pre-condition in real programs is challenging because it can easily result in a formula that is too large to solve. The paper dealt with this issue by keeping to a single execution path, making only input variables symbolic and by keeping Q simple. The authors then showed that even with these simplifications, they found several bugs, including one in the Miniweb HTTP server that assumed the first character of the URI to be a slash even when it was not. Another interesting application of this technique was to generate implementation fingerprints based on these deviations.
Inspector Gadget: Automated Extraction of Proprietary Gadgets from Malware Binaries. Clemens Kolbitsch et al. IEEE Symposium on Security & Privacy Symposium 2010
The goal is to extract stand-alone code fragments, for example an encryption routine E(), and derive its inversion, E-1(), automatically from malware.
The idea is neat, and the problem is a practical one for malware analysts. The initial step is a manual one. Given a log of the malware’s behaviour, the analyst specifies the data of interest (in this case a particular API call), and the Inspector takes care of the rest. There are at least two challenges in this task. Firstly, extracting the code, or gadget, is challenging. Given a target location T, Inspector has to perform a program slice on T. In the worst case, this leads to all instructions being included in the slice. Inspector deals with this issue using various heuristics to determine suitable end-points. Secondly, and in my opinion the more difficult problem, creating a gadget replayer is non-trivial since its supposed to run on bare metal. This requires that memory buffers be relocated, all errors to be properly handled and the API call interface to be carefully configured so that its execution stays true to the original but yet is contained safely. To invert a function E(), Inspector uses a brute-force approach on a sub-set of the output space. It was slightly disappointing that the authors did not use SAT/SMT solvers for this task, although it was mentioned as a possible extension.
Howard : a Dynamic Excavator for Reverse Engineering Data Structures by Slowinska et al. Network and Distributed System Security Symposium (NDSS 2011)
Data structures can be identified via dynamic memory access patterns and retro-fitted with buffer overflow protection
The problem addressed in this paper is recovering variables types from machine code, specifically locally defined data structures. The key insight is that even though the bytes on the stack look anonymous and the code is highly optimised, the way they are accessed provides a means to infer their structure. If a memory buffer is accessed in strides of a word size, then we may infer that the buffer contains a word-sized array. If the buffer has more than one possible stride pattern, then the “least common pattern” heuristic is used. However, sometimes the structure is not accessed all at once, and Howard makes use of the fact that the accesses share a common base pointer. Howard does per-execution analysis and relies on KLEE to generate a comprehensive set of test input. The buffer protection mechanism is similar to that of WIT – all buffers are given a colour. When a pointer is initialised to the buffer, it is assigned that colour. Whenever it does a dereferenced write, a check is made to ensure that the pointer and buffer colours coincide. On the whole, this paper addresses a difficult problem with a nice solution and rounds it off with a practical security application.
There are certainly other problems in binary analysis that I have not highlighted, such as in the field of computer forensics. Feel free to suggest other interesting papers in the comments!