Synthesis of Machine Code: Algorithms and Applications

Venkatesh Srinivasan
University of Wisconsin

The analysis of binaries has gotten an increasing amount of attention from the academic community in the last decade. The results of binary analysis have been predominantly used to answer questions about the properties of binaries. Another potential use of analysis results is to rewrite the binary via semantic transformations. Semantics-based binary-rewriting tools become particularly important when one wishes to modify the functionality of the binary, and the source code and/or compiler toolchain for the binary is unavailable. Semantics-based binary rewriting can be done for the purposes of binary optimization (offline optimization, superoptimization), software reuse (partial evaluation, slicing, binary translation), or software security (binary obfuscation, de-obfuscation).

This dissertation proposes various algorithms for synthesizing a machine-code implementation from a semantic specification of the desired behavior, and describes how one can use a machine-code synthesizer to build different semantics-based binary-rewriting tools.

The first part of this dissertation introduces the problem of machine-code synthesis, proposes a framework for semantics-based binary rewriting via machine-code synthesis, and presents several algorithms for machine-code synthesis. A key challenge in synthesizing machine code is the enormous size of the synthesis search-space. Instruction sets like Intel's IA-32 have around 43,000 unique instruction schemas; this huge instruction pool, along with the exponential cost inherent in enumerative synthesis, results in an enormous search space for the synthesizer: even for relatively small specifications, a naive synthesizer might take several days to find an implementation. We have developed a suite of techniques---ranging from pruning heuristics to machine learning---that assist an enumerative synthesizer in combating the enormous search space, and speeding up synthesis from the order of days to the order of seconds.

The second part of this dissertation describes how one can instantiate the aforementioned synthesizer-equipped framework to create different semantics-based binary rewriters. In particular, we have instantiated the framework to create two tools that rewrite binaries for purposes of software reuse: a novel machine-code partial evaluator, and an improved machine-code slicer. Among other potential uses, these tools can be used to either modify or extract a component from a binary that lacks source code.

Keywords: Machine-code synthesis, IA-32 instruction set, machine learning, partial evaluation, slicing

(Click here to access the dissertation: PDF.)