Static Analysis of Executables to Detect Malicious Patterns^*

Mihai Christodorescu
mihai@cs.wisc.edu Somesh Jha
mihai@cs.wisc.edu

Computer Sciences Department
University of Wisconsin, Madison

Abstract

Malicious code detection is a crucial component of any defense mechanism. In this paper, we present a unique viewpoint on malicious code detection. We regard malicious code detection as an obfuscation-deobfuscation game between malicious code writers and researchers working on malicious code detection. Malicious code writers attempt to obfuscate the malicious code to subvert the malicious code detectors, such as anti-virus software. We tested the resilience of three commercial virus scanners against code-obfuscation attacks. The results were surprising: the three commercial virus scanners could be subverted by very simple obfuscation transformations! We present an architecture for detecting malicious patterns in executables that is resilient to common obfuscation transformations. Experimental results demonstrate the efficacy of our prototype tool, SAFE (a static analyzer for executables).

This work was supported in part by the Office of Naval Research under contracts N00014-01-1-0796 and N00014-01-1-0708. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notices affixed thereon.
The views and conclusions contained herein are those of the authors, and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the above government agencies or the U.S. Government.

1 Introduction

In the interconnected world of computers, malicious code has become an omnipresent and dangerous threat. Malicious code can infiltrate hosts using a variety of methods such as attacks against known software flaws, hidden functionality in regular programs, and social engineering. Given the devastating effect malicious code has on our cyber infrastructure, identifying malicious programs is an important goal. Detecting the presence of malicious code on a given host is a crucial component of any defense mechanism.

Malicious code is usually classified [16] according to its propagation method and goal into the following categories:

viruses are programs that self-replicate within a host by attaching themselves to programs and/or documents that become carriers of the malicious code;
worms self-replicate across a network;
trojan horses masquerade as useful programs, but contain malicious code to attack the system or leak data;
back doors open the system to external entities by subverting the local security policies to allow remote access and control over a network;
spyware is a useful software package that also transmits private user data to an external entity.

Combining two or more of these malicious code categories can lead to powerful attack tools. For example, a worm can contain a payload that installs a back door to allow remote access. When the worm replicates to a new system (via email or other means), the back door is installed on that system, thus providing an attacker with a quick and easy way to gain access to a large set of hosts. Staniford et al. have demonstrated that worms can propagate extremely quickly through a network, and thus potentially cripple the entire cyber infrastructure [1]. In a recent outbreak, the Sapphire/SQL Slammer worm reached the peak infection rate in about 10 minutes since launch, doubling every 8.5 seconds [32]. Once the back-door tool gains a large installed base, the attacker can use the compromised hosts to launch a coordinated attack, such as a distributed denial-of-service (DDoS) attack [2].

In this paper, we develop a methodology for detecting malicious patterns in executables. Although our method is general, we have initially focused our attention on viruses. A computer virus replicates itself by inserting a copy of its code (the viral code) into a host program. When a user executes the infected program, the virus copy runs, infects more programs, and then the original program continues to execute. To the casual user, there is no perceived difference between the clean and the infected copies of a program until the virus activates its malicious payload.

The classic virus-detection techniques look for the presence of a virus-specific sequence of instructions (called a virus signature) inside the program: if the signature is found, it is highly probable that the program is infected. For example, the Chernobyl/CIH virus is detected by checking for the hexadecimal sequence [5]:

E800 0000 005B 8D4B 4251 5050
0F01 4C24 FE5B 83C3 1CFA 8B2B

This corresponds to the following IA-32 instruction sequence, which constitutes part of the virus body:

E8 00000000 call 0h 5B pop ebx 8D 4B 42 lea ecx, [ebx + 42h] 51 push ecx 50 push eax 50 push eax 0F01 4C 24 FE sidt [esp - 02h] 5B pop ebx 83 C3 1C add ebx, 1Ch FA cli 8B 2B mov ebp, [ebx]

This classic detection approach is effective when the virus code does not change significantly over time. Detection is also easier when viruses originate from the same source code, with only minor modifications and updates. Thus, a virus signature can be common to several virus variants. For example, Chernobyl/CIH versions 1.2, 1.3, and 1.4 differ mainly in the trigger date on which the malicious code becomes active and can be effectively detected by scanning for a single signature, namely the one shown above.

The virus writers and the antivirus software developers are engaged in an obfuscation-deobfuscation game. Virus writers try to obfuscate the "vanilla" virus so that signatures used by the antivirus software cannot detect these "morphed" viruses. Therefore, to detect an obfuscated virus, the virus scanners first must undo the obfuscation transformations used by the virus writers. In this game, virus writers are obfuscators and researchers working on malicious code detection are deobfuscators. A method to detect malicious code should be resistant to common obfuscation transformations. This paper introduces such a method. The main contributions of this paper include:

• The obfuscation-deobfuscation game and attacks on commercial virus scanners
We view malicious code detection as an obfuscation-deobfuscation game between the virus writers and the researchers working to detect malicious code. Background on some common obfuscation techniques used by virus writers is given in Section 3. We also have developed an obfuscator for executables. Surprisingly, the three commercial virus scanners we considered could be easily thwarted by simple obfuscation transformations (Section 4). For example, in some cases the Norton antivirus scanner could not even detect insertions of nop instructions.

• A general architecture for detecting malicious patterns in executables
We introduce a general architecture for detecting malicious patterns in executables. An overview of the architecture and its novel features is given in Section 5. External predicates and uninterpreted symbols are two important elements in our architecture. External predicates are used to summarize results of various static analyses, such as points-to and live-range analysis. We allow these external predicates to be referred in the abstraction patterns that describe the malicious code. Moreover, we allow uninterpreted symbols in patterns, which makes the method resistant to renaming, a common obfuscation transformation. Two key components of our architecture, the program annotator and the malicious code detector, are described in Sections 6 and 7 respectively.

• Prototype for x86 executables
We have implemented a prototype for detecting malicious patterns in x86 executables. The tool is called a static analyzer for executables or SAFE. We have successfully tried SAFE on multiple viruses; for brevity we report on our experience with four specific viruses. Experimental results (Section 8) demonstrate the efficacy of SAFE. There are several interesting directions we intend to pursue as future work, which are summarized in Section 9.

• Extensibility of analysis
SAFE depends heavily on static analysis techniques. As a result, the precision of the tool directly depends on the static analysis techniques that are integrated into it. In other words, SAFE is as good as the static analysis techniques it is built upon. For example, if SAFE uses the result of points-to analysis, it will be able to track values across memory references. In the absence of a points-to analyzer, SAFE makes the conservative assumption that a memory reference can access any memory location (i.e., everything points to everything). We have designed SAFE so that various static analysis techniques can be readily integrated into it. Several simple static analysis techniques are already implemented in SAFE.

2 Related Work

2.1 Theoretical Discussion

The theoretical limits of malicious code detection (specifically of virus detection) have been the focus of many researchers. Cohen [13] and Chess-White [14] showed that in general the problem of virus detection is undecidable. Similarly, several important static analysis problems are undecidable or computationally hard [31, 30].

However, the problem considered in this paper is slightly different than the one considered by Cohen [13] and Chess-White [14]. Assume that we are given a vanilla virus V which contains a malicious sequence of instructions τ. Next we are given an obfuscated version O(V) of the virus. The problem is to find whether there exists a sequence of instructions τ in O(V) which is "semantically equivalent" to τ. A recent result by Vadhan et al. [25] proves that in general program obfuscation is impossible. This leads us to believe that a computationally bounded adversary will not be able to obfuscate a virus to completely hide its malicious behavior. We will further explore these theoretical issues in the future.

2.2 Other Detection Techniques

Our work is closely related to previous results on static analysis techniques for verifying security properties of software [42, 47, 44, 39, 40, 48]. In a larger context, our work is similar to existing research on software verification [41, 45]. However, there are several important differences. First, viewing malicious code detection as an obfuscation-deobfuscation game is unique. The obfuscation-deobfuscation viewpoint lead us to explore obfuscation attacks upon commercial virus scanners. Second, to our knowledge, all existing work on static analysis techniques for verifying security properties analyze source code. On the other hand, our analysis technique works on executables. In certain contexts, such as virus detection, source code is not available. Finally, we believe that using uninterpreted variables in the specification of the malicious code is unique (Section 6.2).

Currie et al. looked at the problem of automatically checking the equivalence of DSP routines in the context of verifying the correctness of optimizing transformations [33]. Their approach is similar to ours, but they impose a set of simplifying assumptions for their simulation tool to execute with reasonable performance. Feng and Hu took this approach one step further by using a theorem prover to determine when to unroll loops [34]. In both cases the scope of the problem is limited to VLIW or DSP code and there is limited support for user-specified analyses. Our work is applied to x86 (IA-32) assembly and can take full advantage of static analyses available to the user of our SAFE tool. Necula adopts a similar approach based on comparing a transformed code sequence against the original code sequence in the setting of verifying the correctness of the GNU C compiler [35]. Using knowledge of the transformations performed by the compiler, equivalence between the compiler input and the compiler output is proven using a simulation relation. In our work, we require no a priori knowledge of the obfuscation transformations performed, as it would be unrealistic to expect such information in the presence of malicious code.

We plan to enhance our framework by using the ideas from existing work on type systems for assembly code. We are currently investigating Morrisett et al.'s Typed Assembly Language [27, 26]. We apply a simple type system (Section 6) to the binaries we analyze by manually inserting the type annotations. We are unaware of a compiler that can produce Typed Assembly Language, and thus we plan to support external type annotations to enhance the power of our static analysis.

Dynamic monitoring can also be used for malicious code detection. Cohen [13] and Chess-White [14] propose a virus detection model that executes code in a sandbox. Another approach rewrites the binary to introduce checks driven by an enforceable security policy [43] (known as the inline reference monitor or the IRM approach). We believe static analysis can be used to improve the efficiency of dynamic analysis techniques, e.g., static analysis can remove redundant checks in the IRM framework. We construct our models for executables similar to the work done in specification-based monitoring [29, 28], and apply our detection algorithm in a context-insensitive fashion. Other research used context-sensitive analysis by employing push-down systems (PDSs). Analyses described in [39, 40] use the model checking algorithms for pushdown systems [38] to verify security properties of programs. The data structures used in interprocedural slicing [36], interprocedural DFA [37], and Boolean programs [41] are hierarchically structured graphs and can be translated to pushdown systems.

2.3 Other Obfuscators

While deciding on the initial obfuscation techniques to focus on, we were influenced by several existing tools. Mistfall (by z0mbie) is a library for binary obfuscation, specifically written to blend malicious code into a host program [9]. It can encrypt, morph, and blend the virus code into the host program. Our binary obfuscator is very similar to Mistfall. Unfortunately, we could not successfully morph binaries using Mistfall, so we could not perform a direct comparison between our obfuscator and Mistfall. Burneye (by TESO) is a Linux binary encapsulation tool. Burneye encrypts a binary (possibly multiple times), and packages it into a new binary with an extraction tool [10]. In this paper, we have not considered encryption based obfuscation techniques. In the future, we will incorporate encryption based obfuscation techniques into our tool, by incorporating or extending existing libraries.

3 Background on Obfuscating Viruses

To detect obfuscated viruses, antivirus software have become more complex. This section discusses some common obfuscation transformations used by virus writers and how antivirus software have historically dealt with obfuscated viruses.

A polymorphic virus uses multiple techniques to prevent signature matching. First, the virus code is encrypted, and only a small in-clear routine is designed to decrypt the code before running the virus. When the polymorphic virus replicates itself by infecting another program, it encrypts the virus body with a newly-generated key, and it changes the decryption routine by generating new code for it. To obfuscate the decryption routine, several transformations are applied to it. These include: nop-insertion, code transposition (changing the order of instructions and placing jump instructions to maintain the original semantics), and register reassignment (permuting the register allocation). These transformations effectively change the virus signature (Figure 1), inhibiting effective signature scanning by an antivirus tool.

Original code		Obfuscated code
E8 00000000	call 0h	E8 00000000	call 0h
E8 00000000	call 0h	E8 00000000	call 0h
5B	pop ebx	5B	pop ebx
8D 4B 42	lea ecx, [ebx + 42h]	8D 4B 42	lea ecx, [ebx + 45h]
51	push ecx	90	nop
50	push eax	51	push ecx
50	push eax	50	push eax
0F01 4C 24 FE	sidt [esp - 02h]	50	push eax
5B	pop ebx	90	nop
83 C3 1C	add ebx, 1Ch	0F01 4C 24 FE	sidt [esp - 02h]
FA	cli	5B	pop ebx
8B 2B	mov ebp, [ebx]	83 C3 1C	add ebx, 1Ch
		90	nop
		FA	cli
		8B 2B	mov ebp, [ebx]

Signature		New signature
E800 0000 005B 8D4B 4251 5050		E800 0000 005B 8D4B 4290 5150
0F01 4C24 FE5B 83C3 1CFA 8B2B		5090 0F01 4C24 FE5B 83C3 1C90

Figure 1: Original code and obfuscated code from Chernobyl/CIH, and their corresponding signatures. Newly added instructions are highlighted.

The obfuscated code in Figure 1 will behave in the same manner as before since the nop instruction has no effect other than incrementing the program counter¹. However the signature has changed. Analysis can detect simple obfuscations, like nop-insertion, by using regular expressions instead of fixed signatures. To catch nop insertions, the signature should allow for any number of nops at instruction boundaries (Figure 2). In fact, most modern antivirus software use regular expressions as virus signatures.


E800	0000	00(90)*	5B(90)*
8D4B	42(90)*	51(90)*	50(90)*
50(90)*	0F01	4C24	FE(90)*
5B(90)*	83C3	1C(90)*	FA(90)*
8B2B

Figure 2: Extended signature to catch nop-insertion.

Antivirus software deals with polymorphic viruses by performing heuristic analyses of the code (such as checking only certain program locations for virus code, as most polymorphic viruses attach themselves only at the beginning or end of the executable binary [18]), and even emulating the program in a sandbox to catch the virus in action [17]. The emulation technique is effective because at some point during the execution of the infected program, the virus body appears decrypted in main memory, ready for execution; the detection comes down to frequently scanning the in-memory image of the program for virus signatures while the program executes.

Metamorphic viruses attempt to evade heuristic detection techniques by using more complex obfuscations. When they replicate, these viruses change their code in a variety of ways, such as code transposition, substitution of equivalent instruction sequences, and register reassignment [11, 7]. Furthermore, they can "weave" the virus code into the host program, making detection by traditional heuristics almost impossible since the virus code is mixed with program code and the virus entry point is no longer at the beginning of the program (these are designated as entry point obscuring (EPO) viruses [15]).

As virus writers employ more complex obfuscation techniques, heuristic virus-detection techniques are bound to fail. Therefore, there is need to perform a deeper analysis of malicious code based upon more sophisticated static-analysis techniques. In other words, inspection of the code to detect malicious patterns should use structures that are closer to the semantics of the code, as purely syntactic techniques, such as regular expression matching, are no longer adequate.

3.1 The Suite of Viruses

We have analyzed multiple viruses using our tool, and discuss four of them in this paper. Descriptions of these viruses are given below.

3.1.1 Detailed Description of the Viruses

Chernobyl (CIH)
According to the Symantec Antivirus Reseach Center (SARC), Chernobyl/CIH is a virus that infects 32-bit Windows 95/98/NT executable files [22]. When a user executes an infected program under Windows 95/98/ME, the virus becomes resident in memory. Once the virus is resident, CIH infects other files when they are accessed. Infected files may have the same size as the original files because of CIH's unique mode of infection: the virus searches for empty, unused spaces in the file². Next it breaks itself up into smaller pieces and inserts its code into these unused spaces. Chernobyl has two different payloads: the first one overwrites the hard disk with random data, starting at the beginning of the disk (sector 0) using an infinite loop. The second payload tries to cause permanent damage to the computer by corrupting the Flash BIOS.

zombie-6.b
The z0mbie-6.b virus includes an interesting feature -- the polymorphic engine hides every piece of the virus, and the virus code is added to the infected file as a chain of differently-sized routines, making standard signature detection techniques almost useless.

f0sf0r0
The f0sf0r0 virus uses a polymorphic engine combined with an EPO technique to hide its entry point. According to Kaspersky Labs [21], when an infected file is run and the virus code gains control, it searches for portable executable files in the system directories and infects them. While infecting, the virus encrypts itself with a polymorphic loop and writes a result to the end of the file. To gain control when the infected file is run, the virus does not modify the program's start address, but instead writes a "jmp <virus_entry>" instruction into the middle of the file.

Hare
Finally, the Hare virus infects the bootloader sectors of floppy disks and hard drives, as well as executable programs. When the payload is triggered, the virus overwrites random sectors on the hard disk, making the data inaccessible. The virus spreads by polymorphically changing its decryption routine and encrypting its main body.

The Hare and Chernobyl/CIH viruses are well known in the antivirus community, with their presence in the wild peaking in 1996 and 1998, respectively. In spite of this, we discovered that current commercial virus scanners could not detect slightly obfuscated versions of these viruses.

4 Obfuscation Attacks on Commercial Virus Scanners

We tested three commercial virus scanners against several common obfuscation transformations. To test the resilience of commercial virus scanners to common obfuscation transformations, we have developed an obfuscator for binaries. Our obfuscator supports four common obfuscation transformations: dead-code insertion, code transposition, register reassignment, and instruction substitution. While there are other generic obfuscation techniques [12, 23], those described here seem to be preferred by malicious code writers, possibly because implementing them is easy and they add little to the memory footprint.

4.1 Common Obfuscation Transformations

4.1.1 Dead-Code Insertion

Also known as trash insertion, dead-code insertion adds code to a program without modifying its behavior. Inserting a sequence of nop instructions is the simplest example. More interesting obfuscations involve constructing challenging code sequences that modify the program state, only to restore it immediately.

Some code sequences are designed to fool antivirus software that solely rely on signature matching as their detection mechanism. Other code sequences are complicated enough to make automatic analysis very time-consuming, if not impossible. For example, passing values through memory rather than registers or the stack requires accurate pointer analysis to recover values. The example shown in Figure 3 should clarify this. The code marked by (*) can be easily eliminated by automated analysis. On the other hand, the second and third insertions, marked by (**), do cancel out but the analysis is more complex. Our obfuscator supports dead-code insertion.

Not all dead-code sequence can be detected and eliminated, as this problem reduces to program equivalence (i.e., is this code sequence equivalent to an empty program?), which is undecidable. We believe that many common dead-code sequences can be detected and eliminated with acceptable performance. To quote the documentation of the RPME virus permutation engine [8],

[T]rash [does not make the] program more complex [...]. If [the] detecting algorithm will be written such as I think, then there is no difference between NOP and more complex trash.

Our detection tool, SAFE, identifies several kinds of such dead-code segments.

Original code	Code obfuscated through dead-code insertion		Code obfuscated through code transposition		Code obfuscated through instruction substitution
call 0h	call 0h			call 0h	call 0h
pop ebx	pop ebx			pop ebx	pop ebx
lea ecx, [ebx+42h]	lea ecx, [ebx+42h]			jmp S2	lea ecx, [ebx+42h]
push ecx	nop	(*)	S3:	push eax	sub esp, 03h
push eax	nop	(*)		push eax	sidt [esp - 02h]
push eax	push ecx			sidt [esp - 02h]	add [esp], 1Ch
sidt [esp - 02h]	push eax			jmp S4	mov ebx, [esp]
pop ebx	inc eax	(**)		add ebx, 1Ch	inc esp
add ebx, 1Ch	push eax			jmp S6	cli
cli	dec [esp - 0h]	(**)	S2:	lea ecx, [ebx+42h ]	mov ebp, [ebx]
mov ebp, [ebx]	dec eax	(**)		push ecx
	sidt [esp - 02h]			jmp S3
	pop ebx		S4:	pop ebx
	add ebx, 1Ch			cli
	cli			jmp S5
	mov ebp, [ebx]		S5:	mov ebp, [ebx]

Figure 3: Examples of obfuscation through dead-code insertion, code transposition, and instruction substitution. Newly added instructions are highlighted.

4.1.2 Code Transposition

Code transposition shuffles the instructions so that the order in the binary image is different from the execution order, or from the order of instructions assumed in the signature used by the antivirus software. To achieve the first variation, we randomly reorder the instructions and insert unconditional branches or jumps to restore the original control-flow. The second variation swaps instructions if they are not interdependent, similar to compiler code generation, but with the different goal of randomizing the instruction stream.

The two versions of this obfuscation technique differ in their complexity. The code transposition technique based upon unconditional branches is relatively easy to implement. The second technique that interchanges independent instructions is more complicated because the independence of instructions must be ascertained. On the analysis side, code transposition can complicate matters only for a human. Most automatic analysis tools (including ours) use an intermediate representation, such as the control flow graph (CFG) or the program dependence graph (PDG) [36], that is not sensitive to superfluous changes in control flow. Note that an optimizer acts as a deobfuscator in this case by finding the unnecessary unconditional branches and removing them from the program code. Currently, our obfuscator supports only code transposition based upon inserting unconditional branches.

4.1.3 Register Reassignment

The register reassignment transformation replaces usage of one register with another in a specific live range. This technique exchanges register names and has no other effect on program behavior. For example, if register ebx is dead throughout a given live range of the register eax, it can replace eax in that live range. In certain cases, register reassignment requires insertion of prologue and epilogue code around the live range to restore the state of various registers. Our binary obfuscator supports this code transformation.

The purpose of this transformation is to subvert the antivirus software analyses that rely upon signature-matching. There is no real obfuscatory value gained in this process. Conceptually, the deobfuscation challenge is equally complex before or after the register reassignment.

4.1.4 Instruction Substitution

This obfuscation technique uses a dictionary of equivalent instruction sequences to replace one instruction sequence with another. Since this transformation relies upon human knowledge of equivalent instructions, it poses the toughest challenge for automatic detection of malicious code. The IA-32 instruction set is especially rich, and provides several ways of performing the same operation. Coupled with several architecturally ambivalent features (e.g., a memory-based stack that can be accessed both as a stack using dedicated instructions and as a memory area using standard memory operations), the IA-32 assembly language provides ample opportunity for instruction substitution.

To handle obfuscation based upon instruction substitution, an analysis tool must maintain a dictionary of equivalent instruction sequences, similar to the dictionary used to generate them. This is not a comprehensive solution, but it can cope with the common cases. In the case of IA-32, the problem can be slightly simplified by using a simple intermediate language that "unwinds" the complex operations corresponding to each IA-32 instruction. In some cases, a theorem prover such as Simplify [49] or PVS [46] can also be used to prove that two sequences of instructions are equivalent.

4.2 Testing Commercial Antivirus Tools

We tested three commercial virus scanners using obfuscated versions of the four viruses described earlier. The results were quite surprising: a combination of nop-insertion and code transposition was enough to create obfuscated versions of the viruses that the commercial virus scanners could not detect. Moreover, the Norton antivirus software could not detect an obfuscated version of the Chernobyl virus using just nop-insertions. SAFE was resistant to the two obfuscation transformations. The results are summarized in Table 1. A √ indicates that the antivirus software detected the virus. A × means that the software did not detect the virus. Note that unobfuscated versions of all four viruses were detected by all the tools.

		Norton^® Antivirus 7.0	McAfee^® VirusScan 6.01	Command^® Antivirus 4.61.2	SAFE
Chernobyl	original	√	√	√	√
Chernobyl	obfuscated	×^[1]	×^[1,2]	×^[1,2]	√
z0mbie-6.b	original	√	√	√	√
z0mbie-6.b	obfuscated	×^[1,2]	×^[1,2]	×^[1,2]	√
f0sf0r0	original	√	√	√	√
f0sf0r0	obfuscated	×^[1,2]	×^[1,2]	×^[1,2]	√
Hare	original	√	√	√	√
Hare	obfuscated	×^[1,2]	×^[1,2]	×^[1,2]	√

Obfuscations considered:	^[1] = `nop`-insertion (a form of dead-code insertion)
	^[2] = code transposition

Table 1: Results of testing various virus scanners on obfuscated viruses.

5 Architecture

This section gives an overview of the architecture of SAFE (Figure 4). Subsequent sections provide detailed descriptions of the major components of SAFE.

Figure 4: Architecture of the static analyzer for executables (SAFE).

To detect malicious patterns in executables, we build an abstract representation of the malicious code (here a virus). The abstract representation is the "generalization" of the malicious code, e.g., it incorporates obfuscation transformations, such as superfluous changes in control flow and register reassignments. Similarly, one must construct an abstract representation of the executable in which we are trying to find a malicious pattern. Once the generalization of the malicious code and the abstract representation of the executable are created, we can then detect the malicious code in the executable. We now describe each component of SAFE.

Generalizing the malicious code: Building the malicious code automaton
The malicious code is generalized into an automaton with uninterpreted symbols. Uninterpreted symbols (Section 6.2) provide a generic way of representing data dependencies between variables without specifically referring to the storage location of each variable.

Pattern-definition loader
This component takes a library of abstraction patterns and creates an internal representation. These abstraction patterns are used as alphabet symbols by the malicious code automaton.

The executable loader
This component transforms the executable into an internal representation, here the collection of control flow graphs (CFGs), one for each program procedure. The executable loader (Figure 5) uses two off-the-shelf components, IDA Pro and CodeSurfer. IDA Pro (by DataRescue [20]) is a commercial interactive disassembler. CodeSurfer (by GrammaTech, Inc. [19]) is a program-understanding tool that performs a variety of static analyses. CodeSurfer provides an API for access to various structures, such as the CFGs and the call graph, and to results of a variety of static analyses, such as points-to analysis. In collaboration with GrammaTech, we have developed a connector that transforms IDA Pro internal structures into an intermediate form that CodeSurfer can parse.

Figure 5: Implementation of executable loader module.

The annotator
This component inputs a CFG from the executable and the set of abstraction patterns and produces an annotated CFG, the abstract representation of a program procedure. The annotated CFG includes information that indicates where a specific abstraction pattern was found in the executable. The annotator runs for each procedure in the program, transforming each CFG. Section 6 describes the annotator in detail.

The detector
This component computes whether the malicious code (represented by the malicious code automaton) appears in the abstract representation of the executable (created by the annotator). This component uses an algorithm based upon language containment and unification. Details can be found in Section 7.

Throughout the rest of the paper, the malicious code fragment shown in Figure 6 is used as a running example. This code fragment was extracted from the Chernobyl virus version 1.4.

Original Code

WVCTF:
	mov	eax, dr1
	mov	ebx, [eax+10h]
	mov	edi, [eax]
LOWVCTF:
	pop	ecx
	jecxz	SFMM
	mov	esi, ecx
	mov	eax, 0d601h
	pop	edx
	pop	ecx
	call	edi
	jmp	LOWVCTF
SFMM:
	pop	ebx
	pop	eax
	stc
	pushf

Figure 6: Original code fragment from Chernobyl virus version 1.4.

To obtain the obfuscated code fragment depicted (Figure 7), we applied the following obfuscation transformations: dead-code insertion, code transposition, and register reassignment. Incidentally, the three commercial antivirus software (Norton, McAfee, and Command) detected the original code fragment shown. However, the obfuscated version was not detected by any of the three commercial antivirus software.

Obfuscated Code

WVCTF:
	mov	eax, dr1
	jmp	Loc1
Loc2:
	mov	edi, [eax]
LOWVCTF:
	pop	ecx
	jecxz	SFMM
	nop
	mov	esi, ecx
	nop
	nop
	mov	eax, 0d601h
	jmp	Loc3
Loc1:
	mov	ebx, [eax+10h]
	jmp	Loc2
Loc3:
	pop	edx
	pop	ecx
	nop
	call	edi
	jmp	LOWVCTF
SFMM:
	pop	ebx
	pop	eax
	push	eax
	pop	eax
	stc
	pushf

Figure 7: Obfuscated version based upon code in Figure 6.

6 Program Annotator

This section describes the program annotator in detail and the data structures and static analysis concepts used in the detection algorithm. The program annotator inputs the CFG of the executable and a set of abstraction patterns and outputs an annotated CFG. The annotated CFG associates with each node n in the CFG a set of patterns that match the program at the point corresponding to the node n. The precise syntax for an abstraction pattern and the semantics of matching are provided later in the section.

Figure 8 shows the CFG and a simple annotated CFG corresponding to the obfuscated code from Figure 7. Note that one node in the annotated CFG can correspond to several nodes in the original CFG. For example, the nodes annotated with "IrrelevantInstr" corresponds to one or more nop instructions.

The annotations that appear in Figure 8 seem intuitive, but formulating them within a static-analysis framework requires formal definitions. We enhance the SAFE framework with a type system for x86 based on the typestate system described in [6]. However, other type systems designed for assembly languages, such as Typed Assembly Language [27, 26], could be used in the SAFE framework. Definitions, patterns, and the matching procedure are described in Sections 6.1, 6.2 and 6.3 respectively.

Figure 8: Control flow graph of obfuscated code fragment, and annotations.

6.1 Basic Definitions

This section provides the formal definitions used in the rest of the paper.

Program Points
An instruction I is a function application, I : τ₁ × … × τ_k → τ. While the type system does not preclude higher-order functions or function composition, it is important to note that most assembly languages (including x86) do not support these concepts. A program P is a sequence of instructions <I₁, …, I_N>. During program execution, the instructions are processed in the sequential order they appear in the program, with the exception of control-flow instructions that can change the sequential execution order. The index of the instruction in the program sequence is called a program point (or program counter), denoted by the function pc : {I₁, …, I_N} → [1, …, N], and defined as pc(I_j) = j, ∀ 1 ≤ j ≤ N. The set of all program points for program P is ProgramPoints(P) = {1, …, N}. The pc function provides a total ordering over the set of program instructions.

Control Flow Graph
A basic block B is a maximal sequence of instructions <I_l, …, I_m> that contains at most one control-flow instruction, which must appear at the end. Thus, the execution within a basic block is by definition sequential. Let V be the set of basic blocks for a program P, and E ⊆ V × V × { T, F } be the set of control flow transitions between basic blocks. Each edge is marked with either T or F corresponding to the condition (true or false) on which that edge is followed. Unconditional jumps have outgoing edges always marked with T. The directed graph CFG( P ) = <V, E > is called the control flow graph.

Predicates
Predicates are the mechanism by which we incorporate results of various static analyses such as live range and points-to analysis. These predicates can be used in the definition of abstraction patterns. Table 2 lists predicates that are currently available in our system. For example, code between two program points p₁ and p₂ can be verified as dead-code (Section 4.1.1) by checking that for every variable m that is live in the program range [p₁, p₂], its value at point p₂ is the same as its value at point p₁. The change in m's value between two program points p₁ and p₂ is denoted by Delta(m, p₁, p₂) and can be implemented using polyhedral analysis [24].

Dominators( B )

the set of basic blocks that dominate the basic block B

PostDominators( B )

the set of basic blocks that are dominated by the basic block B

Pred( B )

the set of basic blocks that immediately precede B

Succ( B )

the set of basic blocks that immediately follow B

First( B )

the first instruction of the basic block B

Last( B )

the last instruction of the basic block B

Previous( I )

∪_{B' ∈ Pred( B_I )} Last( B' )	if I = First( B_I )
-or-
I'	if B_I = < …, I', I, … >

Next( I )

∪_{B' ∈ Succ( B_I )} First( B' )	if I = Last( B_I )
-or-
I'	if B_I = < …, I, I', … >

Kills( p, a )

true if the instruction at program point p kills variable a

Uses( p, a )

true if the instruction at program point p uses variable a

Alias( p, x, y )

true if variable x is an alias for y at program point p

LiveRangeStart( p, a )

the set of program points that start the a's live range that includes p

LiveRangeEnd( p, a )

the set of program points that end the a's live range that includes p

Delta( p, m, n )

the difference between integer variables m and n at program point p

Delta( m, p_1, p_2 )

the change in m's value between program points p₁ and p₂

PointsTo( p, x, a )

true if variable x points to location of a at program point p

Table 2: Examples of static analysis predicates.

Explanations of the static analysis predicates shown in Table 2 are standard and can be found in a compiler textbook (such as [3]).

Instructions and Data Types
The type constructors build upon simple integer types (listed below as the ground class of types), and allow for array types (with two variations: the pointer-to-start-of-array type and the pointer-to-middle-of-array type), structures and unions, pointers, and functions. Two special types ⊥(n) and T(n) complete the type system lattice. ⊥(n) and T(n) represent types that are stored on n bits, with ⊥(n) being the least specific ("any") type and T(n) being the most specific type. Table 3 describes the constructors allowed in our type system.

τ	::	ground	Ground types
	\|	τ`[`n`]`	Pointer to the base of an array of type τ and of size n
	\|	τ`(`n`]`	Pointer into the middle of an array of type τ and of size n
	\|	τ `ptr`	Pointer to τ
	\|	s `{` μ₁, …, μ_k `}`	Structure (product of types of μ _i)
	\|	u`{`μ₁, …, μ_k`}`	Union
	\|	τ₁ × … × τ_k → τ	Function
	\|	T`(`n`)`	Top type of n bits
	\|	⊥`(`n`)`	Bottom type of n bits (type "any" of n bits)

μ	::	`(`l, τ i`)`	Member labeled l of type τ at offset i

ground	::	`int(`g`:`s`:`v`)` \| `uint(`g`:`s`:`v`)` \| …

Table 3: A simple type system.

The type μ(l, τ, i) represents the type of a field member of a structure. The field has a type τ (independent of the types of all other fields in the same structure), an offset i that uniquely determines the location of the field within the structure, and a label l that identifies the field within the structure (in some cases this label might be undefined).

Physical subtyping takes into account the layout of values in memory [4, 6]. If a type τ is a physical subtype of τ' (denoted it by τ ≤ τ'), then the memory layout of a value of type τ' is a prefix of the memory layout of a value of type τ. We will not describe the rules of physical subtyping here as we refer the reader to Xu's thesis [6] for a detailed account of the typestate system (including subtyping rules).

The type int(g:s:v) represents a signed integer, and it covers a wide variety of values within storage locations. It is parametrized using three parameters as follows: g represents the number of highest bits that are ignored, s is the number of middle bits that represent the sign, and v is the number of lowest bits that represent the value. Thus the type int(g:s:v) uses a total of g + s + v bits.

The type uint(g:s:v) represents an unsigned integer, and it is just a variation of int(g:s:v), with the middle s sign bits always set to zero.

The notation int(g:s:v) allows for the separation of the data and storage location type. In most assembly languages, it is possible to use a storage location larger than that required by the data type stored in it. For example, if a byte is stored right-aligned in a (32-bit) word, its associated type is int(24:1:7). This means that an instruction such as xor on least significant byte within 32-bit word will preserve the leftmost 24 bits of the 32-bit word, even though the instruction addresses the memory on 32-bit word boundary.

This separation between data and storage location raises the issue of alignment information, i.e., most computer systems require or prefer data to be at a memory address aligned to the data size. For example, 32-bit integers should be aligned on 4-byte boundaries, with the drawback that accessing an unaligned 32-bit integer leads to either a slowdown (due to several aligned memory accesses) or an exception that requires handling in software. Presently, we do not use alignment information as it does not seem to provide a significant covert way of changing the program flow.

Code		Type
call 0h
pop ebx		ebx : ⊥(32)
lea ecx, [ebx + 42h]		ecx : ⊥(32),
		ebx : ptr ⊥(32)
push ecx		ecx : ⊥(32)
push eax		eax : ⊥(32)
push eax		eax : ⊥(32)
sidt [esp - 02h]
pop ebx		eax : ⊥(32)
add ebx, 1Ch		ebx : int(0:1:31
cli
mov ebp, [ebx]		ebp : ⊥(32),
		ebx : ptr ⊥(32)

Figure 9: Inferred types from Chernobyl/CIH virus code.

Figure 9 shows the types for operands in a section of code from the Chernobyl/CIH virus. Table 4 illustrates the type system for Intel IA-32 architecture. There are other IA-32 data types that are not covered in Table 4, including bit strings, byte strings, 64- and 128-bit packed SIMD types, and BCD and packed BCD formats. The IA-32 logical address is a combination of a 16-bit segment selector and a 32-bit segment offset, thus its type is the cross product of a 16-bit unsigned integer and a 32-bit pointer.

IA-32 Datatype	Type Expression

byte unsigned int	uint(0:0:8)
word unsigned int	uint(0:0:16)
doubleword unsigned int	uint(0:0:32)
quadword unsigned int	uint(0:0:64)
double quadword unsigned int	uint(0:0:128)

byte signed int	int(0:1:7)
word signed int	int(0:1:15)
doubleword signed int	int(0:1:31)
quadword signed int	int(0:1:63)
double quadword signed int	int(0:1:127)

single precision float	float(0:1:31)
double precision float	float(0:1:63)
double extended precision float	float(0:1:79)

near pointer	⊥(32)
far pointer (logical address)	uint(0:0:16) × uint(0:0:32) → ⊥(48)

eax, ebx, ecx, edx	⊥(32)
esi, edi, ebp, esp	⊥(32)
eip	int(0:1:31)
cs, ds, ss, es, fs, gs	⊥(16)
ax, bx, cx, dx	⊥(16)
al, bl, cl, dl	⊥(8)
ah, bh, ch, dh	⊥(8)

Table 4: IA-32 datatypes and their corresponding expression in the type system from Table 3.

6.2 Abstraction Patterns

An abstraction pattern Γ is a 3-tuple (V, O, C), where V is a list of typed variables, O is a sequence of instructions, and C is a boolean expression combining one or more static analysis predicates over program points. Formally, a pattern Γ=(V, O, C) is a 3-tuple defined as follows:

V	=	{x₁ : τ₁, …, x_k : τ_k}
O	=	<I(v₁, …, v_m) \| I : τ₁ × … × τ_m → τ>
C	=	boolean expression involving static analysis predicates and logical operators

An instruction from the sequence O has a number of arguments (v_i)_i≥0, where each argument is either a literal value or a free variable x_j. We write Γ(x₁ : τ₁, …, x_k : τ_k) to denote the pattern Γ = (V, O, C) with free variables x₁, …, x_k. An example of a pattern is shown below.

This pattern represents two instructions that pop a register X off the stack and then add a constant value to it (0x03AF). Note the use of uninterpreted symbol X in the pattern. Use of the uninterpreted symbols in a pattern allows it to match multiple sequences of instructions, e.g., the patterns shown above matches any instantiation of the pattern where X is assigned a specific register. The type int(0:1:31) of X represents an integer with 31 bits of storage and one sign bit.

We define a binding B as a set of pairs [variable v, value x]. Formally, a binding B is defined as { [x,v] | x ∈ V, x : τ, v : τ', τ ≤ τ' }. If a pair [x, v] occurs in a binding B, then we write B(x) = v. Two bindings B₁ and B₂ are said to be compatible if they do not bind the same variable to different values:

Compatible(B₁, B₂) = ∀ x ∈ V . ( [x, y₁] ∈ B₁ ∧ [x, y₂] ∈ B₂ ) ⇒ ( y₁ = y₂ )

The union of two compatible bindings B₁ and B₂ includes all the pairs from both bindings. For incompatible bindings, the union operation returns an empty binding.

When matching an abstraction pattern against a sequence of instructions, we use unification to bind the free variables of Γ to actual values. The function:

Unify( <…, op_i( x_i,1, …, x_{i,n_i} ), …>_{1
≤ i ≤ m}, Γ )
returns a ``most general'' binding B if the instruction sequence <…, op_i( x_i,1, …, x_{i,n_i} ), …>_{1 ≤ i ≤
m} can be unified with the sequence of instructions O specified in the pattern Γ. If the two instruction sequences cannot be unified, Unify returns false. Definitions and algorithms related to unification are standard and can be found in [50].³

6.3 Annotator Operation

The annotator associates a set of matching patterns with each node in the CFG. The annotated CFG of a program procedure P with respect to a set of patterns Σ is denoted by P_Σ. Assume that a node n in the CFG corresponds to the program point p and the instruction at p is I_p. The annotator attempts to match the (possibly interprocedural) instruction sequence S(n) = <…, Previous²(I_p), Previous(I_p), I_p> with the patterns in the set Σ = {Γ₁, …, Γ_m}. The CFG node n is then labeled with the list of pairs of patterns and bindings that satisfy the following condition:

Annotation(n) = { [Γ, B] : Γ ∈ {Γ₁, …, Γ_m} ∧ B = Unify( S(n), Γ ) }

If Unify( S(n), Γ ) returns false (because unification is not possible), then the node n is not annotated with [Γ, B]. Note that a pattern Γ might appear several times (albeit with different bindings) in Annotation(n). However, the pair [Γ, B] is unique in the annotation set of a given node.

7 Detector

The detector takes as its inputs an annotated CFG for an executable program procedure and a malicious code automaton. If the malicious pattern described by the malicious code automaton is also found in the annotated CFG, the detector returns the sequence of instructions exhibiting the pattern. The detector returns no if the malicious pattern cannot be found in the annotated CFG.

7.1 The Malicious-Code Automaton

Intuitively, the malicious code automaton is a generalization of the vanilla virus, i.e., the malicious code automaton also represents obfuscated strains of the virus. Formally, a malicious code automaton (or MCA) A is a 6-tuple (V, Σ, S, δ, S₀, F), where

V = { v₁ : τ₁ , …, v_k : τ_k } is a set of typed variables,
Σ = {Γ₁, …, Γ_n} is a finite alphabet of patterns parametrized by variables from V, for 1 ≤ i ≤ n, P_i = (V_i, O_i, C_i) where V_i ⊆ V,
S is a finite set of states,
δ : S × Σ → 2^S is a transition function,
S₀ ⊆ S is a non-empty set of initial states,
F ⊆ S is a non-empty set of final states.

An MCA is a generalization of an ordinary finite-state automaton in which the alphabets are a finite set of patterns defined over a set of typed variables. Given a binding B for the variables V={v₁, …, v_k }, the finite-state automaton obtained by substituting B(v_i) for v_i for all 1 ≤ i ≤ k in A is denoted by B(A). Note that B(A) is a "vanilla" finite-state automaton. We explain this using an example. Consider the MCA A shown in Figure 10 with V={A, B, C, D}. The automata obtained from A corresponding to the bindings B₁ and B₂ are shown in Figure 10. The uninterpreted variables in the MCA were introduced to handle obfuscation transformations based on register reassignment. The malicious code automaton corresponding to the code fragment shown in Figure 6 (from the Chernobyl virus) is depicted in Figure 11.

mov esi, ecx

mov eax, 0d601h

pop edx

pop ecx

B₁ = { [A, esi],

[B, ecx],

[C, eax],

[D, edx] }

mov esi, eax

mov ebx, 0d601h

pop ecx

pop eax

B₂ = { [A, esi],

[B, eax],

[C, ebx],

[D, ecx] }

Figure 10: Malicious code automaton for a Chernobyl virus code fragment, and instantiations with different register assignments, shown with their respective bindings.

Figure 11: Malicious code automaton corresponding to code fragment from Figure 6.

7.2 Detector Operation

The detector takes as its inputs the annotated CFG P_Σ of a program procedure P and a malicious code automaton MCA A=(V, Σ, S,δ, S₀, F). Note that the set of patterns Σ is used both to construct the annotated CFG and as the alphabet of the malicious code automaton. Intuitively, the detector determines whether there exists a malicious pattern that occurs in A and P_Σ. We formalize this intuitive notion. The annotated CFG P_Σ is a finite-state automaton where nodes are states, edges represent transitions, the node corresponding to the entry point is the initial state, and every node is a final state. Our detector determines whether the following language is empty:

In the expression given above, L(P_Σ) is the language corresponding to the annotated CFG and B_All is the set of all bindings to the variables in the set V. In other words, the detector determines whether there exists a binding B such that the intersection of the languages P_Σ and B(A) is non-empty.

Our detection algorithm is very similar to the classic algorithm for determining whether the intersection of two regular languages is non-empty [51]. However, due to the presence of variables, we must perform unification during the algorithm. Our algorithm (Figure 12) combines the classic algorithm for computing the intersection of two regular languages with unification. We have implemented the algorithm as a data-flow analysis.

For each node n of the annotated CFG P_A we associate pre and post lists L_n^pre and L_n^post respectively. Each element of a list is a pair [s,B], where s is the state of the MCA A and B is the binding of variables. Intuitively, if [s,B] ∈ L_n^pre, then it is possible for A with the binding B (i.e. for B(A)) to be in state s just before node n.
Initial condition: Initially, both lists associated with all nodes except the start node n₀ are empty. The pre list associated with the start node is the list of all pairs [s, ∅], where s is an initial state of the MCA A, and the post list associated with the start node is empty.
The do-until loop: The do-until loop updates the pre and post lists of all the nodes. At the end of the loop, the worklist WS contains the set of nodes whose pre or post information has changed. The loop executes until the pre and post information associated with the nodes does not change, and a fixed point is reached. The join operation that computes L_i^pre takes the list of state-binding pairs from all of the L_j^post sets for program points preceding i and copies them to L_i^pre only if there are no repeated states. In case of repeated states, the conflicting pairs are merged into a single pair only if the bindings are compatible. If the bindings are incompatible, both pairs are thrown out.
Diagnostic feedback: Suppose our algorithm returns a non-empty set, meaning a malicious pattern is common to the annotated CFG P_Σ and MCA A. In this case, we return the sequence of instructions in the executable corresponding to the malicious pattern. This is achieved by keeping an additional structure with the algorithm. Every time the post list for a node n is updated by taking a transition in A (see the statement 14 in Figure 12), we store the predecessor of the added state, i.e., if [δ(s, Γ), B_s ∪ B] is added to L_n^post, then we add an edge from s to δ(s, Γ) (along with the binding B_s ∪ B) in the associated structure. Suppose we detect that L_n^post contains a state [s, B_s], where s is a final state of the MCA A. Then we traceback the associated structure from s until we reach an initial state of A (storing the instructions occurring along the way).

Input: A list of patterns Σ = {P₁, …, P_r}, a malicious code automaton A = (V, Σ, S, δ, S₀, F), and an annotated CFG P_Σ = <N, E>.
Output: true if the program is likely infected, false otherwise.

MaliciousCodeChecking(Σ, A, P_Σ)

(1)	L^pre_n₀ ← { [s, ∅] \| s ∈ S₀ }, where n₀ ∈ N is the entry node of P_Σ
(2)	foreach n ∈ N \ { n₀ } do L_n^pre ← ∅
(3)	foreach n ∈ N do L_n^post ← ∅
(4)	WS ← ∅
(5)	do
(6)	WS_old ← WS
(7)	WS ← ∅
(8)	foreach n ∈ N	// update pre information
(9)	if L_n^pre ≠ ∪_{m ∈ Previous(n)} L_m^post
(10)	L_n^pre ← ∪_{m ∈ Previous(n)} L_m^post
(11)	WS ← WS ∪ { n }
(12)	foreach n ∈ N	// update post information
(13)	NewL_n^post ← ∅
(14)	foreach [s, B_s] ∈ L_n^pre
(15)	foreach [Γ, B] ∈ Annotation(n) ∧ Compatible(B_s, B)	// follow a transition
(16)	add [δ( s, Γ ), B_s ∪ B] to NewL_n^post
(17)	if L_n^post ≠ NewL_n^post
(18)	L_n^post ← NewL_n^post
(19)	WS ← WS ∪ { n }
(20)	until WS = ∅
(21)	return ∃ n ∈ N . ∃ [s, B_s] ∈ L_n^post . s ∈ F

Figure 12: Algorithm to check a program model against a malicious code specification.

8 Experimental Data

The three major goals of our experiments were to measure the execution time of our tool and find the false positive and negative rates. We constructed ten obfuscated versions of the four viruses. Let V_{i, k} (for 1 ≤ i ≤ 4 and 1 ≤ k ≤ 10) denote the k-th version of the i-th virus. The obfuscated versions were created by varying the obfuscation parameters, e.g., number of nops and inserted jumps. For the i-th virus, V_{i, 1} denoted the "vanilla" or the unobfuscated version of the virus. Let M₁, M₂, M₃ and M₄ be the malicious code automata corresponding to the four viruses.

8.1 Testing Environment

The testing environment consisted of a Microsoft Windows 2000 machine. The hardware configuration included an AMD Athlon 1 GHz processor and 1 GB of RAM. We used CodeSurfer version 1.5 patchlevel 0 and IDA Pro version 4.1.7.600.

8.2 Testing on Malicious Code

We will describe the testing with respect to the first virus. The testing for the other viruses is analogous. First, we ran SAFE on the 10 versions of the first virus V_1,1, …, V_1,10 with malicious code automaton M₁. This experiment gave us the false negative rate, i.e., the pattern corresponding to M₁ should be detected in all versions of the virus.

	Annotator		Detector
	avg.	(std. dev.)	avg.	(std. dev.)
Chernobyl	1.444 s	(0.497 s)	0.535 s	(0.043 s)
z0mbie-6.b	4.600 s	(2.059 s)	1.149 s	(0.041 s)
f0sf0r0	4.900 s	(2.844 s)	0.923 s	(0.192 s)
Hare	9.142 s	(1.551 s)	1.604 s	(0.104 s)

Table 5: SAFE performance when checking obfuscated viruses for false negatives.

Next, we executed SAFE on the versions of the viruses V_i,k with the malicious code automaton M_j (where i ≠ j). This helped us find the false positive rate of SAFE.

In our experiments, we found that SAFE's false positive and negative rate were 0. We also measured the execution times for each run. Since IDA Pro and CodeSurfer were not implemented by us, we did not measure the execution times for these components. We report the average and standard deviation of the execution times in Tables 5 and 6.

	Annotator		Detector
	avg.	(std. dev.)	avg.	(std. dev.)
z0mbie-6.b	3.400 s	(1.428 s)	1.400 s	(0.420 s)
f0sf0r0	4.900 s	(1.136 s)	0.840 s	(0.082 s)
Hare	1.000 s	(0.000 s)	0.220 s	(0.019 s)

Table 6: SAFE performance when checking obfuscated viruses for false positives against the Chernobyl/CIH virus.

8.3 Testing on Benign Code

We considered a suite of benign programs (see Section 8.3.1 for descriptions). For each benign program, we executed SAFE on the malicious code automaton corresponding to the four viruses. Our detector reported "negative" in each case, i.e., the false positive rate is 0. The average and variance of the execution times are reported in Table 7. As can be seen from the results, for certain cases the execution times are unacceptably large. We will address performance enhancements to SAFE in the future.

8.3.1 Descriptions of the Benign Executables

tiffdither.exe is a command line utility in the \textit{cygwin} toolkit version 1.3.70, a UNIX environment for Windows developed by Red Hat.
winmine.exe is the Microsoft Windows 2000 Minesweeper game, version 5.0.2135.1.
spyxx.exe is a Microsoft Visual Studio 6.0 Spy++ utility, that allows the querying of properties and monitoring of messages of Windows applications. The executable we tested was marked as version 6.0.8168.0.
QuickTimePlayer.exe is part of the Apple QuickTime media player, version 5.0.2.15.

	Executable size	`.text` size	Procedure count	Annotator		Detector
	Executable size	`.text` size	Procedure count	avg.	(std. dev.)	avg.	(std. dev.)
tiffdither.exe	9,216 B	6,656 B	29	6.333 s	(0.471 s)	1.030 s	(0.043 s)
winmine.exe	96,528 B	12,120 B	85	15.667 s	(1.700 s)	2.283 s	(0.131 s)
spyxx.exe	499,768 B	307,200 B	1,765	193.667 s	(11.557 s)	30.917 s	(6.625 s)
QuickTimePlayer.exe	1,043,968 B	499,712 B	4,767	799.333 s	(5.437 s)	160.580 s	(4.455 s)

Table 7: SAFE performance in seconds when checking clean programs against the Chernobyl/CIH virus.

9 Conclusion and Future Work

We presented a unique view of malicious code detection as a obfuscation-deobfuscation game. We used this viewpoint to explore obfuscation attacks on commercial virus scanners, and found that three popular virus scanners were susceptible to these attacks. We presented a static analysis framework for detecting malicious code patterns in executables. Based upon our framework, we have implemented SAFE, a static analyzer for executables that detects malicious patterns in executables and is resilient to common obfuscation transformations.

For future work, we will investigate the use of theorem provers during the construction of the annotated CFG. For instance, SLAM [41] uses the theorem prover Simplify [49] for predicate abstraction of C programs. Our detection algorithm is context insensitive and does not track the calling context of the executable. We will investigate the use of push-down systems, which would make our algorithm context sensitive. However, the existing PDS formalism does not allow uninterpreted variables, so it will have to be extended to be used in our context.

Availability
The SAFE prototype remains in development and we are not distributing it at this time. Please contact Mihai Christodorescu, <mihai@cs.wisc.edu>, for further updates.

Acknowledgments
We would like to thank Thomas Reps and Jonathon Giffin for providing us with invaluable comments on earlier drafts of the paper. We would also like to thank the members and collaborators of the Wisconsin Safety Analyzer (WiSA) research group for their insightful feedback and support throughout the development of this work.

Bibliography

[1]: S. Staniford, V. Paxson, and N. Weaver. How to 0wn the internet in your spare time. In Proceedings of the 11th USENIX Security Symposium (Security '02), pages 149 - 167. USENIX, August 2002.
[2]: CERT Coordination Center. Denial of service attacks, 2001.
[ .html ]
[3]: S.S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[4]: S. Chandra and T.W. Reps. Physical type checking for C. In Workshop on Program Analysis For Software Tools and Engineering, pages 66 - 75. ACM SIGPLAN - SIGSOFT, September 1999.
[5]: R. Wang. Flash in the pan? Virus Bulletin, July 1998. Virus Analysis Library.
[6]: Z. Xu. Safety-Checking of Machine Code. PhD thesis, University of Wisconsin, Madison, 2000.
[7]: z0mbie. z0mbie's homepage.
[ http ]
[8]: z0mbie. RPME mutation engine.
[ http ]
[9]: z0mbie. Automated reverse engineering: Mistfall engine.
[ .txt ]
[10]: TESO. burneye elf encryption program.
[ http ]
[11]: P. Sz�r and P. Ferrie. Hunting for metamorphic. In Proceedings of Virus Bulletin Conference, pages 123 - 144, September 2001.
[12]: C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical Report 148, Department of Computer Sciences, The University of Auckland, July 1997.
[13]: F. Cohen. Computer viruses: Theory and experiments. Computers and Security, 6:22 - 35, 1987.
[14]: D.M. Chess and S.R. White. An undetectable computer virus. In Proceedings of Virus Bulletin Conference, 2000.
[15]: E. Kaspersky. Viruses Without an Entry Point, volume Virus List Encyclopaedia, chapter Ways of Infection. Kaspersky Labs, 2002.
[ http ]
[16]: G. McGraw and G. Morrisett. Attacking malicious code: Report to the infosec research council. IEEE Software, 17(5):33 - 41, September/October 2000.
[17]: C. Nachenberg. Polymorphic virus detection module. United States Patent # 5,696,822, December 9, 1997.
[18]: C. Nachenberg. Polymorphic virus detection module. United States Patent # 5,826,013, October 20, 1998.
[19]: GrammaTech Inc. Codesurfer - code analysis and understanding tool.
[ .html ]
[20]: DataRescue sa/nv. Ida pro - interactive disassembler.
[ http ]
[21]: Kaspersky Labs.
[ http ]
[22]: M. Samamura. W95.CIH, volume Expanded Threat List and Virus Encyclopaedia. Symantec Antivirus Research Center, 1998.
[ .html ]
[23]: C. Collberg, C. Thomborson, and D. Low. Manufacturing cheap, resilient, and stealthy opaque constructs. In Principles of Programming Languages 1998, POPL'98, San Diego, CA, January 1998.
[24]: P. Cousot and N. Halbwachs. Automatic discovery of linear restraints among variables of a program. In Proceedings of 5th ACM Symposium on Principles of Programming Languages, POPL '78, Tucson, AZ, January 1978.
[25]: B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. Vadhan, and K. Yang. On the (im)possibility of obfuscating programs. In Advances in Cryptology - CRYPTO `01, volume 2139 of Lecture Notes in Computer Science, pages 1 - 18, Santa Barbara, CA, August 2001. Springer-Verlag.
[26]: G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to Typed Assembly Language. In Twenty-Fifth ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 85 - 97, San Diego, CA, January 1998.
[27]: G. Morrisett, K. Crary, N. Glew, and D. Walker. Stack-Based Typed Assembly Language. In Xavier Leroy and Atsushi Ohori, editors, 1998 Workshop on Types in Compilation, volume 1473 of Lecture Notes in Computer Science, pages 28 - 52, Kyoto, Japan, March 1998. Springer-Verlag.
[28]: D. Wagner and D. Dean. Intrusion detection via static analysis. In 2001 IEEE Symposium on Security and Privacy, Oakland, CA, May 2001.
[29]: J.T. Giffin, S. Jha, and B.P. Miller. Detecting manipulated remote call streams. In 11th USENIX Security Symposium, San Francisco, CA, August 2002.
[30]: E.M. Myers. A precise interprocedural data flow algorithm. In Conference Record of the 8th Annual ACM Symposium on Principles of Programming Languages (POPL), pages 219 - 230, Williamsburg, VA, January 1981.
[31]: W. Landi. Undecidability of static analysis. ACM Letters on Programming Languages and Systems, 1(4):323 - 337, December 1992.
[32]: D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver. The spread of the Sapphire/Slammer worm.
[ .html ]
[33]: David W. Currie, Alan J. Hu, and Sreeranga Rajan. Automatic formal verification of dsp software. In Proceedings of the 37th conference on Design automation, pages 130-135. ACM Press, 2000.
[34]: Xiushan Feng and Alan J. Hu. Automatic formal verification for scheduled vliw code. In Proceedings of the joint conference on Languages, compilers and tools for embedded systems, pages 85-92. ACM Press, 2002.
[35]: George C. Necula. Translation validation for an optimizing compiler. In Proceedings of the ACM SIGPLAN '00 conference on Programming language design and implementation, pages 83-94. ACM Press, 2000.
[36]: S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(1):26-60, January 1990.
[37]: T. Reps, S. Horwitz, and M. Sagiv. Precise interprocedural dataflow analysis via graph reachability. In Symposium on Principles of Programming Languages, pages 49-61, New York, NY, 1995. ACM Press.
[38]: J. Esparza, D. Hansel, P. Rossmanith, and S. Schwoon. Efficient algorithms for model checking pushdown systems. In Proceedings of Computer-Aided Verification, volume 1855 of Lecture Notes in Computer Science, pages 232-247, July 2000.
[39]: H. Chen and D. Wagner. MOPS: an infrastructure for examining security properties of software. In 9th ACM Conference on Computer and Communications Security (CCS), November 2002.
[40]: T. Jensen, D.L. Metayer, and T. Thorn. Verification of control flow based security properties. In 1999 IEEE Symposium on Security and Privacy, May 1999.
[41]: T. Ball and S.K. Rajamani. Automatically validating temporal safety properties of interfaces. In SPIN 2001, Workshop on Model Checking of Software, 2001.
[42]: K. Ashcraft and D. Engler. Using programmer-written compiler extensions to catch security holes. In 2002 IEEE Symposium on Security and Privacy, pages 143-159, May 2002.
[43]: U. Erlingsson and F. B. Schneider. IRM enforcement of Java stack inspection. In 2000 IEEE Symposium on Security and Privacy, pages 246-255, May 2000.
[44]: B.V. Chess. Improving computer security using extending static checking. In 2002 IEEE Symposium on Security and Privacy, pages 160-173, May 2002.
[45]: J. Corbett, M. Dwyer, J. Hatcliff, C. Pasareanu, Robby, S. Laubach, and H. Zheng. Bandera: Extracting finite-state models from Java source code. In ICSE 2000: International Conference on Software Engineering, pages 439-448, 2000.
[46]: S. Owre, S. Rajan, J. Rushby, N. Shankar, and M. Srivas. PVS: Combining specification, proof checking, and model checking. In CAV 96: Computer-Aided Verification, pages 411-414, 1996.
[47]: M. Bishop and M. Dilger. Checking for race conditions in file accesses. Computing Systems, 9(2), 1996.
[48]: R.W. Lo, K.N. Levitt, and R.A. Olsson. MCF: A malicious code filter. Computers & Society, 14(6):541-566, 1995.
[49]: D. Detlefs, G. Nelson, and J. Saxe. Simplify theorem prover.
[ .html ]
[50]: M. Fitting. First-Order Logic and Automated Theorem Proving. Springer-Verlag, 1996.
[51]: J.E. Hopcroft, R. Motwani, and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2001.

Notes:

Note that the subroutine address computation had to be updated to take into account the new nops. This is a trivial computation and can be implemented by adding the number of inserted nops to the initial offset hard-coded in the virus-morphing code.
Most executable formats require that the various sections of the executable file start at certain aligned addresses, to respect the target platform's idiosyncrasies. The extra space between the end of one section and the beginning of the next is usually padded with nulls.
We use one-way matching which is simpler than full unification. Note that the instruction sequence does not contain any variables. We instantiate variables in the pattern so that they match the corresponding terms in the instruction sequence.

Mihai Christodorescu <mihai@cs.wisc.edu>

This paper can be found online at http://www.cs.wisc.edu/~mihai/my_work/papers/index.html#11.
This paper was converted from LaTeX using bibtex2html 1.57 and LaTeX2HTML 1.67.
Last modified: Mon May 12 18:55:12 CDT 2003

`E8 00000000`		`call 0h`
`5B`		`pop ebx`
`8D 4B 42`		`lea ecx, [ebx + 42h]`
`51`		`push ecx`
`50`		`push eax`
`50`		`push eax`
`0F01 4C 24 FE`		`sidt [esp - 02h]`
`5B`		`pop ebx`
`83 C3 1C`		`add ebx, 1Ch`
`FA`		`cli`
`8B 2B`		`mov ebp, [ebx]`

Static Analysis of Executables to Detect Malicious Patterns*