Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces
Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, and Thomas Reps
With the rise of machine learning, there is a great deal of interest in
treating programs as data to be fed to learning algorithms.
However, programs do not start off in a form that is immediately
amenable to most off-the-shelf learning techniques.
Instead, it is necessary to transform the program to a suitable
representation before a learning technique can be applied.
In this paper, we use abstractions of traces obtained from symbolic
execution of a program as a representation for learning word
embeddings.
We trained a variety of word embeddings under hundreds of
parameterizations, and evaluated each learned embedding on a suite of
different tasks.
In our evaluation, we obtain 93% top-1 accuracy on a benchmark
consisting of over 19,000 API-usage analogies extracted from the Linux
kernel.
In addition, we show that embeddings learned from (mainly) semantic
abstractions provide nearly triple the accuracy of those learned from
(mainly) syntactic abstractions.
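
To make the analogy benchmark concrete, the sketch below shows one common way to score an embedding on analogies of the form "a is to b as c is to ?": form the vector b - a + c, pick the nearest remaining word by cosine similarity, and count top-1 hits. This is only an illustration under stated assumptions. The three-dimensional vectors are invented by hand, the analogies are hypothetical examples rather than entries from the paper's benchmark, and the helper names (`cosine`, `solve_analogy`, `top1_accuracy`) are not taken from the authors' code. The paper's actual vocabulary and evaluation procedure may differ.

```python
import math

# Hand-made toy vectors (assumption: NOT learned embeddings from the paper).
# The third coordinate loosely encodes "acquire" (+1) versus "release" (-1).
TOY_EMBEDDINGS = {
    "mutex_lock":   [1.0, 0.0, 1.0],
    "mutex_unlock": [1.0, 0.0, -1.0],
    "spin_lock":    [0.0, 1.0, 1.0],
    "spin_unlock":  [0.0, 1.0, -1.0],
    "kmalloc":      [0.5, 0.5, 1.0],
    "kfree":        [0.5, 0.5, -1.0],
}

# Hypothetical analogies (a, b, c, expected d): "a is to b as c is to d".
TOY_ANALOGIES = [
    ("mutex_lock", "mutex_unlock", "spin_lock", "spin_unlock"),
    ("spin_lock", "spin_unlock", "mutex_lock", "mutex_unlock"),
    ("mutex_lock", "mutex_unlock", "kmalloc", "kfree"),
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Return the word closest to (b - a + c), excluding the three query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def top1_accuracy(embeddings, analogies):
    """Fraction of analogies whose best-ranked answer equals the expected word."""
    hits = sum(solve_analogy(embeddings, a, b, c) == d for a, b, c, d in analogies)
    return hits / len(analogies)


if __name__ == "__main__":
    print(solve_analogy(TOY_EMBEDDINGS, "mutex_lock", "mutex_unlock", "spin_lock"))
    print(top1_accuracy(TOY_EMBEDDINGS, TOY_ANALOGIES))
```

Excluding the three query words from the candidate set is a common convention in analogy evaluation. Without it, the nearest neighbor of b - a + c is often b or c itself.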