Detecting and Measuring Similarity in Code Clones

Randy Smith and Susan Horwitz

Most previous work on code-clone detection has focused on finding identical clones, or clones that are identical up to identifiers and literal values. However, it is often important to find similar clones as well. One challenge is that the definition of similarity depends on the context in which clones are being sought. We therefore propose new techniques for finding similar code blocks and for quantifying their similarity. Our techniques can be used to find clone clusters: sets of code blocks that all lie within a user-supplied similarity threshold of one another. In addition, given a single code block, we can find all similar blocks and present them ranked by similarity. Our techniques have been used in a clone-detection tool for C programs. The ideas could also be incorporated into many existing clone-detection tools to make their definitions of similar clones more flexible.
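
To make the two uses described above concrete (clusters under a threshold, and ranking against one query block), the following is a minimal illustrative sketch in Python. It is not the technique proposed in this paper: the similarity measure here (the matching ratio from the standard library's difflib.SequenceMatcher, applied to token sequences with identifiers and literals abstracted) is an assumption chosen only for brevity, and the names normalize, similarity, rank_similar, and clone_clusters are hypothetical. The keyword list is deliberately incomplete, and the greedy clustering is one simple way to obtain sets whose members are all pairwise within the threshold.

```python
import re
from difflib import SequenceMatcher

# Assumption: a small, incomplete set of C keywords kept as-is during normalization.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "char", "void"}
# Identifiers/keywords, numeric literals, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def normalize(code):
    """Tokenize a code block, replacing identifiers with ID and numbers with LIT."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok in C_KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("LIT")
        else:
            tokens.append(tok)
    return tokens


def similarity(a, b):
    """Assumed stand-in measure: a score in [0, 1], where 1.0 means the
    normalized token sequences are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def rank_similar(query, blocks):
    """Return (score, block) pairs ordered from most to least similar to query."""
    scored = [(similarity(query, b), b) for b in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def clone_clusters(blocks, threshold):
    """Greedily group blocks so that every pair in a cluster scores >= threshold.
    Clusters containing a single block are dropped."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if all(similarity(block, member) >= threshold for member in cluster):
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return [c for c in clusters if len(c) > 1]


if __name__ == "__main__":
    a = "x = x + 1;"
    b = "y = y + 2;"
    c = "while (n > 0) { n = n - 1; }"
    print(rank_similar(a, [c, b]))
    print(clone_clusters([a, b, c], threshold=0.9))
```

In this sketch, blocks that differ only in identifiers and literal values receive the maximum score, so lowering the threshold below 1.0 is what admits merely similar clones. A context-dependent definition of similarity would correspond to swapping in a different similarity function while leaving the ranking and clustering code unchanged.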