SourcererCC: Scaling Type-3 Clone Detection to Large Software Repositories

Project Dates: 
January 2014
Project Description: 

Given the availability of large-scale source-code repositories, there have been a large number of applications for clone detection. Unfortunately, despite a decade of active research, there is a marked lack of clone detectors that scale to large software repositories, in particular for detecting near-miss clones, where significant editing may have taken place in the cloned code.

We developed SourcererCC, a token-based clone detector that targets the first three clone types and exploits an index to scale to large inter-project repositories on a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect clones, and the number of token comparisons needed to judge a potential clone.
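
As an illustration only, the index-and-filter idea in the paragraph above can be sketched in Python. This is not SourcererCC's implementation: the function names (min_overlap, detect_clones), the 80% overlap threshold, the use of distinct-token sets, and the rarest-first token ordering are all assumptions made for this sketch. It indexes only a short prefix of each block's ordered tokens, so blocks that share no prefix token are never compared.

```python
from collections import Counter, defaultdict


def min_overlap(size, percent):
    """Smallest number of shared tokens that reaches `percent` of `size` (integer ceiling)."""
    return (percent * size + 99) // 100


def detect_clones(blocks, percent=80):
    """Return sorted pairs of block ids whose distinct-token overlap meets the threshold.

    `blocks` maps a block id to its list of tokens. Hypothetical helper, not the tool's API.
    """
    # One global ordering for all blocks: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in blocks.values() for tok in set(toks))
    ordered = {
        bid: sorted(set(toks), key=lambda t: (freq[t], t))
        for bid, toks in blocks.items()
    }

    index = defaultdict(list)  # token -> ids of blocks already indexed under it
    clones = []
    for bid, toks in ordered.items():
        if not toks:
            continue
        # Only this prefix is queried and indexed; two blocks that meet the
        # threshold must share at least one token inside their prefixes.
        prefix = toks[: len(toks) - min_overlap(len(toks), percent) + 1]
        candidates = {other for tok in prefix for other in index.get(tok, ())}
        for other in candidates:
            required = min_overlap(max(len(toks), len(ordered[other])), percent)
            if len(set(toks) & set(ordered[other])) >= required:
                clones.append(tuple(sorted((other, bid))))
        for tok in prefix:
            index[tok].append(bid)
    return sorted(clones)
```

Ordering rare tokens first keeps the index lists short, which is one plausible reason a token-ordering heuristic cuts both index size and the number of candidate comparisons. A fuller version would also stop a token comparison early once the required overlap can no longer be reached.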

We evaluated the scalability, execution time, recall, and precision of SourcererCC and compared it to four publicly available, state-of-the-art tools. To measure recall, we used two recent benchmarks: (1) BigCloneBench, an exhaustive benchmark of real clones, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. SourcererCC has both high recall and precision, and scales to a large inter-project repository (250 MLOC) on a standard workstation.

This work builds on earlier Token-Based Code Clone Detection research.

Saini, V., H. Sajnani, J. Kim, and C. Lopes, "SourcererCC and SourcererCC-I: Tools to Detect Clones in Batch mode and During Software Development", 38th International Conference on Software Engineering, Companion Proceedings, Austin, TX, ACM, pp. 597-600, 2016.
Sajnani, H., V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, "SourcererCC: Scaling Code Clone Detection to Big Code", 38th International Conference on Software Engineering (ICSE 2016), Austin, TX, ACM, pp. 1157-1168, May, 2016.
Saini, V., H. Sajnani, and C. Lopes, "Comparing Quality Metrics for Cloned and Non-Cloned Java Methods: A Large Scale Empirical Study", IEEE 32nd International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, North Carolina, USA, IEEE, October, 2016.
Sajnani, H., "Large-Scale Code Clone Detection", Doctoral Dissertation: University of California, Irvine, 2016.
Sajnani, H., V. Saini, and C. Videira Lopes, "A Comparative Study of Bug Patterns in Java Cloned and Non-cloned Code", IEEE 14th International Working Conference on Source Code Analysis and Manipulation (SCAM), Victoria, BC, Canada, pp. 21-30, September, 2014.