Software Analytics at Scale

The DARPA Mining and Understanding Software Enclaves (MUSE) program seeks to make significant advances in the way software is built, debugged, verified, maintained, and understood.  Central to its approach is the creation of a community infrastructure built around a large, diverse, and evolving corpus of software drawn from the hundreds of billions of lines of open source code available today.

An integral part of the envisioned infrastructure is a continuously operational specification mining engine.  This engine will leverage deep program analyses and foundational ideas underlying big data analytics to populate and refine a database containing inferences about useful properties, behaviors, and vulnerabilities of the program components in the corpus.  The collective knowledge gleaned from this effort would facilitate new mechanisms for dramatically improving software reliability, and help develop radically different approaches for automatically constructing and repairing complex software.

Central to this vision is understanding the nature of these very large bodies of source code.  In machine learning, the results are only as good as the data fed into the system.  So, what do these repositories consist of?  Are all projects “good data”?  This is where Associate Director Prof. Crista Lopes comes in.  Her team is performing software analytics at scale in order to better understand how these very large repositories of open source code can best be used.

For example, in analyzing the existing MUSE corpus, which includes 150,000+ Java projects and 70,000+ C/C++ projects from various origins, Lopes and her team have discovered a large amount of code duplication.  Just within the non-fork projects coming from Github, they found the following:

  • out of 2.6M Java files, 62% of them have an exact- or near-duplicate within that subset of projects;
  • out of 14.5M C/C++ files, 90% of them have an exact- or near-duplicate within that subset of projects;
  • out of 18,000 Java projects, the entire source code of 9% of them can be found in other projects;
  • out of 42,000 C/C++ projects, the entire source code of 11% of them can be found in other projects.

When analyzing the entire corpus, which contains projects from many origins, these figures are even higher, because many open source projects have replicas in different software repositories.  Data mining and machine learning efforts need to be informed about this duplication and avoid it, or target it, depending on what they are trying to achieve.
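The first step in accounting for duplication is filtering exact duplicates.  The sketch below shows one simple way to do this, keeping a single representative per group of byte-identical files by hashing their contents; it is an illustrative example, not the pipeline actually used on the MUSE corpus.

```python
import hashlib

def dedupe_files(files):
    """Keep one representative per group of exact-duplicate files.

    `files` maps a file path to its source text; returns the paths
    whose content had not been seen before (first occurrence wins).
    """
    seen = set()
    kept = []
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```

For example, `dedupe_files({"a.java": "class A {}", "b.java": "class A {}", "c.java": "class C {}"})` keeps only `a.java` and `c.java`.  Near-duplicates, which dominate the figures above, require token-level comparison rather than whole-file hashing.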

Clone detection at scale is performed using SourcererCC, a publicly available clone detection tool developed by Lopes’ team.  The tool is currently being improved and expanded to work on code written in any programming language.  Besides its use in the MUSE program, SourcererCC can also be used to detect plagiarism and license violations.
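SourcererCC detects near-duplicates by comparing the tokens of code fragments rather than their exact bytes.  The sketch below illustrates the general idea of token-bag overlap against a similarity threshold; the function names and the 70% threshold are illustrative assumptions, not SourcererCC's actual implementation.

```python
from collections import Counter

def token_overlap(tokens_a, tokens_b):
    """Similarity of two token multisets: shared tokens over the larger bag."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    shared = sum((a & b).values())  # multiset intersection
    return shared / max(sum(a.values()), sum(b.values()))

def is_near_duplicate(tokens_a, tokens_b, threshold=0.7):
    """Two fragments are near-duplicates if enough of their tokens overlap."""
    return token_overlap(tokens_a, tokens_b) >= threshold
```

Two snippets that differ only in variable names, such as `int x = 0; return x;` and `int y = 0; return y;`, share most of their tokens and would be flagged as near-duplicates, while unrelated code would not.  The engineering challenge SourcererCC addresses is making such comparisons tractable across millions of files, using indexing and filtering rather than all-pairs comparison.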

In addition to clone detection, the buildability of projects is another important characteristic for software analytics: projects that build successfully are usually more valuable to learn from than projects that do not.  Prof. Lopes and her team are leading the way in devising heuristics for automatically building projects at scale, without manual intervention.  Out of the 134,000+ non-empty Java projects, they were able to successfully resolve dependencies and build 31% of them.  This ongoing effort is expected to increase the build rate over the next two years.  Similarly, finding projects with test cases is also important for many uses of large corpora.

Prof. Lopes’ MUSE team includes post-doctoral researcher Pedro Martins, recent Ph.D. graduate Hitesh Sajnani (now at Microsoft Research), and graduate students Rohan Achar, Vaibhav Saini, and Di Yang.

More about Prof. Lopes and her team can be found at

This article appeared in ISR Connector issue: