Sourcerer is an ongoing research project at the University of California, Irvine, aimed at exploring open source projects through the use of code analysis. The existence of an extremely large body of open source code presents a tremendous opportunity for software engineering research. Not only do we leverage this code for our own research, but we also provide the open source Sourcerer Infrastructure and curated datasets for other researchers to use.
The Sourcerer Infrastructure is composed of a number of layers.
- At the lowest layer, the core infrastructure is a set of Java tools for crawling, downloading, processing, and indexing open source Java projects. It is available on GitHub under a GNU GPL license.
- While the core infrastructure allows one to automatically crawl and download open source projects, we also provide our current repository to anyone interested.
- Once Sourcerer's repository is constructed, we populate Sourcerer DB, a relational database, with structure and reference information extracted from the project source code. The services that the Sourcerer Infrastructure provides, including the code search service, are all built on top of Sourcerer DB. We provide direct read-only access to Sourcerer DB to those who are interested.
- The Sourcerer Infrastructure provides a number of higher-level services that researchers or application designers can leverage to create rich next-generation search applications. These services include repository exploration, code search, and dependency slicing.
- The first application we built using the Sourcerer Infrastructure was a code search engine.
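To give a rough feel for how services such as code search and dependency slicing can sit on top of a relational store of structure and reference information, here is a minimal sketch using an in-memory SQLite database. The table names (`entities`, `relations`), their columns, the sample rows, and the helper functions are all hypothetical and chosen for illustration only; they are not the actual Sourcerer DB schema or API.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (
            entity_id   INTEGER PRIMARY KEY,
            fqn         TEXT NOT NULL,   -- fully qualified name
            entity_type TEXT NOT NULL
        );
        CREATE TABLE relations (
            lhs_id        INTEGER NOT NULL REFERENCES entities(entity_id),
            rhs_id        INTEGER NOT NULL REFERENCES entities(entity_id),
            relation_type TEXT NOT NULL
        );
    """)
    # Made-up sample data: four methods and a few call references.
    conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
        (1, "app.Main.main", "METHOD"),
        (2, "app.Parser.parse", "METHOD"),
        (3, "app.Lexer.next", "METHOD"),
        (4, "app.Logger.log", "METHOD"),
    ])
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
        (1, 2, "CALLS"),
        (2, 3, "CALLS"),
        (3, 2, "CALLS"),  # a cycle, to show that the slice query still terminates
    ])
    return conn


def search_by_name(conn, term):
    """Toy code search: entities whose fully qualified name contains `term`."""
    rows = conn.execute(
        "SELECT fqn FROM entities WHERE fqn LIKE ? ORDER BY fqn",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


def dependency_slice(conn, root_fqn):
    """Toy dependency slice: the root entity plus everything reachable from it."""
    rows = conn.execute(
        """
        WITH RECURSIVE slice(entity_id) AS (
            SELECT entity_id FROM entities WHERE fqn = ?
            UNION  -- UNION (not UNION ALL) drops repeats, so cycles terminate
            SELECT r.rhs_id
            FROM relations AS r
            JOIN slice AS s ON r.lhs_id = s.entity_id
        )
        SELECT e.fqn
        FROM entities AS e
        JOIN slice AS s ON e.entity_id = s.entity_id
        ORDER BY e.fqn
        """,
        (root_fqn,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_toy_db()
    print(search_by_name(db, "Logger"))
    print(dependency_slice(db, "app.Main.main"))
```

In this sketch both services are plain queries over the same two tables, which loosely mirrors the layering described above: the extracted structure and references are stored once, and each higher-level service is built on that store.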
Complete information, including Datasets, Tutorials, and a comprehensive Publication List, is available on The Sourcerer Project website.
This work received a Best Paper Award at Mining Software Repositories 2009 for “Mining Search Topics from a Code Search Engine Usage Log” by alumnus Sushil Krishna Bajracharya and Prof. Cristina Videira Lopes.