The Sourcerer Project

Project Dates: 
January 2006

Significant funding has been received from the National Science Foundation and the DARPA MUSE program.

Project Description: 

Sourcerer is an ongoing research project at the University of California, Irvine, aimed at exploring open source projects through the use of code analysis. The existence of an extremely large body of open source code presents a tremendous opportunity for software engineering research. Not only do we leverage this code for our own research, but we also provide the open source Sourcerer Infrastructure and curated datasets for other researchers to use.

The Sourcerer Infrastructure is composed of a number of layers.

  • At the lowest layer, the core infrastructure is a set of Java tools for crawling, downloading, processing, and indexing open source Java projects. It is available on GitHub under a GNU GPL license. 
  • While the core infrastructure allows one to automatically crawl and download open source projects, we also provide our current repository to anyone interested. 
  • Once Sourcerer's repository is constructed, we populate Sourcerer DB, a relational database, with structure and reference information extracted from the project source code. The services that the Sourcerer Infrastructure provides, including the code search service, are all built on top of Sourcerer DB. We provide direct read-only access to Sourcerer DB to those who are interested.
  • The Sourcerer Infrastructure provides a number of higher-level services that researchers or application designers can leverage to create rich next-generation search applications. These services include repository exploration, code search and dependency slicing. 
  • The first application we built using the Sourcerer Infrastructure was a code search engine.
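
To make the layering above concrete, the following is a minimal sketch of how a dependency-slicing service could sit on top of a relational store of structure and reference information. It is not the actual Sourcerer DB schema or Sourcerer code: the table names (entities, relations), the column names, the sample data, and the functions build_demo_db and dependency_slice are all hypothetical and chosen only for illustration. SQLite stands in for the real database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical entity/relation tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (
            entity_id   INTEGER PRIMARY KEY,
            fqn         TEXT NOT NULL,   -- fully qualified name (assumed column)
            entity_type TEXT NOT NULL    -- e.g. CLASS, METHOD (assumed values)
        );
        CREATE TABLE relations (
            lhs_id        INTEGER NOT NULL REFERENCES entities(entity_id),
            rhs_id        INTEGER NOT NULL REFERENCES entities(entity_id),
            relation_type TEXT NOT NULL  -- e.g. CALLS, EXTENDS (assumed values)
        );
    """)
    # Made-up sample data; not taken from any real project.
    conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
        (1, "demo.App.main", "METHOD"),
        (2, "demo.Parser.parse", "METHOD"),
        (3, "demo.Lexer.next", "METHOD"),
        (4, "demo.Unused.helper", "METHOD"),
    ])
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
        (1, 2, "CALLS"),
        (2, 3, "CALLS"),
        (3, 2, "CALLS"),  # a cycle, to show that the traversal still terminates
    ])
    return conn


def dependency_slice(conn, root_fqn):
    """Return the sorted FQNs of every entity reachable from root_fqn."""
    rows = conn.execute("""
        WITH RECURSIVE slice(entity_id) AS (
            SELECT entity_id FROM entities WHERE fqn = ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles end
            SELECT r.rhs_id
            FROM relations r JOIN slice s ON r.lhs_id = s.entity_id
        )
        SELECT e.fqn
        FROM entities e JOIN slice s ON e.entity_id = s.entity_id
        ORDER BY e.fqn
    """, (root_fqn,)).fetchall()
    return [fqn for (fqn,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    for fqn in dependency_slice(conn, "demo.App.main"):
        print(fqn)
```

In this sketch the slice for demo.App.main contains the entry point plus the two methods it transitively reaches, while demo.Unused.helper is left out. A real service would presumably also filter by relation type and project, which is omitted here.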

Complete information, including Datasets, Tutorials, and a comprehensive Publication List, is available on The Sourcerer Project website.

This work received a Best Paper Award at Mining Software Repositories 2009 for “Mining Search Topics From A Code Search Engine Usage Log” by alumnus Sushil Krishna Bajracharya and Prof. Cristina Videira Lopes.

Publications: 

Lopes, C. V., and J. Ossher, "How Scale Affects Structure in Java Programs", ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Pittsburgh, PA, ACM, pp. 675-694, October, 2015.
Sajnani, H., V. Saini, J. Ossher, and C. Videira Lopes, "Is Popularity a Measure of Quality? An Analysis of Maven Components", IEEE International Conference on Software Maintenance and Evolution (ICSME), Victoria, BC, Canada, pp. 231-240, September, 2014.
Lemos, O. A. L., A. C. de Paula, F. C. Zanichelli, and C. V. Lopes, "Thesaurus-Based Automatic Query Expansion for Interface-Driven Code Search", 11th Working Conference on Mining Software Repositories (MSR 2014), Hyderabad, India, ACM, pp. 212-221, May 31-June 1, 2014.
Ossher, J., and C. Lopes, "Applying Program Analysis to Code Retrieval", Finding Source Code on the Web for Remix and Reuse, New York, Springer, pp. 205-225, 2013.
Ossher, J., S. Bajracharya, and C. Lopes, "Automated dependency resolution for open source software", Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on, Cape Town, South Africa, pp. 130-140, 2-3 May 2010.
Bajracharya, S., and C. Lopes, "Mining search topics from a code search engine usage log", 6th IEEE International Working Conference on Mining Software Repositories, 2009 (MSR '09), pp. 111-120, 16-17 May 2009.
Lemos, O. A. L., S. Bajracharya, J. Ossher, P. C. Masiero, and C. Lopes, "Applying test-driven code search to the reuse of auxiliary functionality", Proceedings of the 2009 ACM symposium on Applied Computing, Honolulu, Hawaii, ACM, pp. 476-482, 2009.
Bajracharya, S., J. Ossher, and C. Lopes, "Sourcerer: An internet-scale software repository", ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, 2009 (SUITE '09), pp. 1-4, 16 May 2009.
Linstead, E., S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi, "Sourcerer: mining and searching internet-scale software repositories", Data Mining and Knowledge Discovery, vol. 18, no. 2: Springer Netherlands, pp. 300-336, 2009.
Ossher, J., S. Bajracharya, E. Linstead, P. Baldi, and C. Lopes, "SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects", International Workshop on Mining Software Repositories, Los Alamitos, CA, USA, IEEE Computer Society, pp. 183-186, 2009.
Linstead, E., P. Rigor, S. Bajracharya, C. Lopes, and P. Baldi, "Mining Concepts from Code with Probabilistic Topic Models", International Conference on Automated Software Engineering (ASE 2007), Atlanta, GA, November, 2007.
Bajracharya, S., T. Ngo, E. Linstead, Y. Dou, P. Baldi, and C. Lopes, "Sourcerer: A Search Engine for Open Source Code Supporting Structure-Based Search", OOPSLA '06: Companion to the 21st ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications: ACM Press, New York, NY, pp. 681-682, 2006.