One of the many challenges of software development and maintenance is the need to collaborate among many constituents and stakeholders. For example, clients interact with software development organizations; software-development organizations consist of many developers and maintainers within the same location and across different locations; and the development organization often outsources some of the testing efforts to independent test agencies. Each of these parties may reside in different locations, often across many very disparate time zones.
We developed a fault-localization technique that utilized correlation-based heuristics. The technique and tool was called Tarantula. Tarantula uses the pass/fail statuses of test cases and the events that occurred during execution of each test case to offer the developer recommendations of what may be the faults that are causing test-case failures. The intuition of the approach is to find correlations between execution events and test-case outcomes --- those events that correlate most highly with failure are suggested as places to begin investigation.
Collaboration is becoming ubiquitious; at the same time the emergence of new technologies have been changing the landscape of interaction and collaboration. I am interested in the effect that information technologies have on collaboration and the development of new organizational practices such as network-centricity, group-to-group collaboration, nomadic work, and large-scale collaboration. I am also very interested in how Web 2.0 technologies (blogs, wikis, social-networking sites, etc.) are used in collaboration and how they can be integrated into the course of daily work.
We developed a token-based approach for large scale code clone detection which is based on a filtering heuristic that reduces the number of token comparisons when the two code blocks are compared. We also developed a MapReduce based parallel algorithm that uses the filtering heuristic and scales to thousands of projects. The filtering heuristic is generic and can also be used in conjunction with other token-based approaches. In that context, we demonstrated how it can increase the retrieval speed and decrease the memory usage of the index-based approaches.
Microtask crowdsourcing systems such as FoldIt and ESP partition work into short, self-contained microtasks, reducing barriers to contribute, increasing parallelism, and reducing the time to complete work. Could this model be applied to software development? To explore this question, we are designing a development process and cloud-based IDE for crowd development.
The development of a software system is now ever more frequently a part of a larger development effort, including multiple software systems that co-exist in the same environment: a software ecosystem. Though most studies of the evolution of software have focused on a single software system, there is much that we can learn from the analysis of a set of interrelated systems. Topic modeling techniques show promise for mining the data stored in software repositories to understand the evolution of a system.
One method of facilitating developers to understand the complex inner nature of software that we have employed is the use of information visualization. Software is often so complex that even the developers who initially created it cannot understand all of the possible runtime behaviors that it can exhibit --- specifically, all of the bugs that it may contain. In order to present large code bases with innumerable characteristics and relationships of its components (e.g., instructions, variables, values, and timings) we have developed a number of novel visualizations of software.
Sourcerer is an ongoing research project at the University of California, Irvine aimed at exploring open source projects through the use of code analysis. The existence of an extremely large body of open source code presents a tremendous opportunity for software engineering research. Not only do we leverage this code for our own research, but we provide the open source Sourcerer Infrastructure and curated datasets for other researchers to use.
The Sourcerer Infrastructure is composed of a number of layers.