ISR Projects by Pedro Ribeiro Martins

Previous studies have shown that there is a non-trivial amount of duplication in source code. We analyzed a corpus of 2.6 million non-fork projects hosted on GitHub, representing over 258 million files written in Java, C++, Python, and JavaScript. We found that this corpus contains only 54 million unique files. In other words, 79% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems: JavaScript has the highest rate of file duplication, with only 7% of its files being distinct.
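As a rough illustration of the arithmetic behind the 79% figure, the sketch below computes the share of clones from a total file count and a unique file count. It also shows one possible way to count unique files in a small local collection. The function names `clone_share` and `duplication_rate` are hypothetical, and treating two files as clones only when their bytes are identical is an assumption of this sketch; the summary above does not state which similarity criterion the study used.

```python
import hashlib
from pathlib import Path


def clone_share(total_files, unique_files):
    """Share of files that are copies of some other file in the corpus."""
    return (total_files - unique_files) / total_files


def duplication_rate(paths):
    """Return (total, unique, clone share) for the given file paths.

    Assumption of this sketch: two files are clones only if their bytes
    are identical, so each file is reduced to a SHA-256 digest.
    """
    digests = [hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths]
    total = len(digests)
    unique = len(set(digests))
    share = clone_share(total, unique) if total else 0.0
    return total, unique, share


# Rounded figures quoted above: 258 million files, 54 million unique.
print(f"{clone_share(258_000_000, 54_000_000):.0%}")  # prints 79%
```

Exact hashing of this kind would only catch byte-for-byte copies; files that differ in whitespace or comments would count as distinct, so a rate measured this way is best read as a lower bound.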

Project Dates: January 2017