2017  |  2016  |  2015  |  2014  |  2013  |  2012  |  2011  |  2010  |  2009  |  2008  |  2007  |  2006  |  2005  |  2004  |  2003  |  2002  |  2001  |  2000

Clone Mapping Research

Previous studies have shown that there is a non-trivial amount of duplication in source code. We analyzed a corpus of 2.6 million non-fork projects hosted on GitHub representing over 258 million files written in Java, C++ Python and JavaScript. We found that this corpus has a mere 54 million unique files. In other words, 79% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication, only 7% of the files are distinct.

Project Dates: 
January 2017