Source Code Reuse Detection based on Syntax Trees
Within our research group, various approaches have been developed to find potential plagiarism in text documents. One promising algorithm detects plagiarism using only the grammatical structure of the text: the document is split into sentences, and a grammar tree is computed for each sentence, which then serves as the basis for further analysis.
Programming languages are likewise bound by predefined grammar rules, i.e., code has to adhere to these rules. The more complex the grammar, the more diverse the code that can be written and the more distinctive individual programming styles become. The aim of this thesis is to examine whether grammar analysis, as used in the existing algorithms for text, can also be applied to source code (e.g., Java) to reveal potential plagiarism. In addition to pure grammar, other metrics should be incorporated, e.g., names of variables, lines of code per method, etc., in order to improve the final accuracy. To evaluate the developed algorithm, public repositories should be crawled and used for extensive evaluations on different programming languages.
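To make the idea concrete, the following is a minimal sketch of how a purely structural comparison could be combined with additional metrics. It is an illustration under stated assumptions, not the group's existing algorithm: it uses Python's built-in `ast` module as a stand-in parser (a language such as Java would need its own parser), the function names `structure_profile`, `cosine_similarity` and `style_metrics` are hypothetical, and comparing parent-child node-type counts by cosine similarity is just one simple choice among many.

```python
import ast
from collections import Counter
from math import sqrt


def structure_profile(source: str) -> Counter:
    """Count (parent type, child type) pairs in the syntax tree.

    Only node types are recorded, so identifier names and literal
    values have no influence on the profile.
    """
    tree = ast.parse(source)
    profile = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            profile[(type(parent).__name__, type(child).__name__)] += 1
    return profile


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two count profiles (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_metrics(source: str) -> dict:
    """Two illustrative non-grammar metrics: identifier set and function length."""
    tree = ast.parse(source)
    identifiers = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    lengths = [f.end_lineno - f.lineno + 1 for f in functions]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return {"identifiers": identifiers, "mean_lines_per_function": mean_length}


# Hypothetical sample inputs: the second is the first with renamed identifiers,
# the third solves the same task with a different structure.
ORIGINAL = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

RENAMED = """
def summe(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

DIFFERENT = """
def total(values):
    return sum(values)
"""

if __name__ == "__main__":
    base = structure_profile(ORIGINAL)
    print("renamed:  ", cosine_similarity(base, structure_profile(RENAMED)))
    print("different:", cosine_similarity(base, structure_profile(DIFFERENT)))
    print("metrics:  ", style_metrics(ORIGINAL))
```

Because the profile ignores identifiers, renaming variables leaves the structural similarity at 1 (up to floating-point rounding), while the identifier sets of the two versions no longer overlap. This suggests why grammar-based features and name-based metrics may complement each other rather than duplicate information.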