Cross-Topic Authorship Attribution on Social Q&A Platforms
Authorship attribution is a widely researched problem: automatically identifying the author of a previously unseen text document. To make this decision, systems have prior access to text samples from all candidate authors and learn from them, usually with machine learning methods. Although several datasets exist for testing such algorithms, most are restricted to a specific topic or genre and often contain only a few authors.
In this thesis, the social question and answer (Q&A) network StackExchange should be used to create a novel dataset for studying cross-topic authorship attribution. That is, given text samples of authors writing about topic A, an algorithm should be developed that identifies the correct authors even when they write about topic B. To create the dataset, the StackExchange API can be utilized, i.e., authors have to be found who write on several different StackExchange sites (e.g., on StackOverflow regarding programming issues, as well as on the Religion or Gaming site). With this dataset, a machine-learned model should be trained and evaluated using state-of-the-art techniques.
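To make the cross-topic setting concrete, the following minimal sketch trains on posts from one site and tests on posts by the same authors from another site. It is only an illustrative baseline under stated assumptions, not the method to be developed in the thesis: character trigram profiles compared by cosine similarity stand in for the machine-learned model, and all function names, author names, and example posts are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(posts):
    """Build one n-gram profile per author from (author, text) pairs."""
    profiles = {}
    for author, text in posts:
        profiles.setdefault(author, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the candidate author whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda author: cosine(profiles[author], query))


def cross_topic_accuracy(train_posts, test_posts):
    """Train on posts from topic A, report accuracy on posts from topic B."""
    profiles = train(train_posts)
    hits = sum(predict(profiles, text) == author for author, text in test_posts)
    return hits / len(test_posts)


# Hypothetical toy data: the same two authors on a programming site (topic A)
# and on a gaming site (topic B). Real experiments need far more text per author.
topic_a = [
    ("alice", "Well, I would simply wrap the call in a try block, honestly."),
    ("bob", "u need 2 check the null pointer first!! then it works fine"),
]
topic_b = [
    ("alice", "Well, I would simply skip that boss and level up first, honestly."),
    ("bob", "u need 2 farm the first dungeon!! then the boss is easy"),
]

print("cross-topic accuracy:", cross_topic_accuracy(topic_a, topic_b))
```

Character n-grams are used here because they mostly capture style (punctuation, function words, spelling habits) rather than topic vocabulary, which is the property a cross-topic model has to rely on. Any classifier could replace the similarity step.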