Mixing Algorithms for the Creation of Synthetic Multi-Author Datasets

Thesis Type Bachelor
Thesis Status
Student Johannes Mario Hammerer
Thesis Supervisor
Research Field

The goal of multi-author analysis is to investigate methods to analyze and characterize the writing style of authors. Multi-author analysis can pave the way for tasks like detecting the positions at which the author changes, or authorship attribution (determining the author of a given text). Developing and training models for multi-author analysis requires a sufficient amount of training data: texts written by multiple authors, with labels specifying the author of each section. The goal of this thesis is to devise a paragraph and sentence mixing framework that allows datasets of varying complexity to be created flexibly for the task of style change detection (i.e., determining the exact positions at which the author changes, based on the authors' stylistic fingerprints). This includes introducing sophisticated methods for mixing paragraphs and sentences of different authors, for instance based on text similarity properties: the more similar the assembled paragraphs of different authors are, the harder it becomes to detect a style change between them.
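
To make the idea of similarity-driven mixing concrete, the following is a minimal sketch and not the framework developed in the thesis. It assumes a hypothetical corpus format (a dict mapping each author to a list of paragraphs) and uses word-set Jaccard similarity purely as a stand-in for whatever text similarity properties the thesis actually employs. The names `jaccard` and `mix_paragraphs` and the `prefer_similar` parameter are illustrative assumptions.

```python
import random
import re


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity (a stand-in measure, assumed for illustration)."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def mix_paragraphs(corpus, length, prefer_similar=True, seed=0):
    """Assemble one synthetic multi-author document.

    corpus: dict mapping author name -> list of paragraphs (hypothetical format).
    Returns (paragraphs, authors, change_positions), where change_positions
    lists every index i whose paragraph has a different author than index i - 1.
    """
    rng = random.Random(seed)
    pool = [(author, text) for author, texts in corpus.items() for text in texts]
    if length < 1 or length > len(pool):
        raise ValueError("length must be between 1 and the number of paragraphs")

    # Start from a random paragraph, then greedily append the remaining
    # paragraph that is most (harder task) or least (easier task) similar
    # to the previously chosen one.
    first = rng.choice(pool)
    pool.remove(first)
    document = [first]
    while len(document) < length:
        previous_text = document[-1][1]

        def similarity(item):
            return jaccard(previous_text, item[1])

        chosen = max(pool, key=similarity) if prefer_similar else min(pool, key=similarity)
        pool.remove(chosen)
        document.append(chosen)

    authors = [author for author, _ in document]
    paragraphs = [text for _, text in document]
    change_positions = [
        i for i in range(1, len(document)) if authors[i] != authors[i - 1]
    ]
    return paragraphs, authors, change_positions
```

Calling `mix_paragraphs(corpus, length=5, prefer_similar=True)` would yield a document whose neighbouring paragraphs share much vocabulary, which under this sketch's assumptions corresponds to a harder style change detection instance; setting `prefer_similar=False` produces an easier one. The returned `authors` list and `change_positions` serve as the per-section labels that training data for this task requires.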