Efficient and Reproducible Data Pipelines for Multi-Author Dataset Generation

Thesis Type	Bachelor
Thesis Status	Open
Number of Students	1
Thesis Supervisor	Assoc. Prof. Dr. Eva Zangerle
Contact	eva.zangerle@uibk.ac.at
Research Field	Authorship Analysis and Cross-Language Grammar Features

The goal of this thesis is to design and implement a generic, efficient data pipeline for creating multi-author datasets, as required for tasks such as authorship attribution and style change detection. Building on existing paragraph- and sentence-mixing approaches, the pipeline should follow a modular, stage-wise design in which each stage can be configured independently, so that dataset generation is flexible and reproducible.
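
To illustrate what a modular, stage-wise design could look like, the following is a minimal Python sketch, not a prescribed architecture. All names (MixConfig, mix_paragraphs, drop_single_author, run_pipeline) are hypothetical, and the assumption that a document is a list of (author, paragraph) pairs is made only for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Assumption for this sketch: a document is a list of (author_id, paragraph) pairs.
Document = list[tuple[str, str]]
Stage = Callable[[Iterable[Document]], Iterator[Document]]


@dataclass(frozen=True)
class MixConfig:
    """Hypothetical configuration for the paragraph-mixing stage."""
    num_authors: int = 2
    paragraphs_per_document: int = 4
    seed: int = 0


def mix_paragraphs(corpus: dict[str, list[str]], config: MixConfig,
                   count: int) -> Iterator[Document]:
    """Generate `count` documents by sampling paragraphs from several authors."""
    rng = random.Random(config.seed)  # fixed seed keeps the output reproducible
    for _ in range(count):
        authors = rng.sample(sorted(corpus), config.num_authors)
        document: Document = []
        for _ in range(config.paragraphs_per_document):
            author = rng.choice(authors)
            document.append((author, rng.choice(corpus[author])))
        yield document


def drop_single_author(documents: Iterable[Document]) -> Iterator[Document]:
    """Filter stage: keep only documents written by more than one author."""
    for document in documents:
        if len({author for author, _ in document}) > 1:
            yield document


def run_pipeline(source: Iterable[Document],
                 stages: list[Stage]) -> Iterable[Document]:
    """Chain the stages lazily; each stage consumes the previous stage's output."""
    for stage in stages:
        source = stage(source)
    return source
```

Because every stage is a generator, documents stream through the pipeline one at a time, and stages can be added, removed, or reordered through configuration without touching the others.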

A key focus lies on efficiency and robustness: the system should support parallel processing, incremental data writing, and interruption-safe execution. In addition, it should provide built-in dataset validation, configurable logging, and the ability to perform quick dry runs on small test datasets.
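
One possible way to combine incremental writing with interruption-safe execution is sketched below, using only the Python standard library. It assumes a JSON Lines output file and a record stream that is produced in the same order on every run (for example, by a seeded generator); both assumptions, as well as the function names, are illustrative and not part of the thesis specification.

```python
import json
import os
from pathlib import Path
from typing import Iterable


def count_complete_records(path: Path) -> int:
    """Count newline-terminated records and cut off a partial trailing line."""
    if not path.exists():
        return 0
    data = path.read_bytes()  # fine for a sketch; large files would need streaming
    complete = data[: data.rfind(b"\n") + 1]  # empty if the file has no newline
    if len(complete) != len(data):
        # An interrupted run left a half-written record; remove it.
        with path.open("r+b") as handle:
            handle.truncate(len(complete))
    return complete.count(b"\n")


def write_incrementally(records: Iterable[dict], path: Path) -> int:
    """Append records one by one, skipping those already on disk.

    Returns the number of records written during this call.
    """
    done = count_complete_records(path)
    written = 0
    with path.open("a", encoding="utf-8") as handle:
        for index, record in enumerate(records):
            if index < done:
                continue  # already written by an earlier, interrupted run
            handle.write(json.dumps(record) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the record to disk before continuing
            written += 1
    return written
```

Calling write_incrementally again after an interruption resumes after the last complete record. Syncing after every record trades throughput for safety; a real pipeline would likely sync in batches and track progress per worker when running in parallel.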