Efficient and Reproducible Data Pipelines for Multi-Author Dataset Generation

Thesis Type: Bachelor
Thesis Status: Open
Number of Students: 1
Thesis Supervisor
Contact
Research Field

The goal of this thesis is to design and implement a generic, efficient data pipeline for creating multi-author datasets, as required for tasks such as authorship attribution and style change detection. Building on existing paragraph-mixing and sentence-mixing approaches, the pipeline should follow a modular, stage-wise design that makes dataset generation flexible and fully configurable.
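To make the stage-wise idea concrete, the following is a minimal sketch, not a prescribed design. It assumes a corpus given as a mapping from author name to a list of paragraphs, and all names in it (MixConfig, mix_paragraphs, label_changes, run_pipeline) are hypothetical. Each stage is a generator that takes the output of the previous stage plus a shared configuration object, and a fixed seed makes the result reproducible.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MixConfig:
    """Hypothetical configuration; every field is an illustrative assumption."""
    authors_per_doc: int = 2
    paragraphs_per_doc: int = 4   # must be >= authors_per_doc
    num_docs: int = 3
    seed: int = 0


def mix_paragraphs(corpus, cfg):
    """Stage 1: build documents by mixing paragraphs from several authors."""
    rng = random.Random(cfg.seed)      # seeded RNG keeps runs reproducible
    all_authors = sorted(corpus)       # sorted for a stable sampling order
    for doc_id in range(cfg.num_docs):
        authors = rng.sample(all_authors, cfg.authors_per_doc)
        paragraphs = []
        for i in range(cfg.paragraphs_per_doc):
            # The first slots guarantee that every chosen author appears once.
            author = authors[i] if i < len(authors) else rng.choice(authors)
            paragraphs.append({"author": author,
                               "text": rng.choice(corpus[author])})
        rng.shuffle(paragraphs)
        yield {"id": doc_id, "paragraphs": paragraphs}


def label_changes(docs, cfg):
    """Stage 2: mark each paragraph boundary where the author changes."""
    for doc in docs:
        authors = [p["author"] for p in doc["paragraphs"]]
        changes = [int(a != b) for a, b in zip(authors, authors[1:])]
        yield dict(doc, changes=changes)


def run_pipeline(source, stages, cfg):
    """Chain the stages lazily; each one consumes the previous one's output."""
    data = source
    for stage in stages:
        data = stage(data, cfg)
    return data


if __name__ == "__main__":
    toy_corpus = {"alice": ["A1", "A2"], "bob": ["B1"], "carol": ["C1", "C2"]}
    for doc in run_pipeline(toy_corpus, [mix_paragraphs, label_changes],
                            MixConfig()):
        print(doc["id"], doc["changes"])
```

Because the stages only share the configuration object and a simple record format, a sentence-mixing stage could be swapped in for the paragraph-mixing one without touching the rest of the chain.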

A key focus is efficiency and robustness: the system should support parallel processing, incremental data writing, and interruption-safe execution. In addition, it should provide built-in dataset validation, configurable logging, and the ability to perform quick dry runs on small test datasets.
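One possible way to combine these requirements is sketched below, under several assumptions: records are written as JSON Lines, each record carries a unique "id", and build_document is a hypothetical stand-in for the actual mixing step. The function names (finished_ids, generate, validate) are illustrative as well. The sketch uses a thread pool to stay portable; CPU-bound mixing would more likely use concurrent.futures.ProcessPoolExecutor.

```python
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

log = logging.getLogger("pipeline")   # level and handlers are set by the caller


def build_document(doc_id):
    """Hypothetical stand-in for the expensive document-generation step."""
    return {"id": doc_id, "text": f"document {doc_id}"}


def finished_ids(out_path):
    """Collect the IDs already on disk, skipping lines damaged by an abort."""
    done = set()
    if out_path.exists():
        with out_path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    log.warning("skipping damaged line in %s", out_path)
    return done


def ends_cleanly(out_path):
    """True if the file is missing, empty, or ends with a newline."""
    if not out_path.exists() or out_path.stat().st_size == 0:
        return True
    with out_path.open("rb") as fh:
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) == b"\n"


def generate(out_path, doc_ids, workers=4, dry_run_limit=None):
    """Write the missing records incrementally; return how many were written."""
    ids = list(doc_ids)
    if dry_run_limit is not None:       # quick dry run on a small subset
        ids = ids[:dry_run_limit]
    done = finished_ids(out_path)
    todo = [i for i in ids if i not in done]
    log.info("%d records present, %d to generate", len(done), len(todo))
    repair = not ends_cleanly(out_path)
    with out_path.open("a", encoding="utf-8") as fh, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        if repair:
            fh.write("\n")              # terminate a half-written last line
        for record in pool.map(build_document, todo):
            fh.write(json.dumps(record) + "\n")
            fh.flush()                  # hand each record to the OS right away
    return len(todo)


def validate(out_path, expected_ids):
    """Return the sorted list of expected IDs that are missing from the file."""
    return sorted(set(expected_ids) - finished_ids(out_path))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    target = Path("dataset.jsonl")      # hypothetical output file
    generate(target, range(100), dry_run_limit=5)
    print(validate(target, range(5)))
```

Appending one record per line means an interrupted run loses at most the record being written, and a restart only regenerates the IDs that are not yet in the file.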