Efficient and Reproducible Data Pipelines for Multi-Author Dataset Generation
| Thesis Type | Bachelor |
| Thesis Status | Open |
| Number of Students | 1 |
| Thesis Supervisor | |
| Contact | |
| Research Field | |
The goal of this thesis is to design and implement a generic, efficient data pipeline for creating multi-author datasets, as required for tasks such as authorship attribution and style change detection. Building on existing paragraph and sentence mixing approaches, the pipeline should follow a modular, stage-wise design that allows flexible and fully configurable dataset generation.
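As a rough illustration of what a modular, stage-wise and configurable design could look like, the following minimal Python sketch chains generator-based stages that are selected by name from a configuration dictionary. All names here (`split_paragraphs`, `mix_paragraphs`, `run_pipeline`, and the config keys `stages`, `seed`, `paragraphs_per_document`) are hypothetical and not part of any existing pipeline. The mixing step is deliberately naive: it shuffles paragraphs and groups them, without guaranteeing that each document contains several distinct authors.

```python
import random

# Hypothetical stage: split each single-author record into paragraph records.
def split_paragraphs(records, config):
    for rec in records:
        for para in rec["text"].split("\n\n"):
            para = para.strip()
            if para:
                yield {"author": rec["author"], "text": para}

# Hypothetical stage: shuffle paragraphs with a fixed seed and group them
# into documents of a configurable size, keeping the author of each paragraph.
def mix_paragraphs(records, config):
    rng = random.Random(config["seed"])  # seeded for reproducibility
    pool = list(records)
    rng.shuffle(pool)
    size = config["paragraphs_per_document"]
    for start in range(0, len(pool) - size + 1, size):
        chunk = pool[start:start + size]
        yield {
            "text": "\n\n".join(p["text"] for p in chunk),
            "authors": [p["author"] for p in chunk],
        }

# Registry that maps stage names used in the configuration to functions.
STAGES = {
    "split_paragraphs": split_paragraphs,
    "mix_paragraphs": mix_paragraphs,
}

def run_pipeline(records, config):
    """Chain the configured stages lazily and collect the final records."""
    stream = records
    for name in config["stages"]:
        stream = STAGES[name](stream, config)
    return list(stream)
```

Because every stage has the same signature, stages can be reordered, replaced, or extended (for example with a sentence-level mixer) by changing only the configuration.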
A key focus is efficiency and robustness: the system should support parallel processing, incremental data writing, and interruption-safe execution. In addition, it should provide built-in dataset validation, configurable logging, and the ability to perform quick dry runs on small test datasets.
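One possible way to combine parallel processing, incremental writing, and interruption-safe execution is sketched below, assuming the output is stored as JSON Lines and every record carries a unique `id` field. The function names (`recover_done_ids`, `run`) and the `limit` parameter for dry runs are hypothetical. A thread pool is used here for simplicity; for CPU-bound stages a process pool would likely be the better fit.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

def recover_done_ids(path):
    """Return the ids already written and drop a trailing partial line, if any."""
    done = set()
    if not os.path.exists(path):
        return done
    valid_end = 0
    with open(path, "rb") as fh:
        for raw in fh:
            if not raw.endswith(b"\n"):
                break  # incomplete last line from an interrupted write
            try:
                done.add(json.loads(raw)["id"])
            except (ValueError, KeyError, TypeError):
                break  # corrupted line: keep only what came before it
            valid_end += len(raw)
    with open(path, "r+b") as fh:
        fh.truncate(valid_end)
    return done

def run(items, process, path, workers=4, limit=None):
    """Process the remaining items in parallel and append each result at once.

    `process` must return a JSON-serializable dict that contains the item's id.
    `limit` caps the number of items, which allows a quick dry run.
    Returns the number of items processed in this call.
    """
    done = recover_done_ids(path)
    todo = [item for item in items if item["id"] not in done]
    if limit is not None:
        todo = todo[:limit]
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(path, "a", encoding="utf-8") as out:
        # pool.map yields results in input order while workers run in parallel.
        for result in pool.map(process, todo):
            out.write(json.dumps(result) + "\n")
            out.flush()
            os.fsync(out.fileno())  # make the record durable before continuing
    return len(todo)
```

Calling `run` again after an interruption skips every item whose id is already present in the output file, so at most the work that was in flight at the time of the interruption is repeated.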