Sentence Boundary Disambiguation in Colloquial Texts

Thesis Type Bachelor
Thesis Status
Student Sebastian Hepp
Thesis Supervisor

Sentence boundary disambiguation (SBD) is the task of splitting a text into individual sentences. The task can be approached in a number of ways, using machine learning models such as hidden Markov models, neural networks, and support vector machines. Most existing SBD models are trained on "proper" texts such as newspaper articles or novels. Less work has been done on models for more colloquial texts that do not strictly follow the grammatical or orthographic rules of the language.
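
To illustrate why colloquial input is harder, the following is a minimal sketch of a naive rule-based splitter. It is not the model developed in this thesis. The function name `naive_split` and the sample strings are assumptions made up for illustration, and the rule (split after `.`, `!`, or `?` when followed by whitespace and an uppercase letter) is only a simple baseline.

```python
import re

# Hypothetical baseline rule: a boundary is whitespace that follows
# '.', '!' or '?' and precedes an uppercase ASCII letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text):
    """Split text into sentences using the punctuation rule above."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


# Invented sample inputs, not taken from the thesis data set.
formal = "The meeting ended late. Nobody complained."
abbreviation = "Dr. Smith arrived."
colloquial = "hey can you set a timer ok thanks whats the weather"

print(naive_split(formal))        # two sentences, as intended
print(naive_split(abbreviation))  # wrongly split after the abbreviation "Dr."
print(naive_split(colloquial))    # no punctuation or capitals: nothing is split
```

The rule works on the well-formed sample, splits incorrectly after the abbreviation, and finds no boundary at all in the unpunctuated, lowercase sample. Ambiguous or missing cues of this kind are what a trained model has to resolve.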

The goal of this thesis is to build an SBD model that is optimized for colloquial texts. For this purpose, transcripts of user interactions with a digital assistant are used as the data set.