Text Mining on Wikipedia

Thesis Type: Master

Text mining deals with several research problems, including authorship attribution (who wrote it), author profiling (what is the author like), and automatic segmentation (which parts of a text document were written by which author). Several data sets exist for evaluating algorithms for these tasks, but they have limitations (e.g., only a few authors or documents are available for selected topics).

Wikipedia is a rich, freely available source of text documents that also provides information about their authors. It is therefore generally well suited to text mining problems, but it has one major drawback: many Wikipedia articles are written and edited by multiple writers (e.g., one sentence is deleted, another word is inserted, a link is altered, etc.). As a result, the real author of an article, or even of a single paragraph, is often ambiguous.

TokTrack, a recently published database, stores all edits of every token (word) in all Wikipedia articles (e.g., the second word was originally written by author A, then deleted by editor B, reinserted by editor C, and finally moved to another paragraph by editor D). Using TokTrack, the original authorship of Wikipedia articles can be reconstructed with high accuracy.

The aim of this thesis is to use TokTrack to create a new data set from Wikipedia that contains text passages and their corresponding authors. Finally, the longer passages that have never been edited should be used to evaluate several text mining algorithms, including authorship attribution and author profiling.
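
To make the intended extraction step more concrete, the following is a minimal sketch in Python, not a definitive implementation. It assumes that each token has already been reduced to a simplified record; the `Token` class, its `author` and `edit_count` fields, the `extract_passages` function, and the `min_length` threshold are all hypothetical names chosen for illustration and do not reflect the actual TokTrack schema. The sketch groups consecutive tokens that share the same original author and were never edited, and keeps only runs of a minimum length.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Token:
    """Simplified, hypothetical token record (not the real TokTrack schema)."""
    text: str
    author: str      # assumed: editor who originally wrote the token
    edit_count: int  # assumed: number of later deletions/reinsertions/moves


def extract_passages(tokens, min_length=3):
    """Return (author, passage) pairs for runs of unedited tokens by one author.

    A run is a maximal sequence of consecutive tokens that share the same
    original author and have edit_count == 0. Runs shorter than min_length
    tokens are discarded.
    """
    passages = []
    # groupby only merges *adjacent* tokens with an equal key, which is what
    # we want: a passage must be contiguous in the article text.
    for (author, untouched), run in groupby(
        tokens, key=lambda t: (t.author, t.edit_count == 0)
    ):
        run = list(run)
        if untouched and len(run) >= min_length:
            passages.append((author, " ".join(t.text for t in run)))
    return passages


# Toy example with made-up tokens and authors.
example = [
    Token("Text", "A", 0), Token("mining", "A", 0),
    Token("is", "A", 0), Token("fun", "A", 0),
    Token("and", "B", 0),
    Token("very", "A", 2),
    Token("useful", "A", 0), Token("indeed", "A", 0), Token("today", "A", 0),
]
result = extract_passages(example, min_length=3)
```

In this toy input, the single token by author B and the edited token by author A are both dropped, so only contiguous, unedited runs of at least three tokens remain. A real data set would likely measure passage length in sentences or characters rather than tokens; the threshold here is only a placeholder.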