Authorship Obfuscated Vector Representations for Text Mining

Thesis Type Master
Thesis Status
Student Daniel Egger
Thesis Supervisor
Research Field

Analyzing a text’s linguistic style can threaten the privacy of authors who wish to conceal their identity. Automated authorship attribution methods based on text mining techniques are becoming increasingly accurate. Still, research on authorship obfuscation shows that authorship attribution methods can be disrupted by altering the linguistic style of texts. Currently, research on authorship obfuscation methods focuses mainly on producing obfuscated but human-readable versions of the input texts. State-of-the-art authorship obfuscation methods struggle with a trade-off between obfuscation safety and human readability, and often need to sacrifice safety to keep obfuscated texts human-readable. The generation of obfuscated text representations for use in text mining tasks such as topic classification or sentiment analysis is less explored. In text mining, texts are commonly represented as numerical vectors. Since human readability is not a concern in that case, obfuscation methods can focus on obfuscation safety. This thesis develops and explores methods for generating obfuscated text vector representations for use in utility text mining tasks, that is, tasks other than authorship attribution and author profiling. The discussed methods are evaluated with regard to their safety against authorship attribution attacks and their accuracy in utility text mining tasks.
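
To make the notion of a numerical text vector concrete, the following is a minimal sketch of a bag-of-words representation. It is not the method developed in the thesis. The toy corpus, the function-word list, and the names `build_vocabulary` and `vectorize` are assumptions introduced only for illustration. The second vocabulary drops function words, which are often treated as style markers, to hint at how a vector can keep topical content while discarding some stylistic signal. Whether such a step offers any real protection is exactly the kind of question the evaluation has to answer.

```python
from collections import Counter

# Hypothetical toy corpus; not taken from the thesis.
DOCS = [
    "the match was great and the team played well",
    "i think that the film was rather dull",
]

# Hypothetical, tiny function-word list used only for illustration.
FUNCTION_WORDS = {"the", "and", "was", "i", "that", "rather"}


def build_vocabulary(docs, exclude=frozenset()):
    """Return a sorted list of all whitespace tokens in docs, minus excluded ones."""
    tokens = {tok for doc in docs for tok in doc.split()}
    return sorted(tokens - set(exclude))


def vectorize(doc, vocabulary):
    """Map a document to a list of term counts aligned with vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


# Plain bag-of-words vectors over every token in the corpus.
full_vocab = build_vocabulary(DOCS)
full_vectors = [vectorize(d, full_vocab) for d in DOCS]

# Illustrative variant: the same vectors with function words left out.
content_vocab = build_vocabulary(DOCS, exclude=FUNCTION_WORDS)
content_vectors = [vectorize(d, content_vocab) for d in DOCS]
```

A utility task such as topic classification would consume `content_vectors` directly, since no human ever needs to read them.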