Mixing Algorithms for the Creation of Synthetic Multi-Author Datasets

Thesis Type	Bachelor
Thesis Status	Finished
Student	Johannes Mario Hammerer
Init	08.03.2022 12:00
Final	27.09.2022 12:00
Start	19.01.2022 12:00
Thesis Supervisor	Maximilian Mayerl, MSc; Prof. Dr. Günther Specht
Contact	maximilian.mayerl@uibk.ac.at
Research Field	Authorship Analysis and Cross-Language Grammar Features

The goal of multi-author analysis is to investigate methods for analyzing and characterizing the writing style of authors. It paves the way for tasks such as style change detection (finding the positions in a text at which the author changes) and authorship attribution (determining the author of a given text). Developing and training models for multi-author analysis requires a sufficient amount of training data: texts written by multiple authors, with labels specifying the author of each section. The goal of this thesis is to devise a paragraph and sentence mixing framework that allows the flexible creation of datasets of varying complexity w.r.t. the task of detecting style changes (i.e., determining the exact positions at which the author changes, based on the stylistic fingerprints of the authors). This includes introducing sophisticated methods for mixing paragraphs and sentences of different authors, for instance based on text similarity properties: the more similar the assembled paragraphs of different authors are, the harder it becomes to detect a style change between them.
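
To make the idea of similarity-based mixing concrete, the following is a minimal sketch and not the framework developed in the thesis. It assumes that each paragraph comes with an author label, uses Jaccard overlap of word sets as a stand-in similarity measure, and chains paragraphs greedily. The names `tokens`, `jaccard`, `mix_paragraphs` and `POOL` are hypothetical and chosen for illustration only.

```python
import re


def tokens(text):
    """Return the set of lowercased word tokens in a paragraph."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two paragraphs (assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mix_paragraphs(pool, length, hard=True):
    """Greedily assemble a synthetic multi-author document.

    pool   -- list of (author, paragraph) pairs
    length -- maximum number of paragraphs in the document
    hard   -- True: append the paragraph most similar to the previous one
              (style changes are harder to spot); False: the least similar
    """
    if not pool or length < 1:
        raise ValueError("need a non-empty pool and length >= 1")
    remaining = list(pool)
    chosen = [remaining.pop(0)]  # start with the first paragraph in the pool
    pick = max if hard else min
    while remaining and len(chosen) < length:
        prev_text = chosen[-1][1]
        nxt = pick(remaining, key=lambda item: jaccard(prev_text, item[1]))
        remaining.remove(nxt)
        chosen.append(nxt)
    authors = [author for author, _ in chosen]
    # A style change occurs at index i if paragraph i has a different
    # author than paragraph i - 1.
    changes = [i for i in range(1, len(authors)) if authors[i] != authors[i - 1]]
    return {
        "paragraphs": [text for _, text in chosen],
        "authors": authors,
        "changes": changes,
    }


# Toy input; real data would be paragraphs from documents with known authors.
POOL = [
    ("A", "the cat sat on the mat"),
    ("B", "the cat sat on the rug"),
    ("B", "quantum fields interact weakly"),
    ("A", "a dog ran in the park"),
]
```

Calling `mix_paragraphs(POOL, 3, hard=True)` places lexically close paragraphs of different authors next to each other, while `hard=False` prefers dissimilar neighbours. The returned `authors` and `changes` lists play the role of the per-section labels that a style change detection model would be trained on.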