Information Extraction from Medical Reports of Patients with Multiple Myeloma Using Machine Learning

Thesis Type Master
Thesis Status
Student Jan Schlenker
Thesis Supervisor
External Supervisor
Univ.-Prof. Dr. Bernhard Holzner, Priv.-Doz. Dr. Gerhard Rumpold
Research Field

The majority of Austrian medical reports are still written as free narrative text, which makes the automated extraction of important medical information difficult. In recent years, Machine Learning (ML) approaches have been applied successfully to this problem. This thesis evaluates two ML algorithms, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), on Austrian reports from patients with multiple myeloma, with the goal of extracting the severity of the disease, the type of myeloma, and cytogenetic anomalies. The required training data is created with a custom token tagging program, whose development is also part of this work. Results show that CRFs generally outperform SVMs, with top F1 scores of 0.928 versus 0.765. These high scores indicate that clinics or clinical partners in Austria may use ML approaches to extract further medical information and to support studies.
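
The F1 scores quoted above combine precision and recall into a single number. The following is a minimal sketch of how such a score can be computed for one label over tagged tokens. The tag names (TYPE, STAGE, O), the example sequences, and the choice of token-level evaluation are assumptions made for illustration; the evaluation protocol actually used in the thesis is not described here.

```python
def f1_score(gold, predicted, label):
    """Token-level F1 for a single label.

    gold and predicted are equal-length lists of tags, one tag per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: the label was predicted and is correct.
    tp = sum(1 for g, p in pairs if g == label and p == label)
    # False positives: the label was predicted but the gold tag differs.
    fp = sum(1 for g, p in pairs if g != label and p == label)
    # False negatives: the gold tag is the label but it was not predicted.
    fn = sum(1 for g, p in pairs if g == label and p != label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model output for six tokens.
gold = ["O", "TYPE", "TYPE", "O", "STAGE", "O"]
pred = ["O", "TYPE", "O", "O", "STAGE", "STAGE"]

if __name__ == "__main__":
    for label in ("TYPE", "STAGE"):
        print(label, round(f1_score(gold, pred, label), 3))
```

In this toy example the TYPE label has perfect precision but misses one token, while STAGE has perfect recall but one spurious prediction, so both end up with the same F1 even though their error profiles differ.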