Evaluation and Comparison of Hadoop Technologies for Genetic Data Analyses

Thesis Type Master
Thesis Status
Student Clemens Banas
Thesis Supervisor
Research Field

As the volume of data in genetics is constantly increasing, scalable big data technologies are key to processing large genomic studies. The choice of a specific technology is crucial, and Apache Hadoop and Apache Spark are two promising candidates for meeting these demands. The aim of this thesis is to compare the advantages and disadvantages of these state-of-the-art technologies and to evaluate them on the three most important genetic data formats: FASTQ, BAM and VCF.