Evaluation and Comparison of Hadoop Technologies for Genetic Data Analyses

Thesis Type Master
Thesis Status Finished
Student Clemens Banas

As data volumes in genetics keep increasing, scalable big data technologies are key to processing large genomic studies. Selecting a specific technology is a crucial decision, and Apache Hadoop and Apache Spark are two promising candidates for meeting these demands. The aim of this thesis is to compare the advantages and disadvantages of these state-of-the-art technologies and to evaluate them on the three most important genetic data formats: FASTQ, BAM, and VCF.