PubCAP: Crawling of Online Academic Publications and Distributed Analysis in Cluster Environments

Thesis Type Bachelor
Thesis Status Finished
Student Maximilian Gerhardt, Simon Klausner

This thesis describes the development of a framework for crawling and analyzing scientific documents on the internet. Many scientific documents and theses are available online and can be of great interest for scientific and academic work. Analyzing these documents and the links between them can also yield further information and related scientific material. The framework was therefore designed to support additional search and analysis tasks on these documents. Technologies such as Hadoop and MongoDB were used to make the framework highly scalable and distributed. About 1,900,000 records were imported from the DBLP library. As a result, 700,000 PDFs can be downloaded to the file system.
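
The import step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the sample XML only imitates the style of DBLP's XML export, the function names parse_records and pdf_candidates are hypothetical, and treating a link that ends in ".pdf" as a downloadable PDF is an assumption made for illustration. The sketch turns each publication entry into a plain dictionary, a shape that a document store such as MongoDB could hold, and then selects the records that carry a direct PDF link.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of DBLP's XML export; not data from the thesis.
SAMPLE_XML = """<dblp>
  <article key="journals/example/Doe10">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>An Example Paper</title>
    <year>2010</year>
    <ee>http://example.org/papers/doe10.pdf</ee>
  </article>
  <inproceedings key="conf/example/Roe11">
    <author>John Roe</author>
    <title>Another Example</title>
    <year>2011</year>
    <ee>http://example.org/abs/roe11</ee>
  </inproceedings>
</dblp>"""


def parse_records(xml_text):
    """Turn each publication element into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root:
        records.append({
            "key": elem.get("key"),
            "type": elem.tag,
            "title": elem.findtext("title"),
            "year": elem.findtext("year"),
            "authors": [a.text for a in elem.findall("author")],
            "links": [e.text for e in elem.findall("ee")],
        })
    return records


def pdf_candidates(records):
    """Keep records with at least one link that looks like a direct PDF URL.

    Assumption: a URL ending in ".pdf" points at a downloadable PDF.
    """
    return [
        r for r in records
        if any(link.lower().endswith(".pdf") for link in r["links"])
    ]


records = parse_records(SAMPLE_XML)
candidates = pdf_candidates(records)
print(len(records), "records,", len(candidates), "with a direct PDF link")
```

In a distributed setting, the same per-record filter could be applied independently to each record, which is what makes this kind of step a natural fit for a cluster framework such as Hadoop.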