Real-time Twitter Sentiment Classification based on Apache Storm

Thesis Type Master
Thesis Status
Student Martin Illecker
Thesis Supervisor
Research Field

The main goal of this master’s thesis is to integrate sentiment classification techniques into a real-time processing system. To this end, it presents an approach called SentiStorm, which is based on Apache Storm and uses several machine learning techniques to identify the sentiment of a tweet. SentiStorm uses Part-of-Speech (POS) tags, Term Frequency–Inverse Document Frequency (TF-IDF) and multiple sentiment lexica to extract a feature vector from a tweet. This feature vector is then processed by a Support Vector Machine (SVM), which predicts the sentiment based on a model learned from a training dataset.
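The feature extraction step can be illustrated with a minimal sketch. It is not the SentiStorm implementation: it assumes whitespace tokenization, a smoothed IDF variant, and a tiny hypothetical lexicon (TOY_LEXICON), and it leaves out POS tagging and the SVM itself. All function names are chosen for illustration only. The sketch shows how TF-IDF weights and a lexicon score can be combined into one feature vector of the kind a classifier would consume.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (assumption); real sentiment lexica are far larger.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}


def tokenize(text):
    """Lowercase and split on whitespace (a simplification for tweets)."""
    return text.lower().split()


def inverse_document_frequency(corpus_tokens):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def feature_vector(text, vocabulary, idf, lexicon=TOY_LEXICON):
    """Return the TF-IDF weights over `vocabulary` plus one lexicon score."""
    tokens = tokenize(text)
    term_freq = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    tfidf = [term_freq[term] / total * idf[term] for term in vocabulary]
    lexicon_score = sum(lexicon.get(token, 0.0) for token in tokens)
    return tfidf + [lexicon_score]


# Usage on a three-document toy corpus (made-up example sentences).
corpus = [
    "great phone good battery",
    "awful service bad support",
    "good service",
]
corpus_tokens = [tokenize(doc) for doc in corpus]
idf = inverse_document_frequency(corpus_tokens)
vocabulary = sorted(idf)
vec = feature_vector("good phone", vocabulary, idf)
```

In a full pipeline, a vector like vec would be extended with POS-based features and passed to a trained SVM for prediction.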

Finally, this thesis presents an evaluation of SentiStorm based on the Semantic Evaluation (SemEval) dataset of 2013. The quality evaluation shows that SentiStorm is comparable with state-of-the-art sentiment classification systems. In addition to its high prediction quality, the performance results demonstrate that this sentiment classification can also be run in real time.