We are happy to announce that our paper, "Can Third-Party Annotators Reliably Evaluate Conversational Recommender Systems?", has been accepted for presentation at the 34th ACM Conference on User Modeling, Adaptation and Personalization (UMAP), taking place in Gothenburg, Sweden.
The paper investigates how reliably third-party crowd workers can evaluate the user experience of conversational recommender systems. Analyzing over 1,000 annotations of dialogue logs, we show that while annotators can consistently judge utilitarian qualities such as accuracy, they struggle to reliably assess social constructs such as humanness and rapport. These findings challenge current evaluation practices and underscore the need for more robust, multi-rater protocols in the field.
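For readers curious what a multi-rater reliability check looks like in practice, here is a minimal sketch using Krippendorff's alpha, a common chance-corrected agreement coefficient that handles missing ratings. The choice of metric, the 5-point ordinal scale, and the ratings themselves are illustrative assumptions for this post, not details taken from the paper.

```python
# A minimal sketch of checking inter-annotator agreement with
# Krippendorff's alpha (illustrative only; the paper's exact metric
# and data are not reproduced here).
import numpy as np
import krippendorff  # pip install krippendorff

# Hypothetical 5-point Likert ratings: rows = annotators,
# columns = dialogue logs; np.nan marks items an annotator skipped.
ratings = np.array([
    [4, 5, 3, np.nan, 2],
    [4, 4, 3, 5,      2],
    [5, 4, 2, 5,      1],
])

# Treat the scale as ordinal so near-misses count less than far misses.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")  # values near 1 = high agreement
```

A protocol along these lines, with agreement computed per construct, is one way to surface the gap we observe between utilitarian qualities and social constructs.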