
Hate speech detection is not as easy as you may think

Last updated Dec 12, 2022

Reference

Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). Association for Computing Machinery, New York, NY, USA, 45–54. https://doi.org/10.1145/3331184.3331262


# Introduction

Despite the apparent difficulty of the hate speech detection problem, as evidenced by the struggles of social-media providers, current state-of-the-art approaches reported in the literature show near-perfect performance. Within-dataset experiments on labeled hate speech datasets using supervised learning achieve F1 scores above 93% [3, 7–9]. Nevertheless, only a few studies examine how well the resulting models generalize beyond the data collections on which they were built, or which factors affect this property [10]. Furthermore, recent surveys of the field also view the state of the art in a more conservative and cautious light [3, 10].
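
To make the within-dataset setting concrete, the sketch below trains and evaluates a classifier on splits drawn from the same labeled collection, which is the setup behind the near-perfect F1 scores cited above. The TF-IDF + logistic regression pipeline and the tiny placeholder corpus are assumptions made purely for illustration; the cited papers use deep models and far larger datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy placeholder corpus standing in for a labeled hate speech dataset
# (1 = hateful, 0 = not hateful); real experiments use thousands of tweets.
texts = [
    "nobody wants your kind around here",
    "what a lovely day outside",
    "go back to where you came from",
    "looking forward to the weekend",
    "people like you should just leave",
    "great match last night",
    "you people are the worst",
    "this coffee shop is amazing",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Within-dataset evaluation: train and test splits come from the same source.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print("within-dataset F1:", f1_score(y_test, preds))
```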

Aim of this paper

“(…) measure how these models would perform on similar yet different datasets.”

There is some recent work on testing how well state-of-the-art methods generalize to other datasets and domains [8–10], most of it focused on deep learning methods. Agrawal and Awekar [8] test models trained on tweets [11] when classifying Wikipedia data [33] and Formspring data [34], showing that transfer learning from Twitter to the two other domains performs poorly, achieving less than 10% F1. In a similar study, Dadvar and Eckert [9] perform transfer learning from Twitter to a dataset of YouTube comments [35], reporting 15% F1. Gröndahl et al. [10] present a comprehensive study reproducing several state-of-the-art models. Especially important for us is their experiment transferring Badjatiya et al.'s model [7], trained on Waseem and Hovy's dataset [11], to two other similarly labeled tweet datasets [16, 17]. Even in this case, performance drops significantly, to 33% and 47% F1 on those sets, a drop of more than 40 percentage points from the 93% F1 reported by Badjatiya et al. [7]. From these results, Gröndahl et al. [10] conclude that model architecture matters less than the type of data and the labeling criteria being used. Our results in this paper are consistent with those of Gröndahl et al. [10]; however, we take the research a step further by investigating why this issue occurs.
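
The cross-dataset (transfer) setting examined in these studies can be sketched as follows: fit the feature extractor and classifier on one collection, then measure F1 on a differently sourced collection without any further training. The simple classifier, the toy corpora, and the variable names are hypothetical stand-ins; the cited experiments transfer deep models between the full Twitter, Wikipedia, Formspring, and YouTube datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical source corpus (e.g. tweets) and target corpus (e.g. forum
# comments); 1 = hateful, 0 = not hateful.
source_texts = [
    "go back to where you came from",
    "lovely weather today",
    "nobody wants your kind here",
    "great game last night",
]
source_labels = [1, 0, 1, 0]

target_texts = [
    "people like you ruin this forum",
    "thanks for the helpful answer",
]
target_labels = [1, 0]

# Fit the feature extractor and classifier on the source dataset only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(source_texts), source_labels)

# Score on the target dataset with no further training; in the cited
# experiments this is where the reported F1 drops sharply.
preds = clf.predict(vectorizer.transform(target_texts))
print("cross-dataset F1:", f1_score(target_labels, preds))
```

Keeping the vectorizer fit only on the source data is what makes this a genuine transfer test; refitting it on the target collection would leak target-domain vocabulary into the features.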