
Towards generalisable hate speech detection

Last updated Dec 12, 2022

Reference

Yin, W., & Zubiaga, A. (2021). Towards generalisable hate speech detection: a review on obstacles and solutions. PeerJ Computer Science, 7, e598.


# Generalisation

Most if not all proposed hate speech detection models rely on supervised machine learning methods, where the ultimate purpose is for the model to learn the real relationship between features and predictions through training data, which generalises to previously unobserved inputs (Goodfellow, Bengio & Courville, 2016). The generalisation performance of a model measures how well it fulfils this purpose.
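As a minimal illustration of this setup, the sketch below fits a classifier on part of a dataset and scores it on a held-out split; the toy texts, labels, features, and model choice are placeholder assumptions for illustration, not choices made in the review.

```python
# Minimal sketch of how generalisation performance is usually estimated in this
# supervised setting: fit on one part of a dataset, score on a held-out part.
# The toy texts, labels, features and classifier below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = [
    "toy hateful post a", "toy hateful post b", "toy hateful post c", "toy hateful post d",
    "toy neutral post a", "toy neutral post b", "toy neutral post c", "toy neutral post d",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = hateful, 0 = not hateful (placeholder labels)

# Hold out part of the data so the evaluation reflects unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectoriser = TfidfVectorizer()
clf = LogisticRegression().fit(vectoriser.fit_transform(X_train), y_train)

# F1 on the held-out split is the usual estimate of generalisation performance,
# but it only reflects unseen inputs drawn from the *same* dataset.
print(f1_score(y_test, clf.predict(vectoriser.transform(X_test))))
```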

The ultimate purpose of studying automatic hate speech detection is to facilitate the alleviation of the harms brought by online hate speech. To fulfil this purpose, hate speech detection models need to be able to deal with the constant growth and evolution of hate speech, regardless of its form, target, and speaker.

#key-observation

Recent research has raised concerns about the generalisability of existing models (Swamy, Jamatia & Gambäck, 2019). Despite impressive performance on their respective test sets, models' performance drops significantly when they are applied to a different hate speech dataset. This means that the assumption that the test data of existing datasets represent the distribution of future cases does not hold, and that the generalisation performance of existing models has been severely overestimated (Arango, Pérez & Poblete, 2020). This lack of generalisability undermines the practical value of these hate speech detection models.

Note

This is the core of the research I am working on: the claim that models' generalisation performance is overestimated. I want to check whether similar results hold with Korean datasets and models.

# Data

[Example 1] For example, in Wiegand, Ruppenhofer & Kleinbauer (2019)'s study, FastText models (Joulin et al., 2017a) trained on three datasets (Kaggle, Founta, Razavi) achieved F1 scores above 70 when tested on one another, while models trained or tested on datasets outside this group achieved around 60 or less.
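A rough sketch of this kind of cross-dataset grid is given below: train on each dataset, evaluate on every other, and compare F1 scores. The loader, toy data, and the TF-IDF + logistic regression stand-in are assumptions for illustration; the cited study trained FastText classifiers on the actual datasets.

```python
# Sketch of a cross-dataset evaluation grid: train on each dataset, test on every
# other dataset, and compare the resulting F1 scores. Dataset names only echo the
# cited study; the loader and the toy data it returns are placeholders.
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def load_dataset(name):
    """Hypothetical loader; replace with real dataset reading.
    Returns (train_texts, train_labels, test_texts, test_labels)."""
    texts = ["toy hateful slur post", "toy harmless post"] * 3
    labels = [1, 0] * 3
    return texts, labels, texts, labels


datasets = ["kaggle", "founta", "razavi"]
splits = {name: load_dataset(name) for name in datasets}

scores = {}
for src, tgt in product(datasets, datasets):
    train_x, train_y, _, _ = splits[src]
    _, _, test_x, test_y = splits[tgt]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_x, train_y)
    # Off-diagonal cells (src != tgt) are where generalisation gaps show up.
    scores[(src, tgt)] = f1_score(test_y, model.predict(test_x), average="macro")

for (src, tgt), f1 in scores.items():
    print(f"train on {src:>7} -> test on {tgt:>7}: macro-F1 = {f1:.2f}")
```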

#key-observation

Founta and OLID also produced models that performed well on each other. The sources of such differences are usually traced back to search terms (Swamy, Jamatia & Gambäck, 2019), topics covered (Nejadgholi & Kiritchenko, 2020; Pamungkas, Basile & Patti, 2020), label definitions (Pamungkas & Patti, 2019; Pamungkas, Basile & Patti, 2020; Fortuna, Soler-Company & Wanner, 2021), and data source platforms (Glavaš, Karan & Vulić, 2020; Karan & Šnajder, 2018).

Fortuna, Soler & Wanner (2020) used averaged word embeddings (Bojanowski et al., 2017; Mikolov et al., 2018) to compute representations of the classes in different datasets, and compared these classes across datasets. One of their observations is that Davidson's "hate speech" is very different from Waseem's "hate speech", "racism", and "sexism", while being relatively close to HatEval's "hate speech" and Kaggle's "identity hate". This echoes experiments that showed poor generalisation of models from Waseem to HatEval (Arango, Pérez & Poblete, 2020) and between Davidson and Waseem (Waseem, Thorne & Bingel, 2018; Gröndahl et al., 2018).
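The sketch below illustrates the class-comparison idea under simplified assumptions: each class is represented by the average of the word embeddings of its posts, and two classes are compared with cosine similarity. The tiny embedding table and the example posts are placeholders; the cited work used pre-trained fastText-style vectors over the actual datasets.

```python
# Sketch of comparing classes across datasets via averaged word embeddings.
# The 3-dimensional "embeddings" and the example class contents are placeholders.
import numpy as np


def class_centroid(texts, embeddings, dim=3):
    """Average the word vectors of every in-vocabulary token in one class."""
    vectors = [embeddings[tok] for text in texts for tok in text.lower().split()
               if tok in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


emb = {"slur": np.array([1.0, 0.2, 0.0]),
       "attack": np.array([0.9, 0.3, 0.1]),
       "weather": np.array([0.0, 0.1, 1.0])}

# Placeholder posts standing in for, e.g., Davidson "hate speech" vs Waseem "racism".
datasetA_class = ["slur attack", "attack slur slur"]
datasetB_class = ["slur weather attack", "attack attack"]

print(cosine(class_centroid(datasetA_class, emb), class_centroid(datasetB_class, emb)))
```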

In terms of which properties of a dataset lead to more generalisable models, several factors are frequently mentioned (…)

Biases in the samples are also frequently mentioned. Wiegand, Ruppenhofer & Kleinbauer (2019) hold that less biased sampling approaches produce more generalisable models. This was later reproduced by Razo & Kübler (2020), and it also helps explain their results with the two datasets that have the fewest positive cases. Similarly, Pamungkas & Patti (2019) mentioned that a wider coverage of phenomena leads to more generalisable models.

Another way of looking at generalisation and similarity is by comparing differences between individual classes across datasets (Nejadgholi & Kiritchenko, 2020; Fortuna, Soler & Wanner, 2020; Fortuna, Soler-Company & Wanner, 2021), as opposed to comparing datasets as a whole.

# Obstacles to generalisable hate speech detection

Hate speech detection, which is largely focused on social media, shares challenges with other social media tasks and has some of its own when it comes to the grammar and vocabulary used. This user language style introduces challenges to generalisability at the data source, mainly by making it difficult to utilise common NLP pre-training approaches.

On social media, syntax use is generally more casual, such as the omission of punctuation (Blodgett & O’Connor, 2017). Alternative spelling and expressions are also used in dialects (Blodgett & O’Connor, 2017), to save space, and to provide emotional emphasis (Baziotis, Pelekis & Doulkeridis, 2017). Sanguinetti et al. (2020) provided extensive guidelines for studying such phenomena syntactically.

Qian et al. (2018) found that rare words and implicit expressions are the two main causes of false negatives; Van Aken et al. (2018) compared several models that used pre-trained word embeddings, and found that rare and unknown words were present in 30% of the false negatives on Wikipedia data and 43% on Twitter data.
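A sketch of this style of error analysis, under the assumption that we already have a model's false negatives and the vocabulary of its pre-trained embeddings, might look as follows; the example texts and vocabulary are placeholders.

```python
# Sketch: measure how many false negatives contain at least one token that is
# missing from the pre-trained embedding vocabulary (rare/unknown words).
def oov_false_negative_rate(false_negative_texts, embedding_vocab):
    """Fraction of false negatives containing at least one out-of-vocabulary token."""
    def has_oov(text):
        return any(tok not in embedding_vocab for tok in text.lower().split())
    if not false_negative_texts:
        return 0.0
    return sum(has_oov(t) for t in false_negative_texts) / len(false_negative_texts)


vocab = {"you", "are", "a", "person"}                      # placeholder embedding vocabulary
false_negatives = ["you are a cl0wn", "you are a person"]  # placeholder missed posts
print(oov_false_negative_rate(false_negatives, vocab))     # -> 0.5
```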

Indeed, BERT (Devlin et al., 2019) and its variants have recently demonstrated top performance in hate and abusive speech detection challenges (Liu, Li & Zou, 2019; Mishra & Mishra, 2019).
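A minimal sketch of what fine-tuning such a pre-trained model for binary hate speech classification typically looks like is given below, using the Hugging Face transformers API; the toy data, model name, and hyperparameters are placeholder assumptions, not the configurations of the cited systems.

```python
# Minimal sketch of fine-tuning a pre-trained BERT encoder for binary hate
# speech classification. Toy texts/labels and hyperparameters are placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

texts = ["toy hateful post", "toy harmless post"]  # placeholder data
labels = torch.tensor([1, 0])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few toy steps; real training iterates over batches and epochs
    outputs = model(**batch, labels=labels)  # subword tokenisation helps with rare words
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print(preds.tolist())
```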

It is particularly challenging to acquire labelled data for hate speech detection, as annotators need relevant knowledge or training. As a high-level, abstract concept, "hate speech" is subjective to judge, requiring extra care when processing annotations. Hence, datasets are usually small.

Moreover, different studies are based on varying definitions of "hate speech", as seen in their annotation guidelines (Table 5 in the paper). Despite all covering the same two main aspects (directly attacking or promoting hate towards a target), datasets vary in their wording, in what they consider a target (any group, minority groups, specific minority groups), and in their clarifications on edge cases. Davidson and HatEval both distinguished "hate speech" from "offensive language", whereas Waseem's guidelines mark a case as hateful if it "uses a sexist or racist slur", blurring the boundary between offensive and hateful. Additionally, as both HatEval and Waseem specified the types of hate (towards women and immigrants; racism and sexism), hate speech that fell outside these specific types was not included in their positive classes, while Founta and Davidson included any type of hate speech.
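One practical consequence is that cross-dataset work usually has to harmonise these differing label sets before comparison. A sketch of such a mapping is given below; the specific mapping choices are illustrative assumptions, not something prescribed by the review.

```python
# Sketch of the label harmonisation step that differing annotation schemes force
# on cross-dataset experiments: map each dataset's own labels onto a shared
# binary "hateful" / "not hateful" scheme. The exact mappings are illustrative.
LABEL_MAPS = {
    "davidson": {"hate speech": 1, "offensive language": 0, "neither": 0},
    "waseem":   {"racism": 1, "sexism": 1, "none": 0},
    "founta":   {"hateful": 1, "abusive": 0, "spam": 0, "normal": 0},
    "hateval":  {"hateful": 1, "not hateful": 0},
}


def to_binary(dataset, label):
    """Translate a dataset-specific label into the shared binary scheme."""
    return LABEL_MAPS[dataset][label]


print(to_binary("davidson", "offensive language"))  # -> 0, offensive but not hateful
print(to_binary("waseem", "sexism"))                # -> 1, counted as hate in Waseem
```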