
On Cross-Dataset Generalization in Automatic Detection of Online Abuse

Last updated Dec 12, 2022

Reference

Nejadgholi, I., & Kiritchenko, S. (2020). On cross-dataset generalization in automatic detection of online abuse. arXiv preprint arXiv:2010.07414.


# Research Questions

Test and training sets were created for each dataset with a stratified 80/20 split, the larger part being used to train the models. The training sets were further subdivided, holding out 1/8 of each as a separate validation set for development and hyper-parameter tuning.
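A minimal scikit-learn sketch of this split protocol; the corpus and labels below are placeholders, not the paper's data:

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus; each dataset in the paper gets the same treatment.
texts = [f"example document {i}" for i in range(1000)]
labels = [1 if i % 10 == 0 else 0 for i in range(1000)]  # ~10% positive class

# Stratified 80/20 split; the larger part trains the models.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)

# Hold out 1/8 of the training portion as a validation set for tuning.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=1 / 8, stratify=train_labels,
    random_state=42)
```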

(…) the aim here, in contrast, was to see how well the best models (which may have learnt some dataset-specific biases) performed on other datasets. This was done to investigate how well state-of-the-art systems perform in a real-life scenario, i.e., when exposed to data from other domains, with the hypothesis that a model trained on one dataset which shows comparatively reasonable results on other datasets can be expected to generalise well.

# Experimental Method

To explore how well the Toxic class from the Wiki-dataset generalizes to other types of offensive behaviour, we train a binary classifier (Toxic vs. Normal) on the Wiki-dataset (combining the train, development and test sets) and test it on the Out-of-Domain Test set. This classifier is expected to predict a positive (Toxic) label for the instances of classes Founta-Abusive, Founta-Hateful, Waseem-Sexism and Waseem-Racism, and a negative (Normal) label for the tweets in the Founta-Normal class. We fine-tune a BERT-based classifier (Devlin et al., 2019) with a linear prediction layer, a batch size of 16 and a learning rate of 2 × 10⁻⁵ for 2 epochs.
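A hedged sketch of that setup with Hugging Face `transformers`. The paper's code is not shown here, so the model checkpoint, output path, max length and placeholder data are assumptions; batch size, learning rate and epoch count follow the quoted values:

```python
import torch
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class ToxicDataset(Dataset):
    """Wraps tokenized texts and binary labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder data; in the paper this would be the combined Wiki-dataset splits.
train_texts = [f"comment {i}" for i in range(64)]
train_labels = [i % 2 for i in range(64)]  # 1 = Toxic, 0 = Normal

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# BertForSequenceClassification = BERT encoder + a linear prediction layer.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="toxic-bert",          # hypothetical output path
    per_device_train_batch_size=16,   # batch size 16, as in the paper
    learning_rate=2e-5,               # 2 × 10⁻⁵, as in the paper
    num_train_epochs=2,               # 2 epochs, as in the paper
)

Trainer(model=model, args=args,
        train_dataset=ToxicDataset(train_texts, train_labels, tokenizer)).train()
```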

# Experimental Results

(See Table 3 in the paper.)

Results: The overall performance of the classifier on the Out-of-Domain test set is quite high: weighted macro-averaged F1 = 0.90.
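In scikit-learn terms, a weighted macro-averaged F1 appears to correspond to `f1_score` with `average="weighted"`, i.e., per-class F1 scores averaged with weights proportional to class support. A toy illustration, with stand-in labels rather than the paper's predictions:

```python
from sklearn.metrics import f1_score

# Illustrative stand-ins; 1 = Toxic, 0 = Normal.
y_true = [1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 1, 1]

# Per-class F1 scores averaged with weights proportional to class support.
print(f1_score(y_true, y_pred, average="weighted"))
```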

# Discussion of Task Formulation

#key-observation

The impact of task formulation: From the task formulations described in Section 3, observe that the Wiki-dataset defines the class Toxic in a general way. The class Founta-Abusive is also a general formulation of offensive behaviour. The similarity of these two definitions is clearly reflected in our results.

# Impact of Data Size on Generalizability

#data-size

Observe that the average accuracies remain unchanged when the dataset's size triples at the same class balance ratio. This finding contrasts with the general assumption that more training data results in higher classification performance.
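One way to sanity-check this finding is to draw stratified subsamples of increasing size at a constant class ratio and compare the resulting scores. A sketch with placeholder data; the training pipeline that would consume each subsample (e.g., the fine-tuning sketch above) is left as a hypothetical hook:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder corpus with a fixed 1:2 class balance ratio.
texts = [f"doc {i}" for i in range(900)]
labels = [1 if i % 3 == 0 else 0 for i in range(900)]

# Stratified subsamples keep the class ratio constant while the size
# triples; each subsample would feed the training pipeline and the
# resulting scores would be compared.
for fraction in (1 / 9, 1 / 3):
    sub_texts, _, sub_labels, _ = train_test_split(
        texts, labels, train_size=fraction, stratify=labels, random_state=0)
    print(len(sub_labels), Counter(sub_labels))  # size triples, ratio stays 1:2
```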

# Discussion

In the task of online abuse detection, both False Positive and False Negative errors can lead to significant harm: the former threatens freedom of speech and damages people's reputations, while the latter ignores hurtful behaviour.

We suggest evaluating each class (both positive and negative) separately, taking into account the potential costs of different types of errors.
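A minimal sketch of such per-class evaluation with scikit-learn's `classification_report`; the labels and class names below are illustrative assumptions:

```python
from sklearn.metrics import classification_report

# Illustrative stand-ins; 1 = Toxic, 0 = Normal.
y_true = [1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 1, 1]

# Precision, recall and F1 reported separately for each class, so False
# Positive and False Negative costs can be weighed class by class.
print(classification_report(y_true, y_pred, target_names=["Normal", "Toxic"]))
```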