On Cross-Dataset Generalization in Automatic Detection of Online Abuse
Reference
Nejadgholi, I., & Kiritchenko, S. (2020). On cross-dataset generalization in automatic detection of online abuse. arXiv preprint arXiv:2010.07414.
# Research Questions
Test and training sets were created for each dataset by performing a stratified split of 20% vs 80%, with the larger part used for training the models. The training sets were further subdivided, keeping 1/8 shares of them as separate validation sets during development and fine-tuning of the hyper-parameters.
- This describes the standard fine-tuning protocol: the full dataset is split into train and test sets while preserving the label distribution, and a validation set is then carved out of the train set (a minimal sketch of this split follows below). Crucially, the test set is never used for training; it is held out to evaluate the generalization performance of the trained model. If a model scores well on the test set, one might hypothesize that it will also perform well on other datasets.
- This paper, however, calls that hypothesis into question.
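A minimal sketch of this split protocol, assuming scikit-learn; the 80/20 stratified split and the 1/8 validation share come from the paper, while the toy data and variable names are placeholders of mine:

```python
from sklearn.model_selection import train_test_split

# Toy placeholder data: comment texts with binary abuse labels.
texts = ["you are awful", "have a nice day", "idiot", "thanks!"] * 50
labels = [1, 0, 1, 0] * 50

# 80% train / 20% test, stratified so the label distribution is preserved.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42
)

# Carve a validation set (a 1/8 share of the training data) out of the train
# split, again stratified; the test set is never touched during development.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=1 / 8, stratify=train_labels, random_state=42
)

print(len(train_texts), len(val_texts), len(test_texts))  # 140 20 40
```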
(…) the aim here, in contrast, was to see how well the best models (that may have learnt some dataset-specific biases) performed on other datasets. This was done to investigate how well state-of-the-art systems perform in a real-life scenario, i.e., when exposed to data from other domains, with the hypothesis that a model trained on one dataset that exhibits comparatively reasonable results on other datasets can be expected to generalise well.
- The aim of this paper is instead to see how well a model trained on one dataset (and which has likely also learnt some of that dataset's biases) performs on other datasets. This mirrors the real-world setting, since a deployed model is eventually exposed to data produced in other domains.
- In doing so, the authors empirically test the hypothesis that 'a model that achieves good performance when trained on one dataset will also generalize well to other datasets'.
# Experimental Method
To explore how well the Toxic class from the Wiki-dataset generalizes to other types of offensive behaviour, we train a binary classifier (Toxic vs. Normal) on the Wiki-dataset (combining the train, development and test sets) and test it on the Out-of-Domain Test set. This classifier is expected to predict a positive (Toxic) label for the instances of classes Founta-Abusive, Founta-Hateful, Waseem-Sexism and Waseem-Racism, and a negative (Normal) label for the tweets in the Founta-Normal class. We fine-tune a BERT-based classifier (Devlin et al., 2019) with a linear prediction layer, the batch size of 16 and the learning rate of 2 × 10⁻⁵ for 2 epochs.
- To see how well a model trained on the Wiki-dataset generalizes to other datasets, the authors train a binary classifier on the Wiki-dataset and test it on the Out-of-Domain Test set. The model is a fine-tuned BERT classifier (a hedged fine-tuning sketch follows below).
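A sketch of such a fine-tuning run, assuming the Hugging Face transformers library; the linear prediction head, batch size 16, learning rate 2 × 10⁻⁵, and 2 epochs are quoted from the paper, while the `bert-base-uncased` checkpoint and the training-loop details are my assumptions, not the authors' code:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

class CommentDataset(Dataset):
    """Wraps tokenized comments and labels for the DataLoader."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# train_texts / train_labels come from the split sketch above.
loader = DataLoader(CommentDataset(train_texts, train_labels), batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate from the paper

model.train()
for epoch in range(2):  # 2 epochs, as in the paper
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy over Toxic vs. Normal
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```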
# Experimental Results
Results: The overall performance of the classifier on the Out-of-Domain test set is quite high: weighted macro-averaged F1 = 0.90.
- Contrary to what one might expect, the overall Out-of-Domain test performance was quite high. However, the authors found that a model trained on the Wiki-dataset's Toxic class is not well suited to classifying the Sexist and Racist classes of the Waseem dataset.
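For reference, the reported metric can be computed with scikit-learn; the labels and predictions below are hypothetical placeholders, not the paper's data:

```python
from sklearn.metrics import f1_score

# Hypothetical stand-ins for the Out-of-Domain gold labels and predictions.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0]

# "Weighted macro-averaged F1": per-class F1 scores averaged with
# class-support weights, i.e. sklearn's average="weighted".
print(f1_score(y_true, y_pred, average="weighted"))
```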
# Discussion of Task Formulation
#key-observation
The impact of task formulation: From task formulations described in Section 3, observe that the Wiki-dataset defines the class Toxic in a general way. The class Founta-Abusive is also a general formulation of offensive behaviour. The similarity of these two definitions is reflected clearly in our results.
- An interesting part of the analysis concerns task formulation. The Wiki dataset defines the Toxic class as follows: 'The class Toxic comprises rude, hateful, aggressive, disrespectful or unreasonable comments that are likely to make a person leave a conversation'.
- The point is that this definition is too general to capture the Sexist and Racist classes of the Waseem dataset.
# Impact of Data Size on Generalizability
#data-size
Observe that the average accuracies remain unchanged when the dataset's size triples at the same class balance ratio. This finding contrasts with the general assumption that more training data results in a higher classification performance.
- The authors then highlight one more interesting point.
- If the class ratio stays fixed, accuracy does not change even when the dataset triples in size. This contradicts the general assumption that more training data always yields higher classification performance (a sketch of this subsampling setup follows below).
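A hypothetical sketch of that setup: subsample the training data at a fixed class ratio, retrain at 1x and 3x the size, and compare accuracies. The stratified subsampling uses scikit-learn; `train_and_score()` is a placeholder for the BERT pipeline sketched earlier, not the authors' code:

```python
from sklearn.model_selection import train_test_split

def subsample_fixed_ratio(texts, labels, n, seed=0):
    """Draw n examples while preserving the class ratio (stratified subsample)."""
    sub_texts, _, sub_labels, _ = train_test_split(
        texts, labels, train_size=n, stratify=labels, random_state=seed
    )
    return sub_texts, sub_labels

# Compare a 1x and a 3x training set at the same class balance; with the toy
# data from the split sketch, that is 40 and 120 examples.
for n in (40, 120):
    sub_texts, sub_labels = subsample_fixed_ratio(texts, labels, n)
    print(n, sum(sub_labels) / len(sub_labels))  # size and positive-class ratio
    # acc = train_and_score(sub_texts, sub_labels, test_texts, test_labels)
```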
# Discussion
In the task of online abuse detection, both False Positive and False Negative errors can lead to significant harm as one threatens the freedom of speech and ruins people's reputations, and the other ignores hurtful behaviour.
- Both False Positives and False Negatives can cause harm: false positives threaten freedom of speech and people's reputations, while false negatives leave hurtful behaviour unaddressed.
We suggest evaluating each class (both positive and negative) separately taking into account the potential costs of different types of errors.
- The authors therefore suggest evaluating each class separately, weighing the potential costs of the different error types (a per-class evaluation sketch follows below).
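As an illustration, scikit-learn's classification_report gives exactly this per-class breakdown; the labels and predictions are hypothetical:

```python
from sklearn.metrics import classification_report

# Hypothetical predictions; the point is the per-class breakdown, where the
# negative (Normal) and positive (Toxic) classes each get their own
# precision, recall, and F1 instead of a single aggregate score.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

print(classification_report(y_true, y_pred, target_names=["Normal", "Toxic"]))
```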