Evaluating NLG systems
- Types of evaluation methods for text generation
- Content overlap metrics
- Model-based Metrics
- Human Evaluations
- Content overlap metrics
- Compute a score that indicates the similarity between generated and gold-standard (human-written) text
- Fast and efficient and widely used
- Two broad categories:
- N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.); a minimal sketch follows this list
- They're not ideal even for machine translation, and they get progressively worse for more open-ended generation tasks such as summarization and dialogue.
- Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr, etc.)
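As a rough, self-contained illustration of how n-gram overlap metrics work, the sketch below computes clipped n-gram precision with a brevity penalty in the spirit of BLEU. The function names and toy sentences are invented for this example; this is not the official BLEU implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu_like_score(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n clipped precisions times a brevity penalty.
    A simplified, sentence-level sketch in the spirit of BLEU, not the real metric."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:  # any zero precision zeroes out the geometric mean
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(log_avg)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(round(bleu_like_score(candidate, reference, max_n=2), 3))  # 0.707
print(round(bleu_like_score(candidate, reference, max_n=4), 3))  # 0.0: no 4-gram matches
```

Even on this toy pair, raising max_n to 4 drives the score to zero because no 4-gram matches, despite the two sentences being near-paraphrases; this brittleness is part of why n-gram metrics are not good enough on their own.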
- Model-based metrics
- Use learned representations of words and sentences to compute semantic similarity between generated and reference texts.
- No more n-gram bottleneck because text units are represented as embeddings!
- Even though the embeddings are pre-trained, the distance metrics used to measure similarity can be fixed (see the sketch below).
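As a minimal sketch of this setup (assuming pre-trained word vectors are available as a plain Python dict; the names and toy vectors below are illustrative only), an Embedding Average-style score averages the word vectors of each text and compares the two sentence vectors with a fixed distance function such as cosine similarity:

```python
import numpy as np

def sentence_embedding(tokens, word_vectors):
    """Embedding Average: mean of the pre-trained vectors of in-vocabulary words."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:  # no known words: fall back to a zero vector
        return np.zeros(len(next(iter(word_vectors.values()))))
    return np.mean(vectors, axis=0)

def cosine_similarity(u, v):
    """A fixed distance metric applied on top of the (pre-trained) embeddings."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def embedding_average_score(candidate, reference, word_vectors):
    c = sentence_embedding(candidate.lower().split(), word_vectors)
    r = sentence_embedding(reference.lower().split(), word_vectors)
    return cosine_similarity(c, r)

# Toy 3-dimensional "pre-trained" vectors, purely for illustration.
toy_vectors = {
    "a": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "runs": np.array([0.0, 0.7, 0.3]),
}
print(embedding_average_score("a puppy runs", "a dog runs", toy_vectors))
```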
- Model-based metrics: Word distance functions
- Vector Similarity
- Embedding Average (Liu et al., 2016)
- Vector Extrema (Liu et al., 2016)
- MEANT (Lo, 2017)
- YISI (Lo, 2019)
- Word Mover's Distance
- BERTScore
- Uses pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020); a simplified sketch follows this list.
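The matching step behind BERTScore can be sketched as greedy cosine matching between contextual token embeddings. The snippet below is a simplified illustration using the HuggingFace transformers library; it omits details of the real metric such as importance (IDF) weighting and baseline rescaling, and the model name is just a common default.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(sentence):
    """Contextual embeddings for each wordpiece, with [CLS] and [SEP] dropped."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    return hidden[1:-1]

def bertscore_f1(candidate, reference):
    """Greedy matching: pair every token with its most similar token on the
    other side, then combine the mean best similarities into an F1 score."""
    c = torch.nn.functional.normalize(token_embeddings(candidate), dim=-1)
    r = torch.nn.functional.normalize(token_embeddings(reference), dim=-1)
    similarity = c @ r.T                             # cosine similarity matrix
    precision = similarity.max(dim=1).values.mean()  # best match per candidate token
    recall = similarity.max(dim=0).values.mean()     # best match per reference token
    return (2 * precision * recall / (precision + recall)).item()

print(bertscore_f1("the weather is cold today", "it is freezing today"))
```

In practice one would normally use the released bert-score package rather than re-implementing the matching, but the hand-rolled version makes the idea concrete.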
- Model-based metrics: Beyond word matching
- Sentence Mover's Similarity
- BLEURT
- A regression model based on BERT returns a score that indicates to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020); see the sketch below.
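To make the idea of a learned, regression-based metric concrete, here is a hedged sketch of the general recipe using HuggingFace transformers: a single-output regression head on top of a pre-trained encoder is fed (reference, candidate) pairs and trained to predict human quality ratings. The training triples are invented for illustration, and this is not the actual BLEURT training procedure, which also involves large-scale pre-training on synthetic data.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A single regression output (num_labels=1) on top of a pre-trained encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Hypothetical training triples: (reference, candidate, human quality rating in [0, 1]).
train_data = [
    ("the cat sat on the mat", "a cat is sitting on the mat", 0.9),
    ("the cat sat on the mat", "stock prices fell sharply today", 0.1),
]

model.train()
for reference, candidate, rating in train_data:
    inputs = tokenizer(reference, candidate, return_tensors="pt", truncation=True)
    predicted = model(**inputs).logits  # shape (1, 1): predicted quality score
    loss = torch.nn.functional.mse_loss(predicted, torch.tensor([[rating]]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At evaluation time, the trained model scores new (reference, candidate) pairs.
model.eval()
with torch.no_grad():
    inputs = tokenizer("the cat sat on the mat", "the cat sat on a rug", return_tensors="pt")
    print(model(**inputs).logits.item())
```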
- Human evaluations
- Most important form of evaluation for text generation systems
- Ask humans to evaluate the quality of generated text
Note
Don't compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions!
- Learning from human feedback
- ADEM (Lowe et al., 2017)
- HUSE (Hashimoto et al., 2019)
Evaluation: Takeaways
- Content overlap metrics provide a good starting point for evaluating the quality of generated text, but they're not good enough on their own.
- Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable.
- Human judgments are critical.
- They are the only ones that can directly evaluate factuality: is the model saying correct things?
- But humans are inconsistent!
- In many cases, the best judge of output quality is YOU!
- Metrics that require no training: these use a mathematical formula to measure the similarity between the generated text and a reference text. They are easy and cheap to compute, but they handle linguistic diversity poorly and may not agree with human judgment. Examples include BLEU, ROUGE, and METEOR.
- Machine-learned metrics: these use a model trained on data to predict the quality or usefulness of generated text. They agree better with human judgment and cope more flexibly with linguistic diversity, but they require training data and considerable compute. Examples include BERTScore and BLEURT.
- Human-centered evaluation: human judges directly rate or compare the quality or usefulness of generated text. This is sensitive to linguistic diversity and trustworthy, but expensive and time-consuming.