Evaluating NLG systems

Last updated Mar 15, 2023

Note

Don’t compare human evaluation scores across differently-conducted studies, even if they claim to evaluate the same dimensions!

Evaluation: Takeaways

  • Content overlap metrics provide a good starting point for evaluating the quality of generated text, but they’re not good enough on their own (see the toy overlap sketch after this list).
  • Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable (see the BERTScore sketch after this list).
  • Human judgments are critical.
    • They are the only judgments that can directly evaluate factuality – is the model saying correct things?
    • But humans are inconsistent!
  • In many cases, the best judge of output quality is YOU!
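
To make the first takeaway concrete, here is a toy content-overlap metric (plain Python, not any particular library’s implementation): an F1 score over unigram counts, a stand-in for BLEU/ROUGE-style scoring. A faithful paraphrase scores poorly because it shares few surface tokens with the reference, which is exactly why overlap metrics alone aren’t enough.

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Toy content-overlap metric: F1 over unigram counts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference  = "the cat sat on the mat"
exact      = "the cat sat on the mat"           # verbatim copy
paraphrase = "a kitten was sitting on the rug"  # same meaning, different words

print(unigram_f1(reference, exact))       # 1.0
print(unigram_f1(reference, paraphrase))  # ~0.31, despite being an adequate paraphrase
```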
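
The second takeaway refers to metrics like BERTScore, which compare contextual embeddings rather than surface tokens. A minimal sketch, assuming the `bert-score` package is installed (the underlying model is downloaded on first use):

```python
# pip install bert-score
from bert_score import score

candidates = ["a kitten was sitting on the rug"]
references = ["the cat sat on the mat"]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(F1.mean().item())  # reflects the semantic similarity that surface overlap misses
```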