AI Papers Reader

Personalized digests of the latest AI research


Word-Level Quality Estimation for Human Post-Editing: A Real-World Evaluation

Machine translation (MT) is rapidly transforming professional translation workflows. While MT systems are improving, human post-editing remains crucial for ensuring high-quality translations, especially in challenging domains. A new study, QE4PE (Quality Estimation for Post-Editing), published on arXiv, investigates whether word-level quality estimation (QE) can make human post-editing faster and more effective.

The QE4PE study involved 42 professional translators working on English-Italian and English-Dutch translations of biomedical and social media texts. The researchers compared four different methods of highlighting potential errors in the MT output:

  1. No Highlight: The translators received the MT output without any error highlighting. This served as a baseline.
  2. Oracle: The translators received the MT output with error highlights based on the consensus of three professional post-editors who had previously edited the same texts. This represented the ideal scenario for error highlighting.
  3. Supervised: The highlights were generated by a state-of-the-art supervised QE model (XCOMET-XXL), trained on human error annotations.
  4. Unsupervised: The highlights leveraged the uncertainty of the MT model during its generation process. This method doesn’t rely on human-annotated data for training.
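The unsupervised setup above relies only on the MT model's own confidence during decoding. A minimal sketch of that idea, assuming access to per-token probabilities from the decoder (the token values, threshold, and function name here are illustrative, not the study's actual implementation):

```python
# Hypothetical sketch of uncertainty-based error highlighting:
# flag tokens whose decoding probability falls below a threshold,
# merging adjacent low-confidence tokens into contiguous spans.
# The 0.5 threshold and example probabilities are made up for illustration.

def highlight_spans(tokens, probs, threshold=0.5):
    """Return (start, end) token-index spans where probability < threshold."""
    spans = []
    start = None
    for i, p in enumerate(probs):
        if p < threshold:
            if start is None:
                start = i          # open a new low-confidence span
        elif start is not None:
            spans.append((start, i))  # close the span at the first confident token
            start = None
    if start is not None:
        spans.append((start, len(tokens)))  # span runs to end of sentence
    return spans

tokens = ["The", "patient", "was", "diagnosed", "with", "a", "benign", "tumour"]
probs  = [0.98, 0.95, 0.97, 0.31, 0.92, 0.88, 0.42, 0.39]
print(highlight_spans(tokens, probs))  # → [(3, 4), (6, 8)]
```

The appeal of this approach, as the study notes, is that it needs no human-annotated training data: the highlights come for free from the generation process itself.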

Translators’ work was logged to measure editing time and productivity. The quality of both the original MT and the post-edited translations was assessed using human annotations and automatic metrics.

The study revealed several key findings:

  • Highlight modality’s effect on productivity is highly variable: The impact of the different highlight methods on translation speed was inconsistent, depending on the language pair, the text domain, and the individual translator’s speed. For instance, highlights sometimes sped up editing of social media texts but often slowed it down for biomedical texts. This points to a gap between the accuracy of QE systems as measured in previous research and their practical usability in real-world settings.

  • High agreement on edits, low agreement on highlighted spans: While the different highlighting modalities had relatively low agreement on which words were problematic, there was substantial agreement on what edits the translators made, suggesting that high accuracy in identifying specific error spans may not be necessary for efficient post-editing assistance.

  • The “oracle” highlights provided the most benefit, but not always: The ideal “oracle” highlights, based on expert consensus, were usually effective but not uniformly so. Even these ideal highlights sometimes increased editing time, a reminder that post-editing behavior is shaped by cognitive factors beyond highlight quality alone.

  • High-quality highlights improved post-editing quality: Using highlights, even imperfect ones, generally led to improvements in the final translation quality. However, automatic quality metrics sometimes struggled to accurately reflect these quality improvements, demonstrating the limitations of purely automated evaluation.

  • User experience matters: While the highlighted spans improved the accuracy of post-editing, many translators did not find the highlighting particularly helpful for improving their efficiency or confidence. The user experience and usability aspects are critical and must be considered alongside accuracy metrics.

The QE4PE study emphasizes that evaluating the impact of QE systems on human post-editing requires a holistic approach, going beyond traditional accuracy metrics to consider usability, productivity, and quality improvements in real-world workflows. The authors make all data and code available to facilitate further research in this area.