Towards better measurement of progress in machine translation: evaluation and meta-evaluation

PROIETTI, LORENZO
2026

Abstract

Reliable evaluation is central to Machine Learning: without it, progress cannot be measured, claims of improvement are hard to verify, and research effort is misdirected. This thesis focuses on Machine Translation (MT), where the goal is to produce a translation that is faithful to the source meaning and fluent in the target language. As modern MT systems, often built on Large Language Models, reach very high quality, the differences between top systems become subtle and progress becomes increasingly difficult to measure. Compounding this, MT is inherently challenging to evaluate: a single source text may admit many equally acceptable translations, and errors can take many forms. These conditions affect both human and automatic evaluation, since annotators and metrics alike must assess translation quality in the presence of many acceptable outputs and diverse error patterns. Meta-evaluation, in turn, must determine how well metric scores align with human judgments and how informative and accurate they are for comparing MT systems; if it rewards metrics for the wrong reasons, it can distort metric rankings and misdirect research and practice in automatic evaluation. In light of these challenges, this thesis examines weaknesses in current methodologies for evaluating translations and for evaluating the evaluators, and proposes solutions that keep measurements of progress fair, robust, and informative. First, we identify fundamental flaws in widely used MT meta-evaluation strategies, showing that they can distort metric rankings by inadvertently rewarding metrics for the wrong reasons, and we introduce a revised protocol that corrects these effects. Second, we anchor metric rankings to human performance by estimating human baselines and placing metrics and humans on the same scale, making performance headroom explicit and showing that claims of human parity are fragile. Third, we show that standard benchmarks have become too easy for modern high-performing MT systems, leading to performance saturation that blunts the measurement of progress, and we introduce a methodology for identifying hard-to-translate source texts, which we use to construct more difficult benchmarks that resist saturation and preserve headroom. Finally, we reframe automatic evaluation as a pairwise comparison between two candidate translations, improving accuracy on high-quality outputs while remaining efficient. Taken together, these contributions keep progress in MT measurable and trustworthy: meta-evaluation that reflects genuine evaluative ability, metric rankings anchored to human performance, benchmarks that remain challenging as systems improve, and automatic evaluation with enough resolution to distinguish high-quality outputs.
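To make the role of meta-evaluation sketched above concrete, here is a minimal, illustrative Python example (not the protocol developed in the thesis; all system names, metric scores, and human judgments are hypothetical). It computes two quantities commonly used in MT meta-evaluation: a segment-level Kendall correlation between metric scores and human judgments, and a system-level pairwise ranking accuracy, i.e. how often the metric agrees with humans on which of two systems is better.

# Illustrative sketch of standard MT meta-evaluation quantities.
# NOT the thesis's protocol: all data and names below are hypothetical.
from itertools import combinations
from statistics import mean

# Hypothetical human judgments and metric scores, 3 systems x 4 segments.
human = {
    "sysA": [0.90, 0.80, 0.70, 0.95],
    "sysB": [0.60, 0.70, 0.50, 0.65],
    "sysC": [0.85, 0.75, 0.80, 0.90],
}
metric = {
    "sysA": [0.88, 0.82, 0.65, 0.97],
    "sysB": [0.55, 0.72, 0.48, 0.60],
    "sysC": [0.80, 0.70, 0.83, 0.88],
}

def kendall_tau_a(xs, ys):
    # Kendall's tau-a: (concordant - discordant) / number of pairs,
    # where tied pairs count as neither concordant nor discordant.
    n_pairs = concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        n_pairs += 1
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs

# Segment level: does the metric order individual translations as humans do?
all_human = [score for sys in human for score in human[sys]]
all_metric = [score for sys in metric for score in metric[sys]]
print("segment-level Kendall tau-a:", kendall_tau_a(all_human, all_metric))

# System level: pairwise ranking accuracy, i.e. the fraction of system pairs
# for which metric and humans agree on which system is better on average.
agreements, pairs = 0, 0
for a, b in combinations(human, 2):
    pairs += 1
    agreements += (mean(human[a]) > mean(human[b])) == (mean(metric[a]) > mean(metric[b]))
print("system-level pairwise accuracy:", agreements / pairs)

A metric that scores well here orders translations and systems the way humans do; the thesis argues that widely used variants of such protocols can reward metrics for the wrong reasons, and revises them accordingly.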
Defense date: 28 Jan 2026
Language: English
Supervisor: NAVIGLI, Roberto
Coordinator: GRISETTI, GIORGIO
University: Università degli Studi di Roma "La Sapienza"
Pages: 149
Files in this record:
Tesi_dottorato_Proietti.pdf (4.24 MB, Adobe PDF): open access, Creative Commons license

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/361831
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-361831