Towards better measurement of progress in machine translation: evaluation and meta-evaluation

PROIETTI, LORENZO
2026

Abstract

Reliable evaluation is central to Machine Learning: without it, progress cannot be measured, claims of improvement are hard to verify, and research effort is misdirected. This thesis focuses on Machine Translation (MT), where the goal is to produce a translation that is faithful to the source meaning and fluent in the target language. As modern MT systems, often built on Large Language Models, reach very high quality, the differences between top systems become subtle and progress becomes increasingly difficult to measure. Compounding this, MT is inherently challenging to evaluate: a single source text may admit many equally acceptable translations, and errors can take many forms. These conditions affect both human and automatic evaluation, since annotators and metrics alike must assess translation quality in the presence of many acceptable outputs and diverse error patterns. Meta-evaluation, in turn, must determine how well metric scores align with human judgments and how informative and accurate they are for comparing MT systems; if it rewards metrics for the wrong reasons, it can distort metric rankings and misdirect research and practice in automatic evaluation. In light of these challenges, this thesis examines weaknesses in current methodologies for evaluating translations and for evaluating the evaluators, and proposes solutions that keep measurements of progress fair, robust, and informative. First, we identify fundamental flaws in widely used MT meta-evaluation strategies, showing that they can distort metric rankings by inadvertently rewarding metrics for the wrong reasons, and we introduce a revised protocol that corrects these effects. Second, we anchor metric rankings to human performance by estimating human baselines and placing metrics and humans on the same scale, making performance headroom explicit and showing that claims of human parity are fragile. Third, we show that standard benchmarks have become too easy for modern high-performing MT systems, leading to performance saturation that blunts the measurement of progress, and we introduce a methodology for identifying hard-to-translate source texts, which we use to construct more difficult benchmarks that resist saturation and preserve headroom. Finally, we reframe automatic evaluation as a pairwise comparison between two candidate translations, improving accuracy on high-quality outputs while remaining efficient. Taken together, these contributions keep progress in MT measurable and trustworthy: meta-evaluation that reflects genuine evaluative ability, metric rankings anchored to human performance, benchmarks that remain challenging as systems improve, and automatic evaluation with enough resolution to distinguish high-quality outputs.
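To make the role of meta-evaluation sketched above concrete, here is a minimal, illustrative Python example (not the protocol developed in the thesis; all system names, metric scores, and human judgments are hypothetical). It computes two quantities commonly used in MT meta-evaluation: a segment-level Kendall correlation between metric scores and human judgments, and a system-level pairwise ranking accuracy, i.e. how often the metric agrees with humans on which of two systems is better.

# Illustrative sketch of standard MT meta-evaluation quantities.
# NOT the thesis's protocol: all data and names below are hypothetical.
from itertools import combinations
from statistics import mean

# Hypothetical human judgments and metric scores, 3 systems x 4 segments.
human = {
    "sysA": [0.90, 0.80, 0.70, 0.95],
    "sysB": [0.60, 0.70, 0.50, 0.65],
    "sysC": [0.85, 0.75, 0.80, 0.90],
}
metric = {
    "sysA": [0.88, 0.82, 0.65, 0.97],
    "sysB": [0.55, 0.72, 0.48, 0.60],
    "sysC": [0.80, 0.70, 0.83, 0.88],
}

def kendall_tau_a(xs, ys):
    # Kendall's tau-a: (concordant - discordant) / number of pairs,
    # where tied pairs count as neither concordant nor discordant.
    n_pairs = concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        n_pairs += 1
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs

# Segment level: does the metric order individual translations as humans do?
all_human = [score for sys in human for score in human[sys]]
all_metric = [score for sys in metric for score in metric[sys]]
print("segment-level Kendall tau-a:", kendall_tau_a(all_human, all_metric))

# System level: pairwise ranking accuracy, i.e. the fraction of system pairs
# for which metric and humans agree on which system is better on average.
agreements, pairs = 0, 0
for a, b in combinations(human, 2):
    pairs += 1
    agreements += (mean(human[a]) > mean(human[b])) == (mean(metric[a]) > mean(metric[b]))
print("system-level pairwise accuracy:", agreements / pairs)

A metric that scores well here orders translations and systems the way humans do; the thesis argues that widely used variants of such protocols can reward metrics for the wrong reasons, and revises them accordingly.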
Defense date: 28 Jan 2026
Language: English
Supervisor: NAVIGLI, Roberto
Coordinator: GRISETTI, GIORGIO
University: Università degli Studi di Roma "La Sapienza"
Pages: 149
Files in this record:
Tesi_dottorato_Proietti.pdf (4.24 MB, Adobe PDF): open access, Creative Commons license

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/361831
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-361831