LLM as a Judge / QE for translation

Dear Huggingface community,

It’s been something of a holy grail to evaluate machine translation without reference translations.

Since the early days of statistical NLP, with vector- and similarity-based features over sentences/documents and feature-based frameworks like QuEst (QuEst: A framework for translation quality estimation - ACL Anthology), we’ve been chasing referenceless evaluation.

Today we do have SOTA models-as-a-judge that can evaluate other NLP tasks close to “human parity”, but how close are we to truly trustworthy QE scores? Especially given that the recent WMT25 results show that good old reference-based chrF is still more trustworthy than most QE metrics.
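For context, chrF is just a character n-gram F-score against a reference. A minimal self-contained sketch (simplified from the standard definition, with the usual defaults of character order 6 and beta = 2; the real sacreBLEU implementation adds whitespace handling details and optional word n-grams):

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF works on character n-grams with whitespace removed
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: average char n-gram F-beta over orders 1..max_order."""
    f_scores = []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec, rec = overlap / hyp_total, overlap / ref_total
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

The point of the comparison above is that this needs a reference; QE has to produce a comparable score from the source and hypothesis alone.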

Is QE still an emerging subfield of MT? And would LLM-as-a-judge style evaluation lead to the same tail-chasing situation as QE?

Does anyone have good, recent paper recommendations on these hopes and fears of QE / LLM-as-a-judge?

Looking forward to your recommendations!


For now, I’ve gathered some resources.

Nice overview and answers, good point on Goodhart’s law too. Wondering why you aren’t posting it here instead of putting it up as a dataset link =)

Partial signals aren’t very helpful in most cases; we might as well go back to feature/heuristic-based approaches. For LLM-as-a-judge, or even QE metrics, we usually want the metric to be actionable: either go back and patch errors in the model/data before deployment, or stop the model from emitting results to users at inference time.
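That second use (stopping bad outputs at inference time) is basically a threshold gate on the QE score. A toy sketch of the shape of it, using a classic heuristic QE feature (source/target length ratio) as a stand-in for a real scorer; the scorer, function names, and threshold here are all hypothetical illustrations, and in practice you’d plug in a trained QE model or an LLM-judge call instead:

```python
def length_ratio_qe(source, translation):
    """Toy referenceless signal in [0, 1]: penalize length mismatches.

    Length ratio is a classic baseline QE feature; a real system would
    replace this with a trained QE model or an LLM-as-a-judge score.
    """
    if not source or not translation:
        return 0.0
    ratio = len(translation) / len(source)
    return max(0.0, 1.0 - abs(1.0 - ratio))

def gate(source, translation, threshold=0.5):
    """Emit the translation only when the QE score clears the threshold.

    Returning None stands in for whatever fallback the system uses
    (human review, a safer model, or refusing to show the output).
    """
    if length_ratio_qe(source, translation) >= threshold:
        return translation
    return None
```

The whole thread’s worry applies directly here: if the QE score itself isn’t trustworthy, this gate silently blocks good translations and passes bad ones.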

BTW, not to sound offensive, but just out of curiosity: were the “gathered resources” generated by an LLM or something? If so, which one?


Wondering why you aren’t posting it here instead of putting it up as a dataset link =)

It’s just too big…:sweat_smile:

were the “gathered resources” generated by an LLM or something?

Yeah.:sweat_smile:

LLMs excel at summarizing when forced to use prior knowledge (primarily conversation history)…
The process goes like this: first, I provide context to the LLM (copy-paste and modify your post → add my own notes or knowledge from my HDD → narrow down the target and have the LLM search to gather resources → if that’s insufficient, ask the LLM for further reasoning → …). Then I have it summarize again. If the result doesn’t look too nonsensical, I adopt it.

If so, which one?

GPT-5.1 Thinking. But Gemini 3 (released today) may be far more powerful.


One aspect I like about the LLM-as-a-judge (with reasoning) flavor of QE is that it gives us hope of going beyond human performance in evaluation.
