LLM as a Judge / QE for translation

Dear Huggingface community,

It’s been something of a holy grail to evaluate machine translation without reference translations.

Since the early days of statistical NLP, with vector- and similarity-based features over sentences/documents and feature-based frameworks like QuEst (QuEst: A framework for translation quality estimation - ACL Anthology), we’ve been chasing referenceless evaluation.

Today we do have SOTA models-as-a-judge that can evaluate other NLP tasks close to “human parity”, but how close are we to truly trustworthy QE scores? Especially given that the recent WMT25 results show that good old reference-based chrF is still more trustworthy than most QE metrics.
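For context, chrF is just a character n-gram F-score against a reference. A minimal self-contained sketch (simplified from the standard definition, with the usual defaults of character order 6 and beta = 2; the real sacreBLEU implementation adds whitespace handling details and optional word n-grams):

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF works on character n-grams with whitespace removed
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: average char n-gram F-beta over orders 1..max_order."""
    f_scores = []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec, rec = overlap / hyp_total, overlap / ref_total
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

The point of the comparison above is that this needs a reference; QE has to produce a comparable score from the source and hypothesis alone.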

Is QE still an emerging subfield of MT? And would LLM-as-a-judge style evaluation lead to the same tail-chasing situation as QE?

Does anyone have good, recent paper recommendations on these hopes and fears of QE / LLM-as-a-judge?

Looking forward to your recommendations!


For now, I’ve gathered some resources.

Nice overview and answers, good point on Goodhart’s law too. Wondering why you aren’t posting it here instead of putting it up as a dataset link =)

Partial signals aren’t very helpful in most cases; we might as well go back to feature/heuristic-based approaches. For LLM-as-a-judge, or even QE metrics, we usually want the metric to be actionable: either go back and patch errors in the model/data before deployment, or stop the model from emitting results to users at inference time.
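That second use (stopping bad outputs at inference time) is basically a threshold gate on the QE score. A toy sketch of the shape of it, using a classic heuristic QE feature (source/target length ratio) as a stand-in for a real scorer; the scorer, function names, and threshold here are all hypothetical illustrations, and in practice you’d plug in a trained QE model or an LLM-judge call instead:

```python
def length_ratio_qe(source, translation):
    """Toy referenceless signal in [0, 1]: penalize length mismatches.

    Length ratio is a classic baseline QE feature; a real system would
    replace this with a trained QE model or an LLM-as-a-judge score.
    """
    if not source or not translation:
        return 0.0
    ratio = len(translation) / len(source)
    return max(0.0, 1.0 - abs(1.0 - ratio))

def gate(source, translation, threshold=0.5):
    """Emit the translation only when the QE score clears the threshold.

    Returning None stands in for whatever fallback the system uses
    (human review, a safer model, or refusing to show the output).
    """
    if length_ratio_qe(source, translation) >= threshold:
        return translation
    return None
```

The whole thread’s worry applies directly here: if the QE score itself isn’t trustworthy, this gate silently blocks good translations and passes bad ones.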

BTW, not to sound offensive, but just out of curiosity: were the “gathered resources” generated by an LLM or something? If so, which one?


Wondering why you aren’t posting it here instead of putting it up as a dataset link =)

It’s just too big…:sweat_smile:

were the “gathered resources” generated by an LLM or something?

Yeah.:sweat_smile:

LLMs excel at summarizing when forced to use prior knowledge (primarily conversation history)…
The process goes like this: first, I provide context to the LLM (copy-paste and modify your post → add my own notes or knowledge from my HDD → narrow down the target and have the LLM search to gather resources → if that’s insufficient, ask the LLM for further reasoning → …). Then I have it summarize again. If the result doesn’t look too nonsensical, I adopt it.

If so, which one?

GPT-5.1 Thinking. But Gemini 3 (released today) may be far more powerful.


One aspect I like about the LLM-as-a-judge (with reasoning) flavor of QE is that it gives us hope of going beyond human performance in evaluation.
