You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation


Gergely Flamich

07/08/2025

gergely-flamich.github.io

the goals of translation

Accuracy

Translation carries the meaning of the source text

Naturalness

Translation sounds good in target language

measuring translation quality

Ingredients

  • Have dataset of source text
  • Have dataset of human reference translations
  • Translation system \(Q_{y \mid x}\) to translate source text

Human Evaluations

Multidimensional Quality Metrics (MQM) [2]

Classic Automated Metrics

Examples: BLEU, chrF

Purely symbolic: compare to human reference translation

✅ simple

❌ limited by the human reference
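To make "purely symbolic" concrete, here is a minimal sketch of a character n-gram F-score in the spirit of chrF. It is a simplification, not the official metric: the function names, the single n-gram order `n=3` and the default `beta=2.0` are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_fscore(hypothesis, reference, n=3, beta=2.0):
    """Simplified character n-gram F-beta score (illustrative, not official chrF)."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

The score only looks at surface overlap with the reference, so a fluent and accurate translation that is worded differently from the reference is penalised. This is the "limited by the human reference" drawback.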

Neural Metrics

Examples: MetricX, COMET

Large language model-based: predict MQM scores

✅ Jointly assess accuracy and naturalness

Feeding two birds with one score

Should we assess accuracy and naturalness jointly?

Table from WMT24 findings paper [1].

where we are

  • No formal notion of accuracy or naturalness
  • Their interaction not well understood
  • Community uses single-score assessments
  • Results don't seem to align with human evals

Information theory to the rescue

💡 Blau and Michaeli [3] already solved a similar issue!

Accuracy \(\leftrightarrow\) Distortion

Naturalness \(\leftrightarrow\) Realism/Distinguishability

Ingredients

  • Source sentence \(x\)
  • Translation system \(Q_{y \mid x}\)
  • Reference translation \(y^r \sim Q_{y \mid x}^{\mathrm{human}}\)
  • Hypothesis/candidate \(y^c \sim Q_{y \mid x}\)

Accuracy

  • Distortion metric: \(\Delta(x, y^r, y^c) \geq 0\)
  • Accuracy: average negative distortion
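Written out, "average negative distortion" could read as follows. This is a sketch: the source distribution \(P_x\) and the symbol \(A(Q_{y \mid x})\) for the accuracy of a system are notation assumed here, not taken from the slide.

\[ A(Q_{y \mid x}) = -\mathbb{E}_{x \sim P_x,\; y^r \sim Q_{y \mid x}^{\mathrm{human}},\; y^c \sim Q_{y \mid x}}\left[\Delta(x, y^r, y^c)\right] \leq 0 \]

Under this sign convention a higher value means a more accurate system, and the best achievable accuracy is \(0\).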

Naturalness: Intuition

Naturalness: Definition I

  • Identify naturalness with distinguishability
  • We pick a reference distribution \(R_y\)

Naturalness: Definition II

  • Translation marginal \(Q_y\)
  • Distinguishability is equivalent to statistical distance \(D(Q_y, R_y)\)

\[ N(Q_{y \mid x}) = -D(Q_y, R_y) \]

the divergence

Let \(P(f) = \mathbb{E}_{X \sim P}[f(X)]\)

Integral probability metric (IPM):

\[ \mathrm{IPM}_{\mathcal{F}}[Q \Vert P] = \sup_{f \in \mathcal{F}}\lvert Q(f) - P(f) \rvert \]

Optimal critic \(f^\star\)

\[ f^\star = \mathrm{argmax}_{f \in \mathcal{F}}\lvert Q(f) - P(f) \rvert \]
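As an illustration of the definition, the IPM can be estimated from samples when \(\mathcal{F}\) is a small finite family of critics. The sketch below assumes scalar samples and a hand-picked critic list. Both are simplifications, and `empirical_ipm` is a hypothetical helper name.

```python
import numpy as np


def empirical_ipm(xs_q, xs_p, critics):
    """Estimate the IPM over a finite critic family from samples.

    Returns the largest gap |Q(f) - P(f)| and the index of the critic
    attaining it (the empirical optimal critic).
    """
    gaps = [abs(np.mean(f(xs_q)) - np.mean(f(xs_p))) for f in critics]
    best = int(np.argmax(gaps))
    return float(gaps[best]), best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs_q = rng.normal(loc=0.5, size=1000)  # stand-in samples from Q
    xs_p = rng.normal(loc=0.0, size=1000)  # stand-in samples from P
    print(empirical_ipm(xs_q, xs_p, [np.sin, np.cos, np.tanh]))
```

Note that the index of the best critic is recomputed for every pair of sample sets: the optimal critic depends on the distributions being compared.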

The Equivalence

Set \(\epsilon = \mathbb{P}[b = 1]\) and define the loss:

\begin{gather*} L(b, \alpha) = \begin{cases} -\alpha / \epsilon & \text{if } b = 1 \\ \alpha / (1 - \epsilon) & \text{if } b = -1 \end{cases} \end{gather*}

Then:

\begin{gather*} R_{\mathcal{F}}^L = \inf_{f \in \mathcal{F}}\mathbb{E}[L(\mathbf{b}, f(\mathbf{y}_{\mathbf{b}}))] = -\mathrm{IPM}_{\mathcal{F}}[Q \Vert P] \end{gather*}
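The step behind this identity can be sketched in two lines, under two assumptions that the slide does not state explicitly: \(\mathbf{y}_{1} \sim Q\) and \(\mathbf{y}_{-1} \sim P\), and \(\mathcal{F}\) is closed under negation (\(f \in \mathcal{F} \Rightarrow -f \in \mathcal{F}\)). Conditioning on the label \(\mathbf{b}\) gives

\begin{align*} \mathbb{E}[L(\mathbf{b}, f(\mathbf{y}_{\mathbf{b}}))] &= \epsilon \cdot \left(-\frac{Q(f)}{\epsilon}\right) + (1 - \epsilon) \cdot \frac{P(f)}{1 - \epsilon} \\ &= P(f) - Q(f) \end{align*}

and taking the infimum over a symmetric class yields

\[ \inf_{f \in \mathcal{F}} \left(P(f) - Q(f)\right) = -\sup_{f \in \mathcal{F}} \left(Q(f) - P(f)\right) = -\sup_{f \in \mathcal{F}} \lvert Q(f) - P(f) \rvert \]

In words: the better the best classifier in \(\mathcal{F}\) can tell translations from reference text, the larger the IPM.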

Are perfect accuracy and naturalness the same?

Perfect naturalness \(\Rightarrow\) perfect accuracy?

Perfect accuracy \(\Rightarrow\) perfect naturalness?

  • No, according to Blau and Michaeli's setup [3]
  • No, according to our setup [4]

what is the tradeoff like?

Accuracy-naturalness function:

  • \(A(N)\) is non-increasing
  • If \(D\) is convex in its first argument, then \(A(N)\) is concave
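A plausible way to write the accuracy-naturalness function, by analogy with the perception-distortion function of [3], is given below. The exact form is an assumption here, with \(A(Q_{y \mid x})\) denoting the average negative distortion of a system:

\[ A(N) = \sup_{Q_{y \mid x}} A(Q_{y \mid x}) \quad \text{subject to} \quad N(Q_{y \mid x}) \geq N \]

Under this form, monotonicity is immediate: raising the required naturalness \(N\) shrinks the set of admissible systems, so the best achievable accuracy cannot go up.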

Approximating the curve

💡 Use LLM scores to judge the translations!

🤔 Does this correspond to some \(D(Q, P)\)?

EN \(\to\) DE: I’ve wanted to fly since I was a child.

where is the SOTA?

Close to the curve, accuracy and naturalness anti-correlate

the issue and the fix

\(\mathrm{IPM}_{\mathcal{F}}[Q \Vert P] = \sup_{f \in \mathcal{F}}\lvert Q(f) - P(f) \rvert\)

❌ \(f^\star\) depends on \(Q\)!

Fix: average instead of maximising

Let \(f \sim \mathcal{P}\)

\begin{align*} D_p(Q, P \mid \mathcal{P}) &= \mathcal{P}(\vert Q - P \vert^p)^{1/p} \\ &= \mathbb{E}_{f \sim \mathcal{P}}[\vert Q(f) - P(f) \vert^p]^{1/p} \end{align*}
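The averaged divergence can be approximated by Monte Carlo: draw critics once, independently of the systems being compared, and average the gaps between sample means. The sketch below is illustrative only. It assumes scalar samples and uses random cosine features as a stand-in for \(\mathcal{P}\), and `sample_critics` and `estimate_dp` are hypothetical names.

```python
import numpy as np


def sample_critics(num_critics, rng):
    """Draw parameters of random critics f_n(x) = cos(w_n * x + b_n).

    This critic distribution is an assumption made for illustration.
    """
    w = rng.normal(size=num_critics)
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_critics)
    return w, b


def estimate_dp(xs_q, xs_p, w, b, p=1):
    """Monte Carlo estimate of D_p from samples of Q and P.

    For each critic, take the gap between its sample mean under Q and
    under P, then form the p-th power mean of the gaps over critics.
    """
    mean_q = np.cos(np.outer(w, xs_q) + b[:, None]).mean(axis=1)
    mean_p = np.cos(np.outer(w, xs_p) + b[:, None]).mean(axis=1)
    return float(np.mean(np.abs(mean_q - mean_p) ** p) ** (1.0 / p))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, b = sample_critics(256, rng)
    xs_q = rng.normal(loc=0.5, size=2000)  # stand-in samples from Q
    xs_p = rng.normal(loc=0.0, size=2000)  # stand-in samples from P
    print(estimate_dp(xs_q, xs_p, w, b, p=1))
```

Because the critics are fixed before looking at any system output, the same set can score every system, which is what removes the dependence of the critic on \(Q\).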

Some interesting properties

✅ \(D_p\) a metric under some sensible conditions

✅ Can estimate from samples, without knowing \(Q\): \[ D_1(Q, P \mid \mathcal{P}) \approx \frac{1}{N} \sum_{n = 1}^N \left\lvert \sum_{m = 1}^{M_Q}\frac{f_n(X_m)}{M_Q} - \sum_{m = 1}^{M_P}\frac{f_n(Y_m)}{M_P} \right\rvert \] with \(f_n \sim \mathcal{P}\), \(X_m \sim Q\) and \(Y_m \sim P\)

✅ When \(\mathcal{P}\) is a GP, \(D_2\) corresponds to MMD
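The link to MMD can be sketched as follows, assuming a zero-mean Gaussian process \(f \sim \mathcal{GP}(0, k)\) with kernel \(k\) and enough regularity to exchange the integrals. Since \(Q(f) - P(f)\) is a linear functional of \(f\), it is a zero-mean Gaussian variable, and its second moment is

\begin{align*} D_2(Q, P \mid \mathcal{P})^2 &= \mathbb{E}_{f}\left[(Q(f) - P(f))^2\right] \\ &= \mathbb{E}_{y, y' \sim Q}[k(y, y')] - 2\,\mathbb{E}_{y \sim Q,\, y' \sim P}[k(y, y')] + \mathbb{E}_{y, y' \sim P}[k(y, y')] \\ &= \mathrm{MMD}_k^2(Q, P) \end{align*}

where the middle line uses \(\mathbb{E}_f[f(y) f(y')] = k(y, y')\).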

contributions

  • Proposed a formal definition of accuracy and naturalness
  • Extended the theory of Blau and Michaeli
  • Showed that a tradeoff must exist in practice
  • Assessed the performance of the current state-of-the-art
  • Showed connection between no-reference metrics and statistical distances

References I

  • [1] Kocmi et al. (2024). Findings of the WMT24 general machine translation shared task: the LLM era is here but MT is not solved yet. In Proceedings of the Ninth Conference on Machine Translation (pp. 1-46).
  • [2] Freitag et al. (2021). Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9, 1460-1474.

References II

  • [3] Blau and Michaeli (2018). The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6228-6237).
  • [4] F et al. (2025). You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation. arXiv preprint arXiv:2503.24013.

References III

  • [5] Sriperumbudur et al. (2009). On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698.