arxiv:2606.06428

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Abstract

A reinforcement learning approach enables large language models to translate unseen languages by leveraging in-context linguistic knowledge rather than memorizing specific languages.

Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.
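
To make the reward in the abstract concrete, here is a minimal sketch of a chrF-style score used as a scalar reward. It is a simplified illustration, not the authors' implementation: the function names `char_ngrams` and `chrf_reward`, the defaults `max_n=6` and `beta=2.0`, and the choice to strip whitespace before counting n-grams are all assumptions. Established chrF implementations differ in details such as whitespace handling and how the per-order scores are averaged.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams (assumption: whitespace is removed first)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_reward(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF in [0, 1]: character n-gram F-beta score.

    Precision and recall are averaged over n-gram orders 1..max_n,
    skipping orders for which either string is too short.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string shorter than n characters
        # Counter intersection gives clipped (min) counts per n-gram.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Hypothetical sampled translation scored against a reference.
    print(chrf_reward("the cat sat on a mat", "the cat sits on the mat"))
```

In an outcome-based RL loop, a score like this would be computed once per sampled translation and passed to the policy update as the reward. Because it counts character n-grams rather than whole words, a translation with the right stem but the wrong inflection still earns partial credit.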

Community

Paper submitter

In this paper, we propose a reinforcement learning approach to unseen language translation given rich linguistic context. We argue that LLMs can acquire the meta-skill of utilizing contextual knowledge rather than memorizing specific languages, and can therefore generalize to unseen languages.

The paper shows that reinforcement learning with a surface-level chrF reward can train models to extract and apply linguistic information from rich in-context descriptions, outperforming both in-context learning and supervised fine-tuning on completely unseen languages. This suggests outcome-based RL enables acquisition of a transferable meta-skill rather than language-specific memorization.

How might the chrF reward’s emphasis on surface overlap shape which aspects of the provided linguistic context the model learns to prioritize during RL?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/07923618-2812-4b7c-b9ec-5e9a94a7451d


Get this paper in your agent:

hf papers read 2606.06428
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
