arxiv:2601.22975

Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Published on Jan 30 · Submitted by Ximing Lu on Feb 2

Abstract

AI-generated summary

Golden Goose synthesizes unlimited RLVR tasks from unverifiable internet text by creating multiple-choice question-answering versions of fill-in-the-middle tasks, enabling large-scale training and achieving state-of-the-art results in cybersecurity and other domains.

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet scaling up RL is bottlenecked by the limited supply of existing verifiable data, so improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
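To make the construction concrete, here is a minimal Python sketch of the task-synthesis step described in the abstract. It is an illustration under assumptions, not the authors' released pipeline: `llm` stands for any prompt-to-completion callable, the prompt wording and field names are invented, and the model is optimistically assumed to return well-formed JSON.

```python
import json
import random

def synthesize_goose_task(source_text: str, llm, n_distractors: int = 3) -> dict:
    """Turn an unverifiable passage into a verifiable multiple-choice
    fill-in-the-middle task (hypothetical sketch of the Golden Goose recipe)."""
    # 1. Ask the model to identify and mask one key reasoning step.
    masked = json.loads(llm(
        "Identify one key reasoning step in the passage below, replace it with "
        "[MASK], and return JSON with fields 'masked_text' and 'gold_step'.\n\n"
        f"Passage:\n{source_text}"
    ))

    # 2. Ask for diverse, plausible-but-wrong alternatives to the masked step.
    distractors = json.loads(llm(
        f"Return a JSON list of {n_distractors} plausible but incorrect "
        "alternatives for the masked step.\n\n"
        f"Masked passage:\n{masked['masked_text']}\n"
        f"Correct step: {masked['gold_step']}"
    ))

    # 3. Shuffle the options and record which letter holds the gold step,
    #    so correctness reduces to a simple string match (hence "verifiable").
    options = distractors + [masked["gold_step"]]
    random.shuffle(options)
    letters = [chr(ord("A") + i) for i in range(len(options))]
    answer = letters[options.index(masked["gold_step"])]

    return {
        "question": masked["masked_text"],      # passage with [MASK]
        "options": dict(zip(letters, options)),
        "answer": answer,                       # gold option letter
    }
```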

Community

Paper submitter

TL;DR: We introduce Golden Goose 🦢, a simple method that synthesizes unlimited RLVR tasks from unverifiable internet text by constructing multiple-choice fill-in-the-middle problems. This enables the use of reasoning-rich unverifiable corpora typically excluded from prior RLVR data curation (e.g., science textbooks), allowing RL to scale beyond the data saturation of existing RLVR datasets and achieve new SoTA results with 1.5B and 4B-Instruct models. In a real-world deployment in cybersecurity, where no prior RLVR data exists, Golden Goose synthesizes RLVR tasks from raw FineWeb scrapes, yielding a new SoTA 4B cybersecurity LLM that surpasses a 7B domain-specialized model.

Turning pre-training datasets into the best RLVR tasks! 🎉
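Because each synthesized task carries a single gold option, the RLVR reward can be a trivial rule-based check, which is what makes otherwise unverifiable text usable for RL. A minimal sketch, assuming the policy's final option letter is extracted from its output (the paper's exact answer extraction and any reward shaping may differ):

```python
def goose_reward(model_output: str, gold_letter: str) -> float:
    """Binary verifiable reward: 1.0 if the predicted option letter matches
    the gold letter, else 0.0 (hypothetical sketch, not the authors' code)."""
    # Treat the last standalone A/B/C/D token in the output as the prediction.
    tokens = model_output.replace(".", " ").replace(")", " ").split()
    picks = [t.upper() for t in tokens if len(t) == 1 and t.upper() in "ABCD"]
    return 1.0 if picks and picks[-1] == gold_letter.upper() else 0.0
```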

