arxiv:2605.29307

GrepSeek: Training Search Agents for Direct Corpus Interaction

Published on May 28

· Submitted by

Alireza Salemi on Jun 1

University of Massachusetts Amherst

Authors: Alireza Salemi, Chang Zeng, Hamed Zamani

Abstract

GrepSeek enables efficient search agent training through direct corpus interaction using shell commands and a two-stage approach combining a cold-start dataset and group relative policy optimization.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level F1 and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.

Community

alireza7

Paper author Paper submitter 3 days ago

Traditional indexing for large language model personalization and search can be rigid and resource-intensive. In this paper, we explore a different route: Direct Corpus Interaction. We show how training models to formulate and execute complex shell pipelines can unlock highly scalable, accurate search over millions of documents directly from the terminal. Check out the repository for the full implementation!

avahal

about 5 hours ago

that shard-parallel, semantics-preserving execution engine that guarantees byte-exact equivalence with sequential shell runs is the part i keep coming back to. it feels like a clean way to scale direct corpus interaction without embeddings, but i wonder about practical brittleness with non-deterministic shell steps or dynamic corpus changes. the vibe of trading offline indexing for live execution is compelling for lexical tracing, though i’d love to see a latency vs shard count ablation. btw the arxivlens breakdown helped me parse the method details, especially the shard orchestration and the correctness vibe. what happens if a shell command has side effects or emits non-deterministic output; how would you ensure robustness in those edge cases?

alireza7

Paper author about 2 hours ago

Thanks for the thoughtful read! I agree that this is exactly the main robustness question for DCI-style systems.

The byte-exact guarantee in GrepSeek is not meant to cover arbitrary shell programs. It is scoped to corpus-search pipelines whose behavior can be reconstructed exactly from shard-local execution. The engine first classifies the command: if the pipeline is shard-independent and deterministic, it can be parallelized; if it involves global/cross-line state, unsupported structure, or anything unsafe, the system conservatively falls back to sequential execution.
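
As a rough illustration of the classify-then-dispatch idea described above (not GrepSeek's actual engine), the sketch below models a pipeline as a list of `(op, arg)` stages in pure Python. The names `LINE_LOCAL_OPS`, `is_shard_independent`, `split_shards`, and `run` are hypothetical, and the only operations modeled are a line-local `grep` and a position-dependent `head`. Shard outputs are concatenated in shard order, which is what lets a line-local pipeline reproduce the sequential result exactly.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: ops whose output for each line depends only on that line.
LINE_LOCAL_OPS = {"grep"}

def run_stage(stage, lines):
    """Apply one (op, arg) stage to a list of lines."""
    op, arg = stage
    if op == "grep":   # keep matching lines, preserving input order
        return [ln for ln in lines if re.search(arg, ln)]
    if op == "head":   # depends on global line position, so not line-local
        return lines[: int(arg)]
    raise ValueError(f"unsupported op: {op}")

def run_sequential(pipeline, lines):
    """Reference semantics: run every stage over the whole corpus."""
    for stage in pipeline:
        lines = run_stage(stage, lines)
    return lines

def is_shard_independent(pipeline):
    """True only if every stage is line-local in this toy model."""
    return all(op in LINE_LOCAL_OPS for op, _ in pipeline)

def split_shards(lines, n_shards):
    """Cut the corpus into contiguous, ordered chunks."""
    size = max(1, -(-len(lines) // n_shards))  # ceiling division
    return [lines[i : i + size] for i in range(0, len(lines), size)]

def run(pipeline, lines, n_shards=4):
    """Parallelize when safe; otherwise fall back to sequential execution."""
    if not is_shard_independent(pipeline):
        return run_sequential(pipeline, lines)
    shards = split_shards(lines, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        # pool.map yields results in shard order, not completion order.
        parts = list(pool.map(lambda s: run_sequential(pipeline, s), shards))
    return [ln for part in parts for ln in part]

if __name__ == "__main__":
    corpus = [f"doc{i} paris" if i % 3 == 0 else f"doc{i} rome" for i in range(10)]
    pipeline = [("grep", "paris")]
    print(run(pipeline, corpus, n_shards=3) == run_sequential(pipeline, corpus))
```

A real engine would have to classify actual shell syntax, including flags such as line numbering or match counts that introduce cross-line state; this toy model sidesteps that by construction.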

For side effects, the clean practical answer is to treat the shell interface as a restricted, sandboxed search language rather than a fully general shell. In production, I would enforce read-only corpus mounts, a whitelist of supported commands, no redirects or file writes, no network access, no access to time/randomness/environment-dependent outputs, and resource/time limits. So commands like rm, > writes, date, random sampling, or process-dependent outputs should be rejected rather than parallelized.
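
A minimal sketch of the command-level part of such a contract, assuming a hypothetical whitelist (`ALLOWED_COMMANDS`) and a hypothetical helper `is_allowed`. It only checks syntax: pipes between whitelisted commands are accepted, while redirects, command chaining, subshells, and backticks are rejected. Read-only mounts, network isolation, and resource limits would have to be enforced at the OS or container level and are not shown.

```python
import shlex

# Assumption: an illustrative whitelist, not the one used by GrepSeek.
ALLOWED_COMMANDS = {"grep", "cut", "head", "sort", "uniq", "wc"}
SHELL_PUNCTUATION = set("();<>|&")

def tokenize(command):
    """Split a command line so shell operators become separate tokens."""
    lex = shlex.shlex(command, posix=True, punctuation_chars=True)
    lex.whitespace_split = True
    lex.commenters = ""  # treat '#' as ordinary text rather than a comment
    return list(lex)

def is_allowed(command):
    """Accept only pipelines built from whitelisted commands joined by '|'."""
    try:
        tokens = tokenize(command)
    except ValueError:  # e.g. an unterminated quote
        return False
    stages = [[]]
    for tok in tokens:
        if tok == "|":
            stages.append([])
        elif set(tok) <= SHELL_PUNCTUATION or "`" in tok:
            # Redirects, ';', '&&', '||', subshell parentheses, backticks.
            return False
        else:
            stages[-1].append(tok)
    # Every stage must be non-empty and start with a whitelisted command.
    return all(stage and stage[0] in ALLOWED_COMMANDS for stage in stages)

if __name__ == "__main__":
    for cmd in ["grep -i paris corpus.txt | head -n 5",
                "grep paris corpus.txt > out.txt",
                "rm -rf corpus",
                "date"]:
        print(cmd, "->", "accept" if is_allowed(cmd) else "reject")
```

The check is deliberately conservative: a quoted pattern that consists only of shell punctuation is also rejected, which trades a few false rejections for a simpler rule.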

For dynamic corpora, I actually see DCI as having an advantage over embedding-based systems. Since the agent interacts directly with the corpus at execution time, the system can load a new or updated corpus immediately without requiring an offline indexing or embedding-refresh step. The main requirement is to define the execution semantics clearly: each query or multi-step trajectory should run against a fixed corpus snapshot/version so the results remain consistent while the corpus may continue changing in the background.
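
One way to picture the snapshot requirement is the in-memory sketch below. `VersionedCorpus` and `Trajectory` are hypothetical names used only for illustration; a real deployment would more likely pin a filesystem snapshot or an immutable directory per version.

```python
class VersionedCorpus:
    """Keeps immutable snapshots; publishing never alters an old version."""

    def __init__(self):
        self._snapshots = {}
        self._latest = 0

    def publish(self, lines):
        """Store a new snapshot and return its version number."""
        self._latest += 1
        self._snapshots[self._latest] = tuple(lines)
        return self._latest

    def latest(self):
        return self._latest

    def read(self, version):
        return self._snapshots[version]


class Trajectory:
    """A multi-step search that pins the corpus version it started on."""

    def __init__(self, corpus):
        self._corpus = corpus
        self.version = corpus.latest()  # fixed for the whole trajectory

    def search(self, keyword):
        lines = self._corpus.read(self.version)
        return [ln for ln in lines if keyword in ln]


if __name__ == "__main__":
    corpus = VersionedCorpus()
    corpus.publish(["paris is in france"])
    running = Trajectory(corpus)
    corpus.publish(["paris is in france", "paris hosted the 2024 olympics"])
    # The running trajectory still reads version 1; a new one reads version 2.
    print(len(running.search("paris")), len(Trajectory(corpus).search("paris")))
```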

So the robustness story is mostly: narrow the executable contract, sandbox aggressively, normalize the environment, and only parallelize commands whose outputs can be deterministically reduced to the exact sequential result. Anything outside that contract should either fall back to sequential execution or be rejected as unsupported.