DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
Abstract
A lightweight retrieval model called DARE incorporates data distribution information into function representations to improve R package retrieval, achieving superior performance over existing embedding models while enabling more reliable statistical analysis through an R-oriented LLM agent.
Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
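For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is not taken from the paper: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the sketch assumes linear gain with the standard log2 rank discount, with relevance labels supplied in the order the retriever ranked the candidates. The paper's exact evaluation protocol may differ.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # enumerate() is 0-based, so position i has discount log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved candidates,
    listed in ranked order (assumption: all labelled candidates appear
    in this list, so sorting it yields the ideal ranking).
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidate exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical queries with exactly one relevant R function each.
top_hit = ndcg_at_k([1, 0, 0, 0])    # relevant function ranked first
third_hit = ndcg_at_k([0, 0, 1, 0])  # relevant function ranked third
```

Under these assumptions, a query whose single relevant function is ranked first scores 1.0, and the score decays logarithmically as that function moves down the list. A benchmark-level figure is then the mean over all queries.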
Community
We introduce DARE, an embedding model that improves LLM agents on R package retrieval and downstream statistical analysis tasks. DARE outperforms open-source embedding models on R retrieval in both efficiency and accuracy.
Paper: https://arxiv.org/abs/2603.04743
Website: https://ama-cmfai.github.io/DARE_webpage/
Model: https://huggingface.co/Stephen-SMJ/DARE-R-Retriever
Database: https://huggingface.co/datasets/Stephen-SMJ/RPKB
arXivLens breakdown of this paper: https://arxivlens.com/PaperView/Details/dare-aligning-llm-agents-with-the-r-statistical-ecosystem-via-distribution-aware-retrieval-1339-e232a0b2
- Executive Summary
- Detailed Breakdown
- Practical Applications