When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors
Abstract
Large language models exhibit data referencing errors when processing tables, which can be mitigated through critic-based filtering and rejection sampling, with a lightweight 4B-parameter model achieving high detection accuracy.
While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., they incorrectly cite or omit table values even when they understand the table structure. Beyond lowering final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have offered only limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur in all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that using data referencing as a critic signal, through critic-based filtering and rejection sampling, improves answer accuracy by up to 12.0%. Finally, we train a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs and effectively assists inference for larger models.
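To make the two mitigation strategies named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the paper's critic is a trained 4B-parameter model, while `has_dre` below is a hypothetical rule-based stand-in that only checks citations written in an invented `Row.column=value` format. The table layout, the sample traces, and the function names `filter_then_vote` and `rejection_sample` are likewise assumptions made for this example. The toy critic catches incorrectly cited values but not omitted ones.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Table = Dict[str, Dict[str, str]]   # row key -> column name -> cell value
Candidate = Tuple[str, str]         # (reasoning trace, final answer)

# Hypothetical citation format used only in this sketch: Row.column=value
CITATION = re.compile(r"(\w+)\.(\w+)=(-?\d+(?:\.\d+)?)")


def has_dre(reasoning: str, table: Table) -> bool:
    """Toy critic: flag a trace if any cited value disagrees with the table."""
    for row, col, value in CITATION.findall(reasoning):
        if table.get(row, {}).get(col) != value:
            return True
    return False


def filter_then_vote(candidates: List[Candidate], table: Table) -> Optional[str]:
    """Critic-based filtering: drop flagged traces, then majority-vote the rest."""
    kept = [answer for reasoning, answer in candidates
            if not has_dre(reasoning, table)]
    if not kept:
        return None  # every candidate was flagged; the caller decides the fallback
    return Counter(kept).most_common(1)[0][0]


def rejection_sample(generate: Callable[[], Candidate], table: Table,
                     max_tries: int = 4) -> Optional[Candidate]:
    """Rejection sampling: redraw until the critic accepts or the budget runs out."""
    for _ in range(max_tries):
        candidate = generate()
        if not has_dre(candidate[0], table):
            return candidate
    return None


# Invented example data: two of three sampled traces misread Q2 revenue.
TABLE: Table = {
    "Q1": {"revenue": "120", "cost": "80"},
    "Q2": {"revenue": "150", "cost": "90"},
}
SAMPLES: List[Candidate] = [
    ("Q1.revenue=120 and Q2.revenue=190, so growth is 70", "70"),
    ("Q1.revenue=120 and Q2.revenue=190, so growth is 70", "70"),
    ("Q1.revenue=120 and Q2.revenue=150, so growth is 30", "30"),
]

if __name__ == "__main__":
    unfiltered = Counter(answer for _, answer in SAMPLES).most_common(1)[0][0]
    print("majority vote without critic:", unfiltered)
    print("majority vote with critic:", filter_then_vote(SAMPLES, TABLE))
```

In this invented example, a plain majority vote picks the answer built on the misread cell, while voting only over traces the critic accepts picks the answer grounded in the actual table values. A learned critic would replace `has_dre` without changing the surrounding control flow.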
Community
We define, evaluate and mitigate data referencing errors in table tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Synthetic Contrastive Reasoning for Multi-Table Q&A (2026)
- Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies (2026)
- Hint Tuning: Less Data Makes Better Reasoners (2026)
- Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding (2026)
- TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs (2026)
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable (2026)
- SLMJury: Can Small Language Models Judge as Well as Large Ones? (2026)