LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation Paper • 2511.03001 • Published Nov 4, 2025 • 47
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL Paper • 2506.02338 • Published Jun 3, 2025 • 5
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models Paper • 2406.05761 • Published Jun 9, 2024 • 3
Evaluating Robustness of Reward Models for Mathematical Reasoning Paper • 2410.01729 • Published Oct 2, 2024
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics Paper • 2406.14703 • Published Jun 20, 2024 • 2
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents Paper • 2505.15277 • Published May 21, 2025 • 104
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation Paper • 2410.13232 • Published Oct 17, 2024 • 44
Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code Paper • 2409.19715 • Published Sep 29, 2024 • 10
VerifiNER: Verification-augmented NER via Knowledge-grounded Reasoning with Large Language Models Paper • 2402.18374 • Published Feb 28, 2024 • 2
TUTORING: Instruction-Grounded Conversational Agent for Language Learners Paper • 2302.12623 • Published Feb 24, 2023 • 2
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models Paper • 2404.02575 • Published Apr 3, 2024 • 50
Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization Paper • 2209.00930 • Published Sep 2, 2022 • 2
CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification Paper • 2303.03628 • Published Mar 7, 2023 • 2
Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback Paper • 2311.07215 • Published Nov 13, 2023 • 3
Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents Paper • 2310.09343 • Published Oct 13, 2023 • 2