@witcheer on Hugging Face: "new dataset: which local LLM best *drives an agent*? benchmarked 4 models for…"

new dataset: which local LLM best *drives an agent*? benchmarked 4 models for pairing with Hermes Agent (@NousResearch), a CodeAct agent that writes python to call its tools. hardware: RTX 5090, served with llama.cpp. two phases, hybrid:

>>> phase A (synthetic): scored on 4 axes: code-as-action, long-context, instruction-following under Hermes' real ~3.5K-token prompt, and multi-step loops. top was a near-tie (within noise): an 18B frankenmerge (Qwopus) edged Qwen3.6-27B, and Hermes' own 36B came LAST.

>>> phase B (real harness): installed Hermes, ran the top 3 through 14 multi-step tasks x3 repeats. the tie broke — and an efficiency gap appeared:

model          success | turns     | tokens
Qwen3.6-27B    100%    | 3.0 turns | 364 tok
Qwopus-18B     85.7%   | 3.6 turns | 870 tok
Nemotron-30B   85.7%   | 4.4 turns | 1334 tok

Qwen3.6-27B scores 100% AND uses 2.4-3.7x fewer tokens than the other two (364 vs 870 and 1334). a synthetic test can't see that gap; only the real agent loop can. verdict: Qwen3.6-27B for local Hermes.
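a quick sanity check of the efficiency claim, as a minimal sketch. the numbers are copied from the phase B table; the `results` dict and the helper names are made up for illustration and are not part of the dataset or of Hermes. it also assumes the success percentage is computed over all 14 x 3 = 42 runs, which the post does not state explicitly.

```python
# Phase B figures copied from the table in the post.
# The dict layout and function names are illustrative, not the dataset schema.
results = {
    "Qwen3.6-27B": {"success_pct": 100.0, "turns": 3.0, "tokens": 364},
    "Qwopus-18B": {"success_pct": 85.7, "turns": 3.6, "tokens": 870},
    "Nemotron-30B": {"success_pct": 85.7, "turns": 4.4, "tokens": 1334},
}

# Assumption: success is measured over every run (14 tasks x 3 repeats).
TOTAL_RUNS = 14 * 3


def token_ratio(model: str, baseline: str = "Qwen3.6-27B") -> float:
    """How many times more tokens `model` used than `baseline`."""
    return results[model]["tokens"] / results[baseline]["tokens"]


def implied_passes(model: str) -> int:
    """Number of passing runs implied by the reported percentage."""
    return round(results[model]["success_pct"] / 100 * TOTAL_RUNS)


if __name__ == "__main__":
    for name in results:
        print(
            f"{name:<13} {token_ratio(name):.1f}x tokens vs Qwen, "
            f"~{implied_passes(name)}/{TOTAL_RUNS} runs passed"
        )
```

under that assumption, 85.7% works out to 36 of 42 runs, and the token ratios come out at roughly 2.4x and 3.7x, which matches the range quoted above.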

dataset: witcheer/hermes-pairing-bench
collection: witcheer/rtx-5090-benchmark-rig-6a17e365b534abb474250e11
