Hermes Tool Loops in v1.1.5

#4
by StableQuant - opened

This is the discussion about the tool loops that occur in v1.1.5 when using Hermes. OpenHands and OpenCode seem to be unaffected, so I think it's a harness-specific thing. v1.1.5 uses robust tool-call error recovery logic, so this shouldn't happen, but since it still does, I will look into it manually (installing Hermes myself and testing).

So far, tool-calling loops and cron tool-call loops have been reported.

I was testing in Hermes. With 1.1.5, Hermes cannot write code to a local file: it times out 100% of the time (15 tries out of 15). It can generate and display code, but it probably cannot use some specific tools such as the file-write tool.
Toggling streaming or reasoning does not help (in the logs the LLM finishes the task in 12 s, while Hermes times out after 300 s, so the response is clearly lost somewhere).

I am using this with OpenCode, the latest vLLM and Qwen3.6 27B, and I'm noticing that it sometimes stops abruptly.
When it does, the text looks like it's about to call some tool, but it never does and just stops instead ...

@ABLomas
Did you notice this on specific code, or does it happen on random files?
I ask because I discovered an error myself yesterday: it coded fine for hours in OpenHands, but when working on a certain piece of code it stalled. Probably the same reason.
Could you tell me whether other templates, like froggeric v16 for example, handled this part successfully, or does it happen with every template?

Ok, I spun up my own Hermes instance now. Tool calling is indeed totally broken with the current template version, independent of editing code files. I'm working on this now. No further information needed (but you can still post it if you want).

My setup:
--host 0.0.0.0 -fa 1 --fit-ctx 262144 --min-p 0.0 --fit 1 -b 2048 -ub 512 --no-mmap -ctk q8_0 -ctv q8_0 --jinja -m Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.gguf --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs "{\"preserve_thinking\":true}" --no-mmproj -np 1 --alias Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --reasoning-budget 4096 --metrics --reasoning-budget-message "[SYSTEM ALERT: Reasoning budget exceeded. I am stuck in a loop or overcomplicating. I must stop IMMEDIATELY and use the ask_followup_question tool to notify the user and ask for guidance.]" --chat-template-file qwen3.6_chat_template.txt -to 900

Model - https://huggingface.co/byteshape/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-IQ4_XS-4.15bpw.gguf

Hermes - v0.14.0 (2026.5.16)

Yesterday I tried version v1.1.5 with the Pi coding agent, and tools seemed to be called incorrectly. I don't know why. 😁
I had to revert to froggeric's v19.

@StableQuant I just wanna say that froggeric's v19 template is the best for Hermes; v16 has loops as well.

@StableQuant I found a template that works with hermes and opencode without any issues so far - https://gist.github.com/fakezeta/9e8e039c60332fcb143c6e805558afe0
Maybe it can help you improve your template.

Hi @szwedek, thank you for posting it.
At first glance it does indeed seem to be a clean template.

About my current process:
I came to the tentative conclusion that this whole problem is not one a simple chat template can solve.
It's basically about language incompatibility.

Qwen was trained on normal text and heavily on XML-based structured text.
Current tools like Hermes expect it to use the JSON-based OpenAI standard for tool calling.

The problem with Qwen is that it's not trained on that; it's trained on XML. JSON will work to some degree if you tell it explicitly to use JSON (like the Hermes system prompt does), but as soon as you use thinking (Alibaba stated that structured output in thinking mode is not supported) or hit high context load, it falls back to its trained behaviour of outputting XML.

There are approaches that pack JSON tool calls into XML tags, which seem to be successful up to a point but don't seem to fix it completely, since JSON (a complex, brittle structure) is harder for LLMs to generate than XML (an easy, simple structure), even more so when they are trained mostly on XML.

The template you linked tells the LLM to make its tool calls XML-based, which will work with vLLM, since vLLM has its own "translation logic" built in when a certain Qwen3 XML parser switch is used.

But for llama.cpp I expect it to become unstable as well under high context load.

My current thinking is moving away from a simple chat template solution and towards a chat template + middleware solution. If Qwen wants to speak XML but the tools want JSON, then why not just give both of them what they want?!
So it's more of an ecosystem incompatibility, and when you think about it there is also a geographic/political dimension between Qwen and OpenAI.

My current approach is a middleware plus a clean Qwen chat template with XML. You can host the middleware with Docker, for example, and it does the translation in milliseconds, not noticeable to the user.
This would solve it, but I'm still in the early experimentation phase.
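
To make the translation idea concrete, here is a minimal sketch of the XML-to-JSON direction. It assumes the model emits Qwen-style `<tool_call><function=NAME><parameter=KEY>VALUE</parameter></function></tool_call>` blocks; that layout, the function name `xml_to_openai_tool_calls`, and the `call_` ID scheme are illustrative assumptions, not the actual middleware discussed in this thread.

```python
import json
import re
import uuid

# Assumed XML layout (hypothetical, not confirmed by the thread):
# <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def _coerce(raw: str):
    """Use the JSON value when the text parses as JSON, otherwise keep the text."""
    text = raw.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def xml_to_openai_tool_calls(model_output: str) -> tuple[str, list[dict]]:
    """Return (remaining text, OpenAI-style tool_calls) for one model message."""
    tool_calls = []
    for name, body in TOOL_CALL_RE.findall(model_output):
        args = {key: _coerce(value) for key, value in PARAM_RE.findall(body)}
        tool_calls.append({
            # The ID is generated here, so the model never has to produce one.
            "id": f"call_{uuid.uuid4().hex[:24]}",
            "type": "function",
            # The OpenAI format carries arguments as a JSON-encoded string.
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    remaining = TOOL_CALL_RE.sub("", model_output).strip()
    return remaining, tool_calls
```

Because the `id` is created by the middleware rather than by the model, the harness always receives a well-formed ID. A real implementation would probably coerce argument types from each tool's JSON schema instead of guessing from the text.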

I expect the chat template you linked to become unstable at some point too, since Hermes uses tool call IDs, which Qwen doesn't natively use or understand; at some point it begins to hallucinate them, which then confuses Hermes. So the template might be stable for some low-load use cases but will probably become unstable as soon as you put high-load work on it. But I might be wrong, keep me updated on your mileage.

So, to wrap up my post: this whole template thing is a true rabbit hole, and it's more than just a simple "non-deterministic to deterministic programmatic" question; it's about ecosystem compatibility.

For me the opposite is true. vLLM and SGLang have a worse chat template implementation.
I was never able to make vLLM work properly with Qwen 2.6, no matter the template. Either the thinking gets broken, or it stops abruptly, or the tool calls are wrong. Various different chat template issues.
With the latest llama.cpp and this config, Qwen3.6 27B MTP works perfectly for me:

  /app/llama-server \
      --hf-repo $MODEL_REPO \
      --hf-file $MODEL_FILE \
      --port 8000 \
      --alias Qwen3.6-27B \
      --jinja \
      --ctx-size 262144 \
      -ngl 99 \
      --flash-attn on \
      -ctk q8_0 -ctv q8_0 \
      --cont-batching \
      --parallel 3 \
      --batch-size 4096 \
      --metrics \
      --threads 4 \
      --mlock \
      --no-mmap \
      --spec-type draft-mtp \
      --spec-draft-n-max 3

@meualsan
I see. But did you try the newest corresponding Qwen parser flags with vLLM and SGLang?
Also note that even if you use these flags, they don't fix dev role handling, tool calls inside thinking tags, etc., which are flaws in the original Qwen template.
So with vLLM and SGLang you would need the flags (to do the XML-to-JSON translation) as well as a fixed Qwen template.
For vLLM, for example, there is the --tool-call-parser hermes flag for Hermes.

Also, SGLang and vLLM use guided decoding in their backend. That means: once a tool call is requested, they force the model to output valid JSON via a predefined "library"; tokens that don't fit get rejected.
This works better than just telling the model to produce valid JSON, but also only until you reach high context load. Then it becomes unstable as well.

For llama.cpp there is currently no such thing at all.
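
As a toy illustration of the "unfitting tokens get rejected" idea, the sketch below masks out every candidate token the grammar does not currently allow and then picks the best remaining one. The function name, the tiny vocabulary and the `allowed` set are hypothetical; this is not how vLLM or SGLang implement guided decoding internally, only the general principle.

```python
import math


def constrained_greedy_step(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the grammar currently allows."""
    # Disallowed tokens get a score of -inf, so they can never win.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("the grammar allows no token from this vocabulary")
    return best


# Hypothetical scores: the model would prefer to open an XML tag,
# but a JSON grammar only permits "{" at this position.
step_logits = {"<tool_call>": 2.1, "{": 1.7, "Sure": 0.3}
chosen = constrained_greedy_step(step_logits, allowed={"{"})
```

Here `chosen` is `"{"` even though `<tool_call>` had the higher raw score, which is the mechanism that keeps the output inside the expected format.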

llama.cpp seems to be faster with new integrations, for example I use a turboquant fork which expands KV-cache up to 8x vs Q4 cache, since two weeks, its a dream.
vLLM doesn't have this yet. From my understanding, llama.cpp is more community-driven, while vLLM and SGLang are industry-driven.
llama.cpp is more fluid and faster, but vLLM and SGLang are business.
People get paid to include fixes there to make Qwen etc. available to run in hyperscaler and AI clouds for stable business use.
With SGLang, for example, you could serve multiple users on a single RTX 3090 with a small Qwen model and get combined decoding speeds of around 2.5k tokens/s, whereas llama.cpp is mostly single-user.
Up to 4 users it's fine, but any more gets slow really quickly: total throughput is a few hundred tokens/s vs. 2-3k in SGLang with 16 users at the same time.

Unfortunately for us VRAM-poor xD that's not good news currently.
But it gets shared anyway. New stuff gets exchanged both ways over time.

turboquant fork which expands KV-cache up to 8x vs Q4 cache

Please share 🙏

https://github.com/TheTom/llama-cpp-turboquant

-ctk q8_0 -ctv turbo4

Thanks but that's not 8x vs 4 bit, more like x2.6 vs 16 bit

for example I use a turboquant fork which expands KV-cache up to 8x vs Q4 cache, since two weeks, its a dream.

I messed up two numbers here, sorry folks. Note to myself: check the numbers more carefully before posting.
What I meant was: turbo4 offers lossless quality while needing 4x less KV cache than 16-bit, unlike regular Q4 quantization.
Also, Google experimented with turboquant down to 1.5 bits, which is the 8x I meant, but that's not released yet and it's not lossless.

For my own use I run turbo4 for the keys and turbo3 for the values.
That's the recommended setup because keys are more sensitive than values. But you could also use turbo3 for both.
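
The ratios mentioned above can be reproduced with simple arithmetic. The sketch below compares a mixed K/V cache against a 16-bit cache. The 8.5 bits for q8_0 follow from its block layout (32 int8 values plus one fp16 scale); the 4 and 3 bits for turbo4 and turbo3 are assumed nominal widths that ignore any per-block overhead, so treat the results as rough estimates.

```python
def kv_compression_vs_fp16(key_bits: float, value_bits: float) -> float:
    """Ratio of a 16-bit K+V cache size to a cache with the given bits per element."""
    return (16.0 + 16.0) / (key_bits + value_bits)


Q8_0_BITS = 8.5    # 34 bytes per block of 32 values
TURBO4_BITS = 4.0  # assumption: nominal width only
TURBO3_BITS = 3.0  # assumption: nominal width only

q8_keys_turbo4_values = kv_compression_vs_fp16(Q8_0_BITS, TURBO4_BITS)
turbo4_both = kv_compression_vs_fp16(TURBO4_BITS, TURBO4_BITS)
turbo4_keys_turbo3_values = kv_compression_vs_fp16(TURBO4_BITS, TURBO3_BITS)
```

Under these assumptions, `-ctk q8_0 -ctv turbo4` comes out at about 2.56x (in line with the "x2.6 vs 16 bit" remark), turbo4 for both at 4x, and turbo4 keys with turbo3 values at roughly 4.6x.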

I use the following Docker: https://hub.docker.com/r/dexogen/atomic-llama-cpp-turboquant
From repo: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant

But to come back to my project:

I'm making progress.
As I write this, Qwen3.6 in Cline is updating my test suite to also cover tool calls inside thinking. My conclusion is that you can't completely prevent the model from doing it, even with system prompts in the chat template. It just doesn't have the awareness of whether it's thinking or not, so it will still happen. So my new approach is simple: when a tool call inside thinking is detected, it gets stripped out of the thinking block and put into the normal text, so nothing gets confused or crashes. So far I'm at ~560 tests for the project, guys xD
I'm looking forward to releasing something soon, but first it needs to undergo internal manual testing,
which will take another few days.
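
A minimal sketch of the "strip it out of the thinking block" step could look like this. It assumes `<think>...</think>` and `<tool_call>...</tool_call>` as the tag names and a complete (non-streamed) message; `hoist_tool_calls` is a hypothetical helper, not code from the project.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def hoist_tool_calls(message: str) -> str:
    """Move <tool_call> blocks found inside <think> to right after the thinking block."""

    def _fix(match: re.Match) -> str:
        thinking = match.group(1)
        calls = TOOL_CALL_RE.findall(thinking)
        if not calls:
            # Nothing to move: leave the thinking block exactly as it was.
            return match.group(0)
        cleaned = TOOL_CALL_RE.sub("", thinking).strip()
        return f"<think>\n{cleaned}\n</think>\n" + "\n".join(calls)

    return THINK_RE.sub(_fix, message)
```

This sketch only handles a properly closed thinking block. A model that opens a tool call inside thinking and never emits `</think>` would need extra handling.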

The streaming process, however, is straightforward: unless it detects tool call tags or similar (which then get handled inside the buffer), it pushes the text right through.
The whole conversion is done in milliseconds using a small Python-based software stack, and it will also be shipped as a Docker container, so you can just give it your LLM_Endpoint port and your target port, and the stream gets handled from there right inside the Docker container.
I decided against a Go-based implementation since it would add complexity, and I'm familiar with Python but not Go. Go would push the translation into the sub-millisecond range, but it wouldn't be as friendly to community additions later if someone wants to add something. In the end, whether the conversion takes 2 milliseconds or 0.2 milliseconds is not noticeable to the user.
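
The pass-through-with-buffer behaviour described above can be sketched as a small filter. It forwards text immediately and holds back only what might belong to a tool call, including a chunk that ends in the middle of an opening tag. The class name, the tag strings and the `feed`/`flush` interface are assumptions made for illustration.

```python
OPEN_TAG = "<tool_call>"
CLOSE_TAG = "</tool_call>"


def _partial_tag_len(text: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of OPEN_TAG."""
    for size in range(min(len(text), len(OPEN_TAG) - 1), 0, -1):
        if text.endswith(OPEN_TAG[:size]):
            return size
    return 0


class ToolCallStreamFilter:
    """Forward plain text right away; buffer only possible tool-call content."""

    def __init__(self) -> None:
        self.buffer = ""

    def feed(self, chunk: str) -> tuple[str, list[str]]:
        """Return (text safe to forward now, complete tool-call blocks)."""
        self.buffer += chunk
        text_out, calls = "", []
        while self.buffer:
            start = self.buffer.find(OPEN_TAG)
            if start == -1:
                # Hold back a tail that could be the start of an opening tag.
                keep = _partial_tag_len(self.buffer)
                cut = len(self.buffer) - keep
                text_out += self.buffer[:cut]
                self.buffer = self.buffer[cut:]
                break
            text_out += self.buffer[:start]
            end = self.buffer.find(CLOSE_TAG, start)
            if end == -1:
                # The tool call is still incomplete: wait for more chunks.
                self.buffer = self.buffer[start:]
                break
            end += len(CLOSE_TAG)
            calls.append(self.buffer[start:end])
            self.buffer = self.buffer[end:]
        return text_out, calls

    def flush(self) -> str:
        """Return whatever is still buffered when the stream ends."""
        rest, self.buffer = self.buffer, ""
        return rest
```

Each completed block returned by `feed` can then be handed to the XML-to-JSON translation, while ordinary text reaches the client with no added delay beyond the few characters held back at a chunk boundary.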

@szwedek
May I ask whether this other template fixed it for you, or do you still notice hangs from time to time?
I also want to note to everyone that froggeric just released v20.
He is also slowly acknowledging that it can't be solved for every tool with a chat template alone. From Error #24: "This happens because raw null bytes (\u0000) in tool outputs (like binary PE headers) truncate the prompt string inside C++ backends like llama.cpp. The template can't fix this directly, the harness needs to sanitize or strip null bytes in before returning tool outputs to the model. I'll close this thread for now."
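
The sanitizing step that quote asks of the harness is small. A minimal sketch, assuming tool output arrives as either `str` or raw `bytes` (the helper name `sanitize_tool_output` is hypothetical):

```python
from typing import Union


def sanitize_tool_output(output: Union[str, bytes]) -> str:
    """Decode tool output and drop NUL bytes before it is returned to the model."""
    if isinstance(output, bytes):
        # Binary data (e.g. a PE header) may not be valid UTF-8; replace bad bytes.
        output = output.decode("utf-8", errors="replace")
    return output.replace("\x00", "")
```

Stripping the bytes loses information about the binary content, so a harness might prefer to summarize or hex-encode such output instead; the point here is only that no raw `\u0000` reaches the prompt string.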

As for the project, I'm done with mock tests, basic testing and integration. The first live tests went well.
Now I will have Claude take a last look over it and maybe fix some edge-case bugs that local Qwen can't find.

Then I will release v1.0 on github.

For me it's still clear: the chat template can't fix tool calls, since the model itself is trained on a specific XML tool call format. You also can't forbid it from opening tool calls inside the thinking tag. Froggeric just announced that he added a system prompt to fix that in v20, which I already tried and which doesn't work.
However, I have much respect for his ongoing effort, even if the AI sometimes seems to give him wrong diagnoses about the causes of chat template faults. We are only as good as the tools we use.

So I just want to ask: are you guys still experiencing problems, so a solution is needed, or is it stable for now with template XY?

As for my coding, Cline also experienced regular hiccups with local Qwen; it's just the model. It badly needs the middleware until Qwen releases new models with upgraded training data quality. Until that happens (and everyone hops on these new models), a middleware will do.

At least for me.

What do you guys think about it?

@StableQuant I found another template - https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template

It seems to be stable with Hermes and Opencode even if context grows.

@szwedek
That's good news, actually.
I have seen this project and template as well and noticed it gaining popularity, but I haven't tested it myself yet.
I will give it a try. If it's stable for now, that's good news for every Hermes user.
I still think complex code, and especially specific code files and contexts around 150k, will cause it to start failing.
For example, it might code fine with Python (like it did for me with v1.1.5 and OpenHands), but as soon as it had to edit a JSON file it collapsed.
120k context was usually the point where it started to become unstable for me with general coding.
At 150-180k it became unusable for coding.
But below 80k it was usually fine, except for specific code files.

@szwedek
As I thought, the XML and JSON demo task partly crashed during file creation.
I tested it in Hermes.
I got "Chat format: peg-negative" in the llama.cpp server from time to time.
That means llama.cpp can't parse parts of this chat template. That's because it was made mostly with vLLM in mind: vLLM has full Jinja2, while llama.cpp only has minja, a minimalistic version, which makes chat template creation for llama.cpp hard in the first place.

It also started looping on an error at around 30k context: it started its fixing loop and didn't make progress, repeating the exact same fixing steps again and again. "I found the problem, I need to fix file xy".
That went on until I stopped it manually at 60k.
I'm not sure whether that's a Hermes error or caused by the chat template, but I never had that kind of loop with Cline, OpenCode or OpenHands.

@szwedek
Interesting, this template is indeed very stable with OpenCode.
So far it has worked fine. There were a few peg-negative messages, but it ran totally fine.

I'm at 90k context at the moment; file conversions, including the test suite, into very different formats seem to run fine.

I wouldn't say it's perfect, but it's the best I have experienced yet.

I will push it to 200k and see.
I will take a look at the raw token stream as well.
I will edit this message then, just for reference.

@StableQuant make sure that you set auto_disable_thinking_with_tools=false for the template before you start testing, but you can test both options for a longer context.
