
Alright, I’m tired of all the thinking issues with Qwen 3.6. I’ve written about how to fix this for the most part. However, there are certain bugs in llama.cpp and Ollama that may or may not be fixed:
- llama.cpp doesn’t detect when a connection has been broken, and keeps running until the request completes – that’s silly, since you may have hit “stop” in your harness, but Ollama ignores it.
- llama.cpp’s API doesn’t have a /stop command or cancellation token.
- Ollama doesn’t attempt to stop the processing on the llama.cpp side when a connection drops, exacerbating the issue. There was a simple PR to fix this, but the Ollama team nixed it. I’ve asked why but have yet to receive a response.
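For context, here is a rough sketch of what "hitting stop" looks like from the client side. It assumes Ollama's streaming /api/generate output, which is one JSON object per line with "response" and "done" fields. The names read_until_stopped and should_stop are made up for illustration. All it does is stop reading; whether the backend actually stops generating is exactly the problem described in the bullets above.

```python
import json


def read_until_stopped(lines, should_stop):
    """Collect streamed text chunks until the stream ends or should_stop() is true.

    `lines` is any iterable of newline-delimited JSON (bytes or str), such as
    the response object urllib returns for a streaming request.
    """
    chunks = []
    for raw in lines:
        if should_stop():
            break  # caller should close the connection after this returns
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        chunks.append(event.get("response", ""))
        if event.get("done"):
            break  # server signalled the end of the reply
    return "".join(chunks)
```

With urllib, you would call this inside a "with urllib.request.urlopen(req) as resp:" block and pass resp as lines. Leaving the block closes the socket, which is the only stop signal the client can send.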
But do you really need thinking mode? You would think so. But Qwen already reasons deeply, so you may not actually need the thinking commentary. Wait, reasoning and thinking are different? Yes, sir! I asked Copilot to summarize this so I don’t have to type as much:
Reasoning in an LLM refers to the model producing token sequences that follow logical or inferential patterns (deduction, induction, stepwise inference) to get from premises to a conclusion. Thinking usually describes the model emitting intermediate text (notes, chains of thought, or refinements) that makes its process visible and lets it explore alternative answers. Reasoning is the type of structured inference; thinking is the manifestation or workflow (often exposed as step-by-step output) used to arrive at or refine those inferences. Neither implies human-style understanding: both are emergent behaviors of next-token prediction, with tradeoffs in cost, latency, and reliability.
So, unless you need deep thinking from a model that already has high-quality reasoning baked in, you may be able to run with thinking turned off for most operations. This can lead to better performance, faster answers, and an end to thinking loops. From what I’ve been reading and experiencing, running with thinking off all the time works well. If you don’t like an answer, point the model in a different direction and let it take it from there. At least in that case, you’re failing quickly.
So, I’ve experimented with this, and the example is below. Simply adding <think></think> to the prompt template prevents the chain-of-thought (CoT) tokens from being sent to the requestor. From what I’m reading, this doesn’t actually turn off thinking, but it does eliminate the verbose responses. It’s hard to find a definitive answer, but according to my research:
Putting <think></think> in the modelfile usually only controls whether the model shows its intermediate steps, not whether any internal deliberation occurs. In many LLM implementations, “thinking” is just the model emitting extra tokens (a chain-of-thought), so closing the tag suppresses that visible trace, but the model may still perform the same hidden activations and context conditioning behind the scenes or use a different decoding path. In other setups, the tag can also be a hard signal that disables the extra stepwise generation entirely, so the only reliable way to know is to compare outputs (accuracy, confidence, and latency) with and without the tag.
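If you want to run that comparison yourself, here is a minimal sketch rather than a proper benchmark. It assumes Ollama is listening on its default local port and uses the non-streaming /api/generate endpoint. The helper names ollama_generate and compare_models are mine, and the model names you pass in are whatever you have installed. It only measures wall-clock time, generated token count (eval_count), and whether a <think> tag leaked into the reply. Judging accuracy is still up to you.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def ollama_generate(model, prompt):
    """Send one non-streaming request to Ollama and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def compare_models(prompt, models, generate=ollama_generate):
    """Run the same prompt against each model and collect simple metrics."""
    results = []
    for model in models:
        start = time.perf_counter()
        reply = generate(model, prompt)
        elapsed = time.perf_counter() - start
        text = reply.get("response", "")
        results.append({
            "model": model,
            "seconds": elapsed,                      # wall-clock latency
            "eval_count": reply.get("eval_count"),   # tokens generated, if reported
            "chars": len(text),
            "shows_think_tag": "<think>" in text,    # did a visible trace leak through?
            "text": text,
        })
    return results
```

For example, compare_models("What is 17 * 23?", ["your-stock-qwen-tag", "Qwen3.6NoThink"]) returns one row per model. The first name is a placeholder for whatever tag your unmodified model uses. Run it over a handful of prompts before drawing conclusions, since a single run is noisy.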
So here’s the modified prompt part of the modelfile. You can use my online modelfile editor to insert this if you like:
TEMPLATE """
{{ .System }}<think></think>
{{ .Prompt }}
"""
And you can download my modelfile here.
And you can install it as follows:
ollama create Qwen3.6NoThink -f QwenNoThink.txt
Caveats:
- Sometimes it starts thinking again and I don’t know why.
- Even with thinking disabled, if it starts thinking again, it can get back into its loops. I think this is a bug in Ollama. Hey, it’s only at version 0.30.x (when I wrote this), so it’ll get better. Or something better will be released… Likely also based on llama.cpp for the time being.
- While I’ve had very good luck with this modelfile, I still find myself restarting the Ollama server from time to time. These could be harness bugs as well.
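Since thinking can sneak back in, a client-side guard is a cheap safety net. This is a small sketch, assuming the trace shows up inline between <think> and </think> tags in the reply text. The function name strip_think is made up for illustration. It removes complete blocks as well as an unclosed block at the end of a reply, such as one cut off mid-loop, and it reports whether any tag was seen so your harness can log it or retry.

```python
import re

# A complete <think>...</think> block, including any newlines inside it.
_CLOSED_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
# An opening tag that never closes, running to the end of the reply.
_OPEN_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(text):
    """Return (clean_text, had_thinking) with any inline thinking trace removed."""
    had_thinking = "<think>" in text
    cleaned = _CLOSED_THINK.sub("", text)   # drop finished blocks first
    cleaned = _OPEN_THINK.sub("", cleaned)  # then drop a dangling, unfinished one
    return cleaned.strip(), had_thinking
```

If had_thinking comes back true and the cleaned text is empty, the model most likely got stuck in a loop and never produced an answer, so that is a good moment to retry or fail fast.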
Let me know what you think! Enjoy!