When Qwen3 arrived in April 2025, a lot of Ollama users were immediately confused. Responses were suddenly much longer, with elaborate reasoning traces appearing before the actual answer. Nothing was broken. Qwen3 ships with Ollama thinking mode enabled by default, and it was doing exactly what it was designed to do. This guide explains what thinking mode is, which models support it, how to control it across every interface Ollama provides, and when you should turn it off entirely.
What Is Thinking Mode in Ollama?
Thinking mode is Ollama’s implementation of chain-of-thought reasoning. When enabled, a model works through a problem step by step before producing a final answer. The internal reasoning process, sometimes called the reasoning trace or scratchpad, is visible in the output and separated from the final response.
This is not a new concept. DeepSeek R1’s chain-of-thought reasoning was always active with no way to turn it off. Qwen3 changed that. It introduced a proper toggle so you can enable thinking for complex tasks and disable it when you just want a fast, direct answer. That flexibility is the key difference between the two approaches.
In practice, a thinking-mode response outputs a <think> block first, working through the model’s reasoning, then delivers the final answer. With thinking disabled, you get the answer directly with no preamble.
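To make that output shape concrete, here is a minimal sketch that splits raw model text into its reasoning trace and final answer. It assumes the trace is wrapped in a single `<think>...</think>` pair at the start of the output; the helper name `split_thinking` is hypothetical and not part of Ollama.

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
_THINK_BLOCK = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from raw model output.

    If no leading <think> block is present, the trace is an empty string.
    """
    match = _THINK_BLOCK.match(raw)
    if match is None:
        return "", raw.strip()
    return match.group(1).strip(), raw[match.end():].strip()


sample = "<think>\n60 mph * 2.5 h = 150 miles.\n</think>\n\nThe train travels 150 miles."
trace, answer = split_thinking(sample)
print(trace)
print(answer)
```

With thinking disabled there is no leading block, so the same function simply returns an empty trace and the untouched answer.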
Which Ollama Models Support Thinking Mode?
Thinking mode is currently part of the Qwen3 model family. All Qwen3 sizes support the feature, and thinking is enabled by default across all of them. Here is a reference table covering the available sizes and what to expect on typical hardware:
| Model | Min RAM (thinking on) | Thinking quality | Best for |
|---|---|---|---|
| qwen3:0.6b | 2 GB | Basic | Testing only |
| qwen3:1.7b | 3 GB | Limited | Low-power devices |
| qwen3:4b | 5 GB | Moderate | Everyday reasoning on 8 GB machines |
| qwen3:8b | 8 GB | Good | Most users, best value |
| qwen3:14b | 12 GB | Strong | Complex coding and maths |
| qwen3:32b | 24 GB | Very strong | Demanding reasoning tasks |
| qwen3:235b | 128 GB+ | Excellent | Server-grade hardware only |
If you have been using Qwen2.5 on Ollama, Qwen3 is a significant step up in reasoning capability. For most users on consumer hardware, qwen3:8b gives the best balance of quality and speed. The 0.6B and 1.7B models can think, but the reasoning depth at those sizes is limited. They are useful for testing the feature rather than relying on it for serious work.
Models outside the Qwen3 family, such as Llama 3, Mistral, and Phi-4, do not support thinking mode. Passing think: true to one of these models will not produce a reasoning trace. Depending on your Ollama version, the parameter may be silently ignored or rejected with an error, so always verify that your model is in the Qwen3 family before relying on it.
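One way to apply that check in code is a small guard that only adds the think field for models this guide lists as thinking-capable. This is a sketch under assumptions: the family list is taken from the table above rather than queried from Ollama, and the names `THINKING_FAMILIES`, `supports_thinking`, and `think_option` are hypothetical.

```python
# Model families this guide treats as thinking-capable (an assumption taken
# from the table above, not a list queried from Ollama itself).
THINKING_FAMILIES = ("qwen3",)


def supports_thinking(model: str) -> bool:
    """Return True if the model tag belongs to a known thinking family."""
    # "qwen3:8b" -> "qwen3"; a tag without ":" is treated as the family name.
    family = model.split(":", 1)[0].lower()
    return family in THINKING_FAMILIES


def think_option(model: str, wanted: bool) -> dict:
    """Build the extra request fields, omitting "think" for other models."""
    return {"think": wanted} if supports_thinking(model) else {}


print(think_option("qwen3:8b", True))
print(think_option("llama3:8b", True))
```

Custom model names built from a Qwen3 base would need to be added to the tuple by hand, since the check only looks at the part of the tag before the colon.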
How to Enable or Disable Ollama Thinking Mode
Ollama gives you three ways to control thinking mode: a command-line flag when starting a session, a command within an interactive session, and a parameter in API calls. All three are straightforward once you know where to look.
CLI: the --think flag
The simplest approach is to pass a flag when you run the model:

    ollama run qwen3:8b --think

This enables thinking mode explicitly. Since thinking is on by default for Qwen3, this flag is mainly useful for clarity or for future models where the default might differ.
To disable thinking and get direct answers without reasoning traces, set the same flag to false:

    ollama run qwen3:8b --think=false

With --think=false, the model skips the reasoning trace entirely and responds like a standard language model. For quick questions, summarisation, or any task where speed matters more than depth, this is the setting to use.
Interactive session: /set think and inline switching
If you are already inside an Ollama interactive session, you can toggle thinking mode without restarting:

    /set think
    /set nothink

You can also switch on a per-message basis using Qwen3's soft switches at the start of your prompt. Place /think at the beginning of a message to enable thinking for that response, or /no_think to disable it:

    /think Explain the algorithmic complexity of quicksort versus mergesort
    /no_think What is the capital of France?

This is particularly useful in multi-turn conversations where you want deep reasoning for some questions and fast answers for others, all within the same session without restarting.
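If you build prompts programmatically, the same per-message switching can be sketched as a tiny helper. It assumes the soft-switch spellings `/think` and `/no_think` from Qwen3's own documentation; `with_switch` is a hypothetical name used only for illustration.

```python
def with_switch(prompt: str, think: bool) -> str:
    """Prefix a prompt with a per-message thinking switch."""
    # Assumption: Qwen3 reads these tokens at the start of a user message.
    switch = "/think" if think else "/no_think"
    return f"{switch} {prompt.strip()}"


questions = [
    ("Explain the algorithmic complexity of quicksort versus mergesort", True),
    ("What is the capital of France?", False),
]
for text, think in questions:
    print(with_switch(text, think))
```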
API: the think parameter
When calling Ollama via the REST API, add a think field to your request body:

    curl http://localhost:11434/api/generate \
      -d '{
        "model": "qwen3:8b",
        "prompt": "A train travels at 60 mph for 2.5 hours. How far does it travel?",
        "think": true
      }'

To disable thinking in an API call:

    curl http://localhost:11434/api/generate \
      -d '{
        "model": "qwen3:8b",
        "prompt": "Summarise this paragraph in one sentence.",
        "think": false
      }'

In Python using the Ollama library:

    import ollama

    # Ask for a reasoning trace alongside the final answer.
    response = ollama.generate(
        model='qwen3:8b',
        prompt='Walk me through solving this calculus problem step by step.',
        think=True
    )
    print(response['thinking'])  # the reasoning trace
    print(response['response'])  # the final answer

The response object includes a separate thinking field containing the reasoning trace and a response field with the final answer, so you can use each independently in your application. If you are generating structured JSON output from Ollama, pass think: false explicitly, since thinking mode can interfere with strict schema adherence.
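If you would rather not depend on the client library, the same call can be sketched with the standard library alone. This version assumes the default local endpoint from the curl examples and sets "stream": false so the server returns a single JSON object; the function names are hypothetical, and nothing is sent until you call `generate` against a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model: str, prompt: str, think: bool) -> urllib.request.Request:
    """Prepare (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "think": think, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(body: dict) -> tuple[str, str]:
    """Return (thinking, answer); thinking is "" when the field is absent."""
    return body.get("thinking") or "", body.get("response", "")


def generate(model: str, prompt: str, think: bool) -> tuple[str, str]:
    """Send the request to a running Ollama server and split the reply."""
    request = build_generate_request(model, prompt, think)
    with urllib.request.urlopen(request) as reply:
        return split_reply(json.load(reply))
```

Keeping request building and reply parsing in separate functions means both can be checked without a server, and an application can log the trace while showing users only the answer.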
The --hidethinking Flag
There is one more flag that most guides skip over entirely: --hidethinking. It works alongside --think and does something subtly different from simply disabling reasoning.
With --hidethinking, the model still performs its full chain-of-thought reasoning. You keep the depth and accuracy benefits of thinking mode, but the reasoning trace is left out of the output and only the final answer is returned.

    ollama run qwen3:8b --think --hidethinking

This is particularly valuable when scripting the CLI or building applications where end users should only see clean answers. The model reasons just as thoroughly, but the thinking stays private.
Think of it this way: --think enables the reasoning process and --hidethinking controls whether that process is visible in the output. If you want reasoning quality without reasoning traces appearing in your application’s response, these two flags work together to achieve exactly that.
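For scripts that shell out to the CLI, the flag combinations can be sketched as a small argument builder. It assumes the boolean form `--think=false` for disabling reasoning (worth confirming with `ollama run --help` on your version); `build_run_command` is a hypothetical helper, not something Ollama ships.

```python
import shlex


def build_run_command(model: str, prompt: str, think: bool = True, hide: bool = False) -> list[str]:
    """Build an argv list for a one-shot `ollama run` call."""
    argv = ["ollama", "run", model]
    # Assumption: the flag accepts an explicit boolean value (--think=false).
    argv.append("--think" if think else "--think=false")
    if think and hide:
        # Hiding only makes sense when there is a trace to hide.
        argv.append("--hidethinking")
    argv.append(prompt)
    return argv


command = build_run_command("qwen3:8b", "Is 221 prime?", think=True, hide=True)
print(shlex.join(command))
```

The resulting list can be passed straight to `subprocess.run(command, capture_output=True, text=True)`, which avoids shell quoting problems with prompts that contain spaces or quotes.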
When to Use Thinking Mode and When to Turn It Off
Thinking mode is not always the right choice. The core tradeoff is quality versus speed. When thinking is enabled, the model generates a reasoning trace before answering. That can add hundreds or thousands of tokens to every response, and on slower hardware the delay is noticeable. On a machine with 8 GB of VRAM running qwen3:8b, a simple question might take 2 seconds without thinking and 5 to 8 seconds with it enabled.
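The arithmetic behind that delay is simple enough to sketch. The numbers below are illustrative assumptions rather than measurements: a generation speed of 25 tokens per second, a 50-token answer, and a 150-token reasoning trace. The function name `estimated_seconds` is hypothetical.

```python
def estimated_seconds(answer_tokens: int, thinking_tokens: int, tokens_per_second: float) -> float:
    """Estimate generation time, ignoring prompt processing and model load."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return (answer_tokens + thinking_tokens) / tokens_per_second


# Illustrative numbers only: 25 tokens/s, a 50-token answer, a 150-token trace.
without_thinking = estimated_seconds(50, 0, 25.0)
with_thinking = estimated_seconds(50, 150, 25.0)
print(f"{without_thinking:.1f} s without thinking, {with_thinking:.1f} s with thinking")
```

Because the trace is generated token by token like any other output, its length scales the wait directly: a trace three times longer than the answer makes the response take four times as long.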
| Task type | Thinking mode | Reason |
|---|---|---|
| Maths and problem solving | Enable | Step-by-step reasoning dramatically improves accuracy |
| Complex coding tasks | Enable | Model works through logic before writing code |
| Multi-step analysis or planning | Enable | Structured reasoning catches errors early |
| Debugging and code review | Enable | Reasoning traces reveal how the model reads your code |
| Simple factual questions | Disable | No reasoning benefit, adds unnecessary latency |
| Summarisation or classification | Disable | The task does not benefit from chain-of-thought |
| Chat and conversational use | Disable | Responses feel unnatural with visible reasoning traces |
| Production API | Disable or use --hidethinking | Latency is the priority; use --hidethinking to keep quality |
For reasoning and maths tasks, even the strongest models available in Ollama benefit from having thinking mode enabled. For everything else, the overhead is rarely worth it. The quality improvement is meaningful for tasks that involve multi-step logic. For a casual question, you will not notice any difference in output quality, but you will certainly notice the wait.
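In an application that handles several kinds of request, the recommendations in the table above can be encoded as a lookup. This is a sketch: the task labels and the names `THINK_BY_TASK` and `should_think` are hypothetical, and unknown tasks fall back to thinking off on the assumption that latency matters more by default.

```python
# Task categories from the table above mapped to a thinking setting.
THINK_BY_TASK = {
    "maths": True,
    "coding": True,
    "analysis": True,
    "debugging": True,
    "factual": False,
    "summarisation": False,
    "classification": False,
    "chat": False,
}


def should_think(task: str, default: bool = False) -> bool:
    """Look up the recommended setting; unknown tasks fall back to default."""
    return THINK_BY_TASK.get(task.strip().lower(), default)


for task in ("maths", "summarisation", "translation"):
    print(task, should_think(task))
```

The returned boolean can be passed directly as the think value in an API request.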
Disabling Thinking Mode by Default in a Modelfile
If you find yourself always running Qwen3 with thinking disabled, you can make that the default by creating a Modelfile. This saves you from passing the flag every time you start a session, and it works cleanly in UI tools like Open WebUI where you cannot pass raw API parameters per request. At the time of writing, the documented Modelfile PARAMETER options do not include a thinking switch, so one workaround is to put Qwen3's /no_think soft switch in the system prompt:

    FROM qwen3:8b
    SYSTEM /no_think

Save that as a file called Modelfile, then build a custom model from it:

    ollama create qwen3-fast -f Modelfile
    ollama run qwen3-fast

Thinking should now be off by default for that model. You can still request reasoning for a particular message by starting it with /think. Create the model once, then select it from any Ollama-compatible interface and it will behave consistently without any extra configuration.
Thinking mode is one of the most significant additions to Ollama in recent months, and Qwen3 is the model that made it genuinely practical. Whether you leave it on for deep analysis, turn it off for speed, or hide the trace with --hidethinking for production use, knowing how to control it puts you in a much stronger position than relying on the defaults. For a broader overview of how Ollama works and the full range of models it supports, the complete Ollama guide covers everything you need.