Ollama Cloud launched in September 2025 and quietly changed what local AI means. The feature lets you run models like DeepSeek-V3.1 (671B parameters) or Qwen3-Coder (480B) from any machine — laptop, Raspberry Pi, CI pipeline — without a local GPU, without downloading model weights, and using the exact same Ollama commands you already know. The only change is a :cloud suffix on the model name. This guide explains how the routing works under the hood, what the 37 available cloud models are, how GPU-time pricing differs from token-based competitors, and when running cloud inference actually makes sense over local.
What Is Ollama Cloud?
Ollama Cloud is a managed inference service built directly into the Ollama runtime. It extends the local Ollama daemon to route certain model requests to Ollama’s own datacentres across the US, Europe, and Asia-Pacific, rather than executing on local hardware. Cloud models are identified by a cloud tag: either :cloud on its own (glm-5:cloud) or a -cloud suffix after the size (deepseek-v3.1:671b-cloud). They never need to be downloaded, so there is no ollama pull before running, no disk space consumed, and no VRAM limit to worry about.
From an application’s perspective, nothing changes. The client still sends requests to localhost:11434. The local daemon detects the :cloud suffix, attaches your authentication credentials, and proxies the request to Ollama’s infrastructure. The response streams back in real time. Every tool that works with local Ollama — Open WebUI, Python SDK, JavaScript SDK, any OpenAI-compatible client — works identically with cloud models.
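To make the "nothing changes" point concrete, here is a minimal standard-library sketch that builds the same native chat request for a local model and a cloud model. The helper names `build_chat_request` and `send` are hypothetical; the address and the model names are the ones used in this guide, and `/api/chat` is the daemon's native chat path.

```python
import json
import urllib.request

OLLAMA_LOCAL = "http://localhost:11434"  # default local daemon address from the text


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat request for the local daemon."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_LOCAL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant text (needs a running daemon)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["message"]["content"]


local_req = build_chat_request("llama3.2:3b", "Hello")
cloud_req = build_chat_request("gpt-oss:120b-cloud", "Hello")

# Same URL and method for both; only the "model" field in the body differs.
print(local_req.full_url == cloud_req.full_url)
```

Calling `send(cloud_req)` requires a running daemon and a completed sign-in; the builder itself has no side effects.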
How Ollama Cloud Routing Works
Understanding the architecture avoids the most common gotchas. When you make a request with a :cloud model name, the local Ollama daemon’s generate and chat handlers detect a modelSourceCloud flag, normalise the model name for the remote endpoint, attach auth headers from your stored credentials, and forward the request to Ollama’s cloud infrastructure. The response is streamed back through the same local proxy in exactly the same format as a local inference response.
This proxy architecture means:
- No model download — the model lives on Ollama’s servers, not yours
- Identical API surface — same endpoints, same request format, same streaming behaviour
- Auth is transparent: once you run ollama signin, credentials are stored and attached automatically
- Local models are unaffected: cloud and local models coexist; model names route to the correct backend
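The routing decision described above can be modelled in a few lines. This is a toy illustration only, not Ollama's actual implementation: the function names are hypothetical, and the normalisation rule (drop the cloud marker from the tag) is an assumption based on the model names shown in this guide.

```python
def is_cloud_model(name: str) -> bool:
    """True when the tag is 'cloud' or ends in '-cloud' (for example '671b-cloud')."""
    _, _, tag = name.partition(":")
    return tag == "cloud" or tag.endswith("-cloud")


def remote_name(name: str) -> str:
    """Strip the cloud marker, mimicking the 'normalise the model name' step."""
    base, _, tag = name.partition(":")
    if tag == "cloud":
        return base
    if tag.endswith("-cloud"):
        return f"{base}:{tag[:-len('-cloud')]}"
    return name


def route(name: str) -> tuple[str, str]:
    """Return (backend, model name that backend would receive)."""
    if is_cloud_model(name):
        return ("cloud", remote_name(name))
    return ("local", name)


for model in ("llama3.2:3b", "deepseek-v3.1:671b-cloud", "glm-5:cloud"):
    print(model, "->", route(model))
```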
One environment variable controls cloud access entirely: setting OLLAMA_NO_CLOUD=1 disables all cloud routing, useful for air-gapped environments or enforcing fully local operation.
Available Ollama Cloud Models
As of April 2026, there are 37 cloud models available. The full list is at ollama.com/search?c=cloud. Here are the most notable, grouped by use case:
| Model | Parameters | Context | Best for |
|---|---|---|---|
| qwen3-coder:480b-cloud | 480B | 262K | Coding, agents |
| devstral-2:123b-cloud | 123B | 262K | Coding agents (Mistral) |
| deepseek-v3.1:671b-cloud | 671B | 164K | General reasoning |
| deepseek-v3.2:cloud | 671B | 164K | General reasoning (latest) |
| gpt-oss:120b-cloud | 120B | 131K | General purpose |
| gpt-oss:20b-cloud | 20B | 131K | Faster, lighter tasks |
| glm-5:cloud | 744B (40B active) | 203K | MoE efficiency |
| kimi-k2.5:cloud | — | 262K in / 262K out | Long-context tasks |
| kimi-k2.6:cloud | — | 262K | Multimodal + long context |
| gemma4:31b-cloud | 31B | 262K | Google’s compact model |
| gemini-3-flash-preview:cloud | — | 1M | Massive context tasks |
| nemotron-3-nano:30b-cloud | 30B | 1M | NVIDIA’s 1M context model |
| mistral-large-3:675b-cloud | 675B | 262K | Mistral’s flagship |
| qwen3.5:cloud | 0.8B–122B | 262K | Multiple sizes via same name |
Notable: Gemini-3-Flash and Nemotron-3-Nano offer 1 million token context windows — entirely impractical to run locally, trivial to use via :cloud.
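If context window is the deciding factor, the table can be turned into a small lookup. The sketch below copies the context figures from the table (treating 1M as 1000K); the helper name `models_with_context` is hypothetical, and the figures should be refreshed from the live model list before relying on them.

```python
# Context windows in thousands of tokens, copied from the table above (1M = 1000K).
CLOUD_CONTEXT_K = {
    "qwen3-coder:480b-cloud": 262,
    "devstral-2:123b-cloud": 262,
    "deepseek-v3.1:671b-cloud": 164,
    "deepseek-v3.2:cloud": 164,
    "gpt-oss:120b-cloud": 131,
    "gpt-oss:20b-cloud": 131,
    "glm-5:cloud": 203,
    "kimi-k2.5:cloud": 262,
    "kimi-k2.6:cloud": 262,
    "gemma4:31b-cloud": 262,
    "gemini-3-flash-preview:cloud": 1000,
    "nemotron-3-nano:30b-cloud": 1000,
    "mistral-large-3:675b-cloud": 262,
    "qwen3.5:cloud": 262,
}


def models_with_context(min_tokens: int) -> list[str]:
    """Models whose listed window is at least min_tokens, in table order."""
    return [m for m, k in CLOUD_CONTEXT_K.items() if k * 1000 >= min_tokens]


# Which listed models could hold a 500K-token prompt?
print(models_with_context(500_000))
```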
Getting Started with Ollama Cloud
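Three steps. Make sure you are on a recent Ollama release first: cloud models shipped with version 0.12 in September 2025, so older builds (including 0.6.x) will not recognise them. See how to update Ollama if needed.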
Step 1: Sign in
ollama signin
This opens a browser tab to ollama.com/connect where you log in with your Ollama account. Credentials are stored locally and attached automatically to all subsequent cloud model requests.
Step 2: Run a cloud model
ollama run deepseek-v3.1:671b-cloud
No pull required. The model runs instantly on Ollama’s infrastructure. You will see streaming output in your terminal exactly as you would with a local model.
Step 3: Use it in your code
import ollama

response = ollama.chat(
    model='deepseek-v3.1:671b-cloud',
    messages=[{'role': 'user', 'content': 'Explain attention mechanisms.'}],
)
print(response.message.content)
The Python SDK, JavaScript SDK, and any OpenAI-compatible client work without modification — just swap the model name.
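Because cloud responses stream back in the same format as local ones, the usual streaming loop applies unchanged. The sketch below is one way to consume a stream; `collect_stream` is a hypothetical helper, and it assumes each chunk exposes its text at `chunk["message"]["content"]`, which is how the Python SDK's streamed chat chunks are commonly read.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any], echo: bool = True) -> str:
    """Print streamed pieces as they arrive and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)


# With a running daemon and the ollama package installed, usage would look like:
#   import ollama
#   stream = ollama.chat(
#       model='deepseek-v3.1:671b-cloud',
#       messages=[{'role': 'user', 'content': 'Explain attention mechanisms.'}],
#       stream=True,
#   )
#   text = collect_stream(stream)
```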
The API Endpoint Gotcha (Direct Access Without the Local Daemon)
If you want to call Ollama Cloud directly from a server or container without a local Ollama daemon installed, use an API key and call the cloud endpoint directly. This is where most people get tripped up by a URL mismatch.
Get your API key from ollama.com/settings/keys, then:
export OLLAMA_API_KEY=your_key_here
# CORRECT:
curl https://ollama.com/v1/chat/completions \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v3.1:671b", "messages": [{"role": "user", "content": "Hello"}]}'
The common mistake is using /api/v1/ instead of /v1/ — the former returns a 404 with no helpful error message. The correct base URL is https://ollama.com/v1, which is OpenAI-compatible.
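A small guard can catch the path mistake before any request is sent. This is a sketch under the assumptions stated in this section (base URL `https://ollama.com/v1`, `/api/v1` being wrong); the function name `chat_completions_url` is hypothetical.

```python
CLOUD_BASE_URL = "https://ollama.com/v1"  # OpenAI-compatible base URL from the text


def chat_completions_url(base_url: str = CLOUD_BASE_URL) -> str:
    """Return the chat completions URL, rejecting the common /api/v1 mistake."""
    base = base_url.rstrip("/")
    # Check /api/v1 first: it also ends in /v1 and would otherwise slip through.
    if base.endswith("/api/v1"):
        raise ValueError(f"{base_url!r} uses /api/v1; the base should end in /v1")
    if not base.endswith("/v1"):
        raise ValueError(f"{base_url!r} should end in /v1")
    return base + "/chat/completions"


print(chat_completions_url())
```

Failing fast with a descriptive error is more useful here than the bare 404 the wrong path returns.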
Using the Python client directly against the cloud endpoint (no local daemon):
import os
from ollama import Client

client = Client(
    host='https://ollama.com',
    headers={'Authorization': 'Bearer ' + os.environ['OLLAMA_API_KEY']},
)
response = client.chat(
    'deepseek-v3.1:671b',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(response.message.content)
Note: when calling directly (no local daemon), omit the :cloud suffix — the cloud endpoint already knows it is serving cloud models. The :cloud suffix is only needed when routing through the local daemon.
Switching Between Local and Cloud Models
One of Ollama Cloud’s most practical design decisions: switching between local and cloud execution requires zero code changes beyond the model name. The :cloud suffix is the entire switch.
import ollama

messages = [{'role': 'user', 'content': 'Hello'}]

# Local execution: runs on your GPU
response = ollama.chat('llama3.2:3b', messages=messages)
# Cloud execution: identical call, runs on Ollama's servers
response = ollama.chat('gpt-oss:120b-cloud', messages=messages)
This makes it straightforward to build hybrid workflows: use a small local model for cheap, fast tasks and a large cloud model only for the steps that need it. The Ollama REST API guide covers the full parameter surface that applies to both local and cloud requests.
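One way to sketch such a hybrid workflow is a tiny router that escalates to the cloud model only when a task looks too big for the local one. The threshold, the `needs_deep_reasoning` flag, and the function names are all hypothetical choices for illustration; the two model names come from the example above.

```python
LOCAL_MODEL = "llama3.2:3b"         # small local model from the example above
CLOUD_MODEL = "gpt-oss:120b-cloud"  # large cloud model from the example above


def pick_model(prompt: str, needs_deep_reasoning: bool = False,
               local_char_budget: int = 8_000) -> str:
    """Choose the cloud model for long or hard prompts, the local one otherwise."""
    if needs_deep_reasoning or len(prompt) > local_char_budget:
        return CLOUD_MODEL
    return LOCAL_MODEL


def run(prompt: str, needs_deep_reasoning: bool = False) -> str:
    """Send the prompt to whichever model pick_model selects."""
    import ollama  # imported lazily so pick_model works without the package

    model = pick_model(prompt, needs_deep_reasoning)
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response.message.content


print(pick_model("Summarise this sentence."))
print(pick_model("Plan a multi-step refactor.", needs_deep_reasoning=True))
```

Because the call site is identical for both backends, the routing policy stays in one function and can be tuned without touching the rest of the application.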
Ollama Cloud Pricing Explained
Ollama Cloud uses GPU-time billing, not per-token pricing. This is a fundamental difference from OpenAI, Anthropic, and Groq — you pay for how long the GPU runs, not how many tokens you generate. There are no per-token caps published.
| Plan | Price | Concurrent models | Usage ceiling |
|---|---|---|---|
| Free | $0/month | 1 | Light usage |
| Pro | $20/month or $200/year | 3 | ~50× more than Free |
| Max | $100/month | 10 | ~5× more than Pro |
A few things worth knowing about the billing model:
- Session limits reset every 5 hours; weekly limits reset every 7 days
- Local inference is always unlimited and free; cloud billing only applies to :cloud model requests
- Concurrency matters: the Free tier allows only one cloud model at a time, which affects agent workflows that need parallel model calls
- Performance tiers: shared infrastructure runs at approximately 95 tokens/second; dedicated capacity runs at ~210 tokens/second
- Ollama has indicated that future plans will allow purchasing additional usage at per-token rates with cache-aware pricing
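Since usage is stated only as multiples of the Free tier, a little arithmetic helps compare the plans. The sketch below uses the approximate multipliers from the table, so the results are rough by construction; the unit "Free-sized allowance" and the names in the code are hypothetical conveniences.

```python
PLANS = {
    "Free": {"usd_per_month": 0, "concurrent_models": 1, "usage_vs_free": 1},
    "Pro": {"usd_per_month": 20, "concurrent_models": 3, "usage_vs_free": 50},
    # Max is listed as ~5x Pro, and Pro as ~50x Free, so roughly 250x Free.
    "Max": {"usd_per_month": 100, "concurrent_models": 10, "usage_vs_free": 50 * 5},
}


def usd_per_free_unit(plan: str) -> float:
    """Monthly price divided by usage, measured in Free-sized allowances."""
    p = PLANS[plan]
    return p["usd_per_month"] / p["usage_vs_free"]


# Yearly Pro billing ($200) versus twelve monthly payments ($20 each).
PRO_ANNUAL_SAVING_USD = 20 * 12 - 200

for name in ("Pro", "Max"):
    print(f"{name}: {PLANS[name]['usage_vs_free']}x Free usage, "
          f"${usd_per_free_unit(name):.2f} per Free-sized allowance")
print(f"Pro billed yearly saves ${PRO_ANNUAL_SAVING_USD} over monthly billing")
```

On these approximate figures, Pro and Max work out to about the same price per unit of usage, which suggests Max is mainly buying volume and concurrency rather than a discount.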
When to Use Cloud vs Local
Ollama Cloud is not a replacement for local inference — it is a complement. Here is a practical framework for deciding which to use:
Use Ollama Cloud when:
- You need a model that won’t fit locally — 671B models need 400GB+ VRAM; cloud has no such constraint
- Context window is the bottleneck — 1M token context on Gemini-3-Flash or Nemotron-3-Nano is impossible locally on consumer hardware
- You are running on a low-power machine — Raspberry Pi, old laptop, or a CI/CD pipeline where a local GPU is not available
- You want to evaluate a model before committing to hardware — test quality on cloud before buying a GPU to run it locally
- Team access — Pro and Max plans support multiple concurrent sessions, making shared team access practical
Stick with local inference when:
- Privacy is non-negotiable — even with Ollama’s no-data-retention policy, local inference keeps data entirely off any external server
- You have the hardware and the model fits — local inference has zero recurring cost and often lower latency for models up to ~30B on a good GPU
- High volume, sustained inference — at scale, GPU-time billing adds up; amortised hardware cost becomes cheaper
- You need Thinking Mode — Ollama’s thinking mode with Qwen3 currently works on local models; cloud availability varies by model
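The "will it fit locally" question can be roughed out with a rule of thumb. The sketch below assumes 4-bit quantised weights plus 20% overhead for KV cache and runtime buffers; both numbers are assumptions rather than figures from Ollama, chosen because they land close to the 400GB+ quoted above for a 671B model. The function names are hypothetical.

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: int = 4,
                      overhead: float = 1.2) -> float:
    """Rough memory estimate: weight bytes times an overhead factor (assumed)."""
    return params_billion * bits_per_weight / 8 * overhead


def recommend(params_billion: float, local_vram_gb: float,
              data_must_stay_local: bool = False) -> str:
    """Apply the framework above: privacy first, then whether the model fits."""
    if data_must_stay_local:
        # Privacy wins; if the model does not fit, pick a smaller local model.
        return "local"
    if estimated_vram_gb(params_billion) <= local_vram_gb:
        return "local"
    return "cloud"


print(round(estimated_vram_gb(671), 1), "GB estimated for a 671B model")
print(recommend(30, local_vram_gb=24))
print(recommend(671, local_vram_gb=24))
```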
Ollama Cloud vs Groq, OpenAI, and Anthropic
For developers choosing a cloud inference provider, here is how Ollama Cloud compares on the dimensions that matter most:
| | Ollama Cloud | Groq | OpenAI | Anthropic |
|---|---|---|---|---|
| Speed (tok/s) | ~95–210 | 300–1,240 | Variable | Variable |
| Pricing model | GPU-time subscription | Per-token | Per-token | Per-token |
| Custom models | Yes — push any Ollama model | No | Fine-tuning only | No |
| Local/cloud switching | Single model name change | Different SDK/endpoint | Different SDK/endpoint | Different SDK/endpoint |
| Data retention | None (per Ollama) | Limited | Yes (opt-out) | Limited |
| Free tier | Yes | Yes | No | No |
Groq wins on raw speed. Ollama Cloud wins on custom model support and the seamless local/cloud switching experience. If you are already using Ollama locally, Cloud is the lowest-friction way to add large-model access — no new SDKs, no new endpoints to learn, no parallel toolchain to maintain.
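To put the throughput row in practical terms, the quoted rates can be converted into generation time for a response of a given length. This sketch uses only the tokens-per-second figures from the table and ignores network latency and time to first token, so treat the output as a lower bound; the 2,000-token response length is an arbitrary example.

```python
# Tokens per second, taken from the comparison table above.
THROUGHPUT_TOK_S = {
    "Ollama Cloud (shared)": 95,
    "Ollama Cloud (dedicated)": 210,
    "Groq (low end)": 300,
    "Groq (high end)": 1240,
}


def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Pure generation time, excluding latency and queueing."""
    return tokens / tok_per_s


RESPONSE_TOKENS = 2_000  # arbitrary example length

for name, rate in THROUGHPUT_TOK_S.items():
    print(f"{name}: {seconds_to_generate(RESPONSE_TOKENS, rate):.1f}s")
```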
Troubleshooting Ollama Cloud
Cloud model returns 401 Unauthorized
Your session has expired or credentials were not stored correctly. Run ollama signin again and try the request. If you are using a direct API key, verify the key at ollama.com/settings/keys and confirm OLLAMA_API_KEY is exported in your shell session.
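For direct API access, a missing or empty key is easier to diagnose if the client checks for it before sending anything. A minimal sketch, assuming the key lives in OLLAMA_API_KEY as described earlier; the helper name `auth_headers` is hypothetical.

```python
import os
from typing import Mapping, Optional


def auth_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the Authorization header, failing early if the key is missing."""
    env = os.environ if env is None else env
    key = env.get("OLLAMA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OLLAMA_API_KEY is not set; export it in the shell that runs this code"
        )
    return {"Authorization": f"Bearer {key}"}


# Passing a plain dict makes the behaviour easy to see without touching the real environment.
print(auth_headers({"OLLAMA_API_KEY": "example-key"}))
```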
Model runs locally instead of on cloud (responds instantly with wrong output)
Check the model name: the cloud suffix (:cloud or -cloud, depending on the model) is required. ollama run deepseek-v3.1:671b and ollama run deepseek-v3.1:671b-cloud are different models. If you see a fast response that looks like a much smaller model, the local version ran instead.
Streaming stops mid-response
Usually a network timeout between your machine and Ollama’s servers. Retry the request. If it happens consistently, check whether a proxy or firewall is terminating long-lived connections — Ollama Cloud uses the same streaming protocol as local inference and requires persistent connections for long responses.
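The "retry the request" advice can be automated with a small backoff wrapper. This is a generic sketch, not something Ollama provides: `with_retries` is a hypothetical helper, and the default exception types are an assumption that should be adjusted to whatever your client library actually raises on a dropped connection.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run call(); on a retryable error wait 1x, 2x, 4x... base_delay and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Usage with the ollama package (needs a running daemon) would look like:
#   import ollama
#   text = with_retries(lambda: ollama.chat(
#       model='deepseek-v3.1:671b-cloud',
#       messages=[{'role': 'user', 'content': 'Hello'}],
#   ).message.content)
```

Note that a retry restarts the whole generation, so it costs extra GPU time; it does not resume a partially streamed response.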
OLLAMA_NO_CLOUD=1 set but cloud models still route
Confirm the variable is exported (not just set) in the shell where Ollama is running: export OLLAMA_NO_CLOUD=1. If Ollama is running as a systemd service, add the variable to the service override file rather than the shell environment — the service does not inherit interactive shell variables.
Direct API calls return 404
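You are using the wrong endpoint path. The OpenAI-compatible URL is https://ollama.com/v1/chat/completions; a path starting with /api/v1/ does not exist. The native /api/chat path, which the Python client shown earlier uses against the same host, speaks Ollama’s own request and response format, so OpenAI-style clients should not be pointed at it. Double-check the URL and retry.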
The Minions Protocol: Local + Cloud Collaboration
Stanford’s Hazy Research lab built a framework called Minions, which pairs small local models served by Ollama with a large cloud model, and it shows what hybrid local/cloud inference can look like in practice. Rather than routing every request to the cloud, Minions uses the cloud model as an orchestrator and small local models as workers.
In Minion mode, a cloud model and a single local model work in dialogue: the cloud model breaks down the problem and critiques the local model’s output. The reported result is a 30× cost reduction versus full cloud inference while retaining 87% of full cloud performance.
In MinionS mode, the cloud model distributes subtasks to multiple local model instances running in parallel. Results: 5.7× cost reduction with 97.9% of full cloud performance retained.
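The two sets of figures trade cost against quality in different ways, which a quick calculation makes visible. The sketch below uses only the numbers quoted above; the dictionary layout and function name are hypothetical.

```python
# Cost reduction factor and retained quality, as quoted above.
MODES = {
    "Minion": {"cost_reduction": 30.0, "quality_retained": 0.87},
    "MinionS": {"cost_reduction": 5.7, "quality_retained": 0.979},
}


def relative_cost(mode: str) -> float:
    """Fraction of the full-cloud cost that remains after the reduction."""
    return 1 / MODES[mode]["cost_reduction"]


for name, stats in MODES.items():
    print(f"{name}: {relative_cost(name) * 100:.1f}% of full cloud cost, "
          f"{stats['quality_retained'] * 100:.1f}% of full cloud performance")
```

Read this way, Minion spends roughly 3% of the full cloud cost for 87% of the performance, while MinionS spends roughly 18% to recover almost all of it.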
For most users, Minions is an advanced use case, but it illustrates the direction Ollama Cloud is pointing: not a replacement for local inference, but a top-of-hierarchy coordinator that makes the combination of local hardware and cloud capacity more capable than either alone.






