Running Local AI Models with Claude Code
Pointing Claude Code at Ollama: open-source models on your machine, GLM-5.1 and Kimi K2.6 on Ollama Cloud, and the speed trade-off you actually live with.
I’ve been using Claude Code for months. It’s fast and reads my codebase well. But I wanted to try other models without spending more on API credits, and to keep some of the work on my machine.
Ollama now speaks the Anthropic Messages API natively, so Claude Code can talk to any model Ollama serves, local or cloud. No middleware, no proxy.
The setup
Since Ollama v0.14, the daemon exposes an Anthropic-compatible endpoint at /v1/messages. The manual route is three environment variables:
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434
claude --model qwen3.5:27b
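Before involving Claude Code, you can sanity-check the endpoint directly. A minimal sketch, assuming Ollama accepts the standard Anthropic Messages request shape (model, max_tokens, messages) and is loose about the auth headers:

# ask the local daemon for a short reply through the Anthropic-compatible endpoint
curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "qwen3.5:27b", "max_tokens": 64, "messages": [{"role": "user", "content": "Reply with one word."}]}'

If that comes back as a JSON message rather than an error, Claude Code will work against the same URL.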
On v0.15 or newer there’s a single-command path that wires everything for you:
ollama launch claude --model qwen3.5:27b
ollama launch handles the env vars and hands off to the Claude Code binary. The official guidance is at least 64K context for Claude Code’s agentic loop to behave; smaller windows force aggressive truncation and the tool calls start falling apart.
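Ollama's default context window is usually well below 64K, so it may need to be raised explicitly. One way, sketched here with the qwen3.5:27b tag from above (the derived model name is made up; num_ctx is Ollama's standard Modelfile parameter), is to bake a larger window into a derived model:

# Modelfile: same weights, larger context window
FROM qwen3.5:27b
PARAMETER num_ctx 65536

ollama create qwen3.5-64k -f Modelfile
ollama launch claude --model qwen3.5-64k

The bigger window isn't free: the KV cache grows with context, so expect the derived model to need noticeably more RAM or VRAM than the default configuration.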
What local models actually get right
I’ve been running everything from Gemma 3 (4B) on my laptop up to GLM-4.7-Flash (30B total / 3B active, MoE) through Ollama, using them to tweak and refine this site. The results were better than I expected.
They handle code generation, scaffolding, refactoring, and code review without much hand-holding. On single-file work, the better local models land somewhere around 70–85% of frontier cloud quality, and the code never leaves the machine. That second part matters when something is under NDA or simply not yours to upload.
Where they slip is multi-file architectural reasoning and long agentic chains. Convention matching is the other weak spot: local models tend to produce generically correct code rather than code that looks like it belongs in your repo.
The gap is closing quickly. I use them daily for first passes and incremental cleanup.
The cloud models: GLM-5.1 and Kimi K2.6
Ollama Cloud routes the same Anthropic API surface to models that would not fit on consumer hardware. The :cloud suffix tells Ollama to forward the request rather than load weights locally; from Claude Code’s perspective, nothing changes.
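The manual route from the setup section works the same way; only the model tag changes. The one extra step, assuming a stock install that hasn't used cloud models yet and that recent builds still handle account linking via ollama signin, is connecting the daemon to an Ollama account so it can forward requests:

ollama signin                  # one-time: link the local daemon to your Ollama account
claude --model glm-5.1:cloud   # same ANTHROPIC_* variables as the local setup

No weights are downloaded; the daemon just proxies the request to Ollama's hosted instance.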
GLM-5.1
ZAI’s (formerly Zhipu) flagship coding model, released April 2026. The published architecture is a 744B-parameter MoE with roughly 40B active per token. The Ollama library page lists a 198K context window and tool calling support, and the variant available is glm-5.1:cloud.
The benchmark to anchor on is 58.4 on SWE-Bench Pro, which is the highest open-weights score on that board at the moment. The more useful framing for daily work is the long-horizon claim: ZAI documents the model sustaining a single autonomous task for up to ~8 hours and around 1,700 reasoning steps. That maps cleanly to “kick off a refactor, walk away, come back to a branch.”
ollama launch claude --model glm-5.1:cloud
Open weights are published, but for Claude Code the cloud variant is the practical choice — Ollama’s library only ships the :cloud tag, so tool calling is what’s been validated on that path.
Kimi K2.6
Moonshot AI’s April 2026 release. The K2 family architecture is a 1T-total / 32B-active MoE with 384 routed experts (per Moonshot’s K2 paper); Ollama doesn’t republish those figures on the K2.6 library page, so treat them as upstream architecture rather than a guaranteed property of the cloud variant.
What the Ollama page does document for kimi-k2.6:cloud: native multimodal input (text + image), tool calling, and a 256K context window. Worth being honest about the rest: a lot of what Moonshot publishes about K2.6 — the agent swarm orchestration, the multi-thousand-step autonomous runs — is a property of Moonshot’s own platform and agent framework, not of the model-as-served-through-Ollama. Pointing Claude Code at :cloud gives you a strong long-context multimodal coder behind the Anthropic API, not a replica of what Moonshot ships directly.
ollama launch claude --model kimi-k2.6:cloud
The speed trade-off
Both local and Ollama Cloud paths are slower than calling Anthropic directly. A codebase investigation that ran in 1m 13s on Claude Opus took 1h 22m on a local model in the same project, roughly 67x slower. Even on a Mac Mini M4 Pro with 64GB of unified memory, a competent local model gives me 35–60 tokens/sec.
The workflow has to change to make that worth it. Instead of rapid-fire prompts, I queue up bigger chunks of work and use the wait time for code review, documentation, or stretching. Routing matters too: small refactors and reviews go to a local model, anything multi-file or autonomous goes to GLM-5.1 or Kimi K2.6 over :cloud.
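A small shell helper makes that routing a one-word decision. This is a hypothetical wrapper, not something Ollama or Claude Code ships; the function name is arbitrary and the model tags are the ones used in this post:

# hypothetical helper for ~/.zshrc or ~/.bashrc
claude_via() {
  case "$1" in
    local) ollama launch claude --model qwen3.5:27b ;;     # quick refactors and reviews
    glm)   ollama launch claude --model glm-5.1:cloud ;;   # long autonomous sessions
    kimi)  ollama launch claude --model kimi-k2.6:cloud ;; # vision input, long tool chains
    *)     echo "usage: claude_via local|glm|kimi" ;;
  esac
}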
What used to require a $100/month subscription now runs on hardware I already owned, or through Ollama Cloud at a fraction of the price. Patience is the bill.
What I’m using
After a few months of this:
- Claude Opus as the daily driver for anything that has to ship today
- Qwen 3.5 (27B) locally for refactoring and code review on this site
- GLM-5.1 (:cloud) for long autonomous sessions
- Kimi K2.6 (:cloud) when I need vision input or a tool-call chain that won’t terminate cleanly on smaller models
Switching between them is one command. The setup takes about five minutes the first time and zero after that.
Resources:
- Ollama
- Claude Code documentation
- Ollama Anthropic API compatibility
- ollama launch blog post
- GLM-5.1 on Ollama
- Kimi K2.6 on Ollama
Be awesome.
Keep building magic. ✊
Petar 🥃