# SerpentStack model-routing

Delegate coding subtasks to on-device local models (Ollama) to reduce cloud API costs while keeping the orchestrating model for planning and review. Use when: the user wants to save on API costs, asks about local models, mentions Ollama, or wants to use on-device models for code generation.

## Install

Clone the upstream repo:

```shell
git clone https://github.com/Benja-Pauls/SerpentStack
```

Claude Code: install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/Benja-Pauls/SerpentStack "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.skills/model-routing" ~/.claude/skills/benja-pauls-serpentstack-model-routing && rm -rf "$T"
```

Manifest: `.skills/model-routing/SKILL.md`

## Source content

### Model Routing: Cloud Orchestration + Local Code Generation

Use expensive cloud models (Opus, Sonnet) for planning, review, and orchestration, and delegate token-heavy code generation to on-device models via Ollama. Because generation accounts for most output tokens in coding-heavy sessions, routing it locally can cut cloud token costs substantially; the Cost Comparison section below works through a typical example.

### Prerequisites

- Ollama installed and running (`ollama serve`)
- A coding-capable model pulled: `ollama pull qwen3-coder:30b` (recommended) or `ollama pull glm-4.7-flash`
- Minimum 16GB RAM (24GB+ recommended for best results)
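Before relying on routing, it helps to confirm Ollama is actually reachable. A minimal Python check against Ollama's `/api/tags` endpoint (default port 11434; the model-name check mirrors the recommendation above and is just an example):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def ollama_models(base_url: str = OLLAMA_URL):
    """Return names of locally pulled models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:  # connection refused, DNS failure, timeout, ...
        return None

models = ollama_models()
if models is None:
    print("Ollama is not running; start it with: ollama serve")
elif not any(n.startswith("qwen3-coder") for n in models):
    print("No coding model pulled; run: ollama pull qwen3-coder:30b")
else:
    print("Ready:", ", ".join(models))
```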

### How It Works

```
You (developer)
  |
  v
Cloud Model (Sonnet/Opus) — orchestration, planning, review
  |
  |-- "Write the service layer for Projects"
  |       |
  |       v
  |   Local Model (Ollama) — code generation subagent
  |       |
  |       returns generated code
  |
  |-- Reviews output, checks against project conventions
  |-- Requests corrections if needed
  |-- Commits the final result
```

The cloud model decides WHAT to build and HOW it should work. The local model does the token-heavy GENERATION. The cloud model reviews the output.
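As a sketch, that generate-review loop might look like this in Python. The `/api/generate` request shape is Ollama's real non-streaming API; `passes_review` is a placeholder for the cloud model's review step, and the function names are illustrative, not part of any SerpentStack or Claude Code API:

```python
import json
import urllib.request

def generate_local(prompt: str, model: str = "qwen3-coder:30b",
                   base_url: str = "http://localhost:11434") -> str:
    """Send one non-streaming generation request to a local Ollama instance."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def passes_review(code: str) -> bool:
    """Placeholder for the cloud model's review (conventions, lint, tests)."""
    return bool(code.strip())

def orchestrate(task: str, max_attempts: int = 3) -> str:
    """Cloud model plans and reviews; local model does the token-heavy generation."""
    prompt = f"Generate only the requested code, no explanations.\nTask: {task}"
    for _ in range(max_attempts):
        code = generate_local(prompt)   # token-heavy step runs locally, for free
        if passes_review(code):         # review stays on the cloud model
            return code
        prompt += "\nThe previous attempt failed review; fix and regenerate."
    raise RuntimeError("Local model could not produce acceptable code")
```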

### Setup

#### Option 1: Claude Code Subagent (Recommended)

Create a custom subagent definition in your project that delegates to a local Ollama instance. In your `.claude/settings.json` or project config:

```json
{
  "subagents": {
    "local-coder": {
      "description": "Fast local model for code generation tasks",
      "provider": "ollama",
      "model": "qwen3-coder:30b",
      "base_url": "http://localhost:11434",
      "tools": ["Read", "Write", "Edit", "Glob", "Grep", "Bash"],
      "prompt": "You are a code generation assistant. Follow the project conventions described below exactly. Generate only the requested code — no explanations unless asked."
    }
  }
}
```

Then in your workflow, the orchestrating model can delegate: "Use the local-coder subagent to generate the service file for Projects following the template in `.skills/scaffold/SKILL.md`."

#### Option 2: Ollama as Primary with Cloud Fallback

Set Ollama as the default model and use cloud models only for complex reasoning:

```shell
# In your shell profile or .env
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export CLAUDE_MODEL=qwen3-coder:30b
```

This makes ALL generation local. Use `/model sonnet` in Claude Code to switch to cloud for planning/review tasks.

#### Option 3: Manual Delegation

Without any configuration, you can simply instruct the agent:

"For the next code generation task, spawn a subagent using the local Ollama model. Have it generate the code, then review the output yourself before applying it."

### What to Delegate Locally

Good candidates for local generation (token-heavy, pattern-following):

- Boilerplate code from templates (models, schemas, services, routes)
- Test files following existing patterns
- Frontend components matching existing conventions
- Migration files
- API client functions
- Type definitions

Keep on cloud (requires reasoning, project understanding):

- Architecture decisions
- Debugging complex errors
- Code review and convention checking
- Multi-file refactors that require understanding dependencies
- Skill authoring and updates

### Project Context for Local Models

Local models don't have your conversation history. When delegating, include the relevant skill content in the subagent prompt. For example, when generating a new service:

"Read `.skills/scaffold/SKILL.md` section '3. Service Layer' and generate a service file for Projects following that exact template. The resource name is `project`, the model is `Project`."

This gives the local model enough context to produce correct code without needing the full conversation.
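Mechanically, that delegation message just inlines the skill file so the local model sees the template. A hypothetical helper along those lines (the function and its prompt wording are illustrative, not a SerpentStack API):

```python
from pathlib import Path

def build_delegation_prompt(skill_path: str, section: str,
                            resource: str, model_class: str) -> str:
    """Inline a skill file so the local model gets context it otherwise lacks."""
    skill = Path(skill_path).read_text()  # local models have no conversation history
    return (
        f"Follow the template in section '{section}' of the skill below exactly.\n"
        f"The resource name is {resource}, the model is {model_class}.\n"
        f"Generate only the requested code, no explanations.\n\n"
        f"--- SKILL ---\n{skill}"
    )
```

The returned string would be passed as the subagent's prompt, e.g. `build_delegation_prompt(".skills/scaffold/SKILL.md", "3. Service Layer", "project", "Project")`.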

### Cost Comparison

| Task | Cloud Tokens | Local Tokens | Savings |
|---|---|---|---|
| Generate a CRUD service (~200 lines) | ~800 output tokens @ $0.015/1k | 0 (free, on-device) | 100% |
| Generate tests (~300 lines) | ~1,200 output tokens | 0 | 100% |
| Full resource scaffold (8 files) | ~4,000 output tokens | 0 | 100% |
| Planning + review overhead | ~2,000 tokens (still on cloud) | N/A | N/A |

A typical "add a new resource" flow: ~2,000 cloud tokens for orchestration and review plus ~4,000 local tokens for generation, versus ~6,000 cloud tokens without routing, for a cloud-token reduction of roughly 67%.
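The arithmetic behind that estimate, using the table's illustrative $0.015/1k output-token rate (not a current price list):

```python
def cloud_cost(tokens: int, rate_per_1k: float = 0.015) -> float:
    """Dollar cost of cloud output tokens; locally generated tokens cost $0."""
    return tokens / 1000 * rate_per_1k

without_routing = cloud_cost(6_000)  # everything generated on the cloud model
with_routing = cloud_cost(2_000)     # only orchestration + review stay on cloud
savings = 1 - with_routing / without_routing
print(f"${without_routing:.3f} vs ${with_routing:.3f}: {savings:.0%} saved")
```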

### Recommended Models (March 2026)

| Model | VRAM | Best For |
|---|---|---|
| `qwen3-coder:30b` | 16GB+ | General coding, strong tool calling, 256K context |
| `glm-4.7-flash` | 24GB | Best quality on 24GB setups |
| `qwen3-coder:8b` | 8GB | Budget option, still capable for template-following |
| `deepseek-coder-v3:16b` | 12GB | Good balance of size and quality |

### Verification

After any locally-generated code is applied:

```shell
make verify   # lint + typecheck + test — catches anything the local model got wrong
```

If verification fails, the cloud model reviews the error and either fixes it or re-delegates with more specific instructions.
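That fix-or-re-delegate step can be sketched as a retry loop; here `make verify` is assumed to be the project's own target, and the feedback-message format and function names are illustrative:

```python
import subprocess

def verify_and_retry(regenerate, max_attempts: int = 2,
                     verify_cmd=("make", "verify")) -> bool:
    """Run the verify command; on failure, hand the output to `regenerate` and retry."""
    for attempt in range(max_attempts + 1):
        result = subprocess.run(verify_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # Feed the failure output back so the next attempt is targeted.
            regenerate(f"Verification failed:\n{result.stderr[-2000:]}")
    return False
```

`regenerate` would be the callback that re-delegates to the local model with more specific instructions.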