Skills api-ai-ollama
Local LLM inference with the Ollama JavaScript client -- chat, streaming, tool calling, vision, embeddings, structured output, model management, and OpenAI-compatible endpoint
git clone https://github.com/agents-inc/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/agents-inc/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/dist/plugins/api-ai-ollama/skills/api-ai-ollama" ~/.claude/skills/agents-inc-skills-api-ai-ollama && rm -rf "$T"
dist/plugins/api-ai-ollama/skills/api-ai-ollama/SKILL.md

Ollama Patterns
Quick Guide: Use the `ollama` npm package to run LLMs locally. Use `ollama.chat()` for conversations and `ollama.generate()` for single prompts. Enable streaming with `stream: true` and iterate with `for await`. Use `format` with a JSON schema (via `zodToJsonSchema`) for structured outputs. Use `tools` array for function calling. Use `ollama.embed()` for embeddings. Models run on your machine -- no API keys required for local use, but be aware of model loading time and memory usage.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST use `ollama.chat()` for conversations and `ollama.generate()` for single-prompt completions -- they have different parameter shapes)
(You MUST handle model loading delays -- the first request after a model is loaded takes significantly longer due to model initialization)
(You MUST use `zodToJsonSchema()` from `zod-to-json-schema` for structured outputs -- do NOT manually construct JSON schemas)
(You MUST accumulate streamed `thinking`, `content`, and `tool_calls` fields to maintain conversation history in multi-turn interactions)
(You MUST never assume a model is already pulled -- check with `ollama.list()` or handle errors from missing models gracefully)
</critical_requirements>
Auto-detection: Ollama, ollama, ollama.chat, ollama.generate, ollama.embed, ollama.pull, ollama.list, ollama.show, ollama.delete, ollama.ps, ollama.abort, ollama.create, keep_alive, zodToJsonSchema, OLLAMA_HOST, llama3, mistral, qwen, gemma, phi, deepseek, local LLM
When to use:
- Running LLMs locally for development, testing, or privacy-sensitive workloads
- Building chat applications with local models (Llama, Mistral, Qwen, Gemma, etc.)
- Extracting structured data from text or images using local models with JSON schemas
- Implementing tool calling / function calling with locally-hosted models
- Generating embeddings for RAG or semantic search without cloud API costs
- Managing local model lifecycle (pull, list, show, delete, copy)
- Prototyping AI features before committing to a cloud provider
Key patterns covered:
- Client setup (default and custom instances)
- Chat completions (`ollama.chat`) and text generation (`ollama.generate`)
- Streaming with `for await` and accumulated state
- Structured output with `format` + `zodToJsonSchema`
- Tool calling with `tools` array and multi-turn tool loops
- Vision / multimodal inputs with `images` parameter
- Embeddings with `ollama.embed()`
- Model management (pull, list, show, delete, copy, ps)
- OpenAI-compatible endpoint for drop-in migration
When NOT to use:
- Production workloads requiring guaranteed uptime and SLAs -- use a cloud LLM provider
- Multi-provider applications where you need to switch between OpenAI, Anthropic, Google -- use a unified provider SDK
- Applications requiring the latest proprietary models (GPT-5, Claude) -- those are cloud-only
Examples Index
- Core: Setup, Chat & Generate -- Client init, chat, generate, streaming, error handling
- Tool Calling -- Tool definitions, single/parallel calls, multi-turn agent loops
- Structured Output -- JSON schema via Zod, vision extraction
- Embeddings & Vision -- Embeddings, image analysis, multimodal
- Model Management -- Pull, list, show, delete, copy, ps
- Quick API Reference -- Method signatures, options, response types, model names
<philosophy>
Philosophy
The Ollama JavaScript library is a thin client over Ollama's local REST API (default
http://127.0.0.1:11434). It provides direct access to locally-running open-source LLMs with zero cloud dependencies.
Core principles:
- Local-first -- Models run on your hardware. No API keys required for local use, complete data privacy, no per-token costs. Trade-off: you need sufficient GPU/CPU memory.
- Simple API -- `ollama.chat()` and `ollama.generate()` are the two primary methods. The default import is a pre-configured singleton client; create custom instances with `new Ollama()` for non-default hosts.
- Streaming by default in REST, opt-in in SDK -- The REST API streams by default. The SDK returns full responses by default; set `stream: true` to get an `AsyncGenerator`.
- Model-agnostic -- The same API works with any Ollama-supported model (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, etc.). Model capabilities (vision, tool calling, structured output) depend on the model.
- OpenAI-compatible -- Ollama exposes `/v1/chat/completions` and `/v1/embeddings` endpoints, allowing the OpenAI SDK to connect with `baseURL: 'http://localhost:11434/v1'`.
When to use Ollama:
- Local development and testing without cloud costs
- Privacy-sensitive workloads where data cannot leave your infrastructure
- Prototyping AI features before choosing a cloud provider
- Running open-source models (Llama, Mistral, etc.) on your own hardware
- Offline or air-gapped environments
When NOT to use:
- Production workloads requiring high availability and SLAs
- Applications needing proprietary models (GPT-5, Claude)
- When you need a multi-provider abstraction layer
<patterns>
Core Patterns
Pattern 1: Client Setup
The default import is a pre-configured singleton pointing to
http://127.0.0.1:11434.
```typescript
// lib/ollama.ts -- default client (most common)
import ollama from "ollama";

// Use directly -- connects to localhost:11434
const response = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello" }],
});
```

```typescript
// lib/ollama.ts -- custom client for non-default host
import { Ollama } from "ollama";

const ollama = new Ollama({
  host: "http://192.168.1.100:11434",
});

export { ollama };
```
Why good: Minimal setup, default client requires zero configuration, custom client for remote servers
```typescript
// BAD: Hardcoding host inline everywhere
import { Ollama } from "ollama";

const response = await new Ollama({ host: "http://192.168.1.100:11434" }).chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello" }],
});
```
Why bad: Creates a new client instance per request, no reuse, host scattered across codebase
See: examples/core.md for cloud API setup, custom headers, browser usage
Pattern 2: Chat Completions
Multi-turn conversations with message history. You manage the messages array.
```typescript
import ollama from "ollama";

const response = await ollama.chat({
  model: "llama3.1",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
});

console.log(response.message.content);
```
Why good: Clear message roles, system message for behavior control, direct content access
```typescript
// BAD: Not checking response, no system message
const res = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "do something" }],
});
```
Why bad: No system instruction means unpredictable behavior, vague prompt
See: examples/core.md for multi-turn conversations, model options
Pattern 3: Text Generation
Single-prompt completions without message history. Simpler than chat for one-shot tasks.
```typescript
import ollama from "ollama";

const response = await ollama.generate({
  model: "llama3.1",
  prompt: "Write a haiku about TypeScript.",
  system: "You are a creative writer.",
});

console.log(response.response);
```
Why good: Simpler API for one-shot tasks, `system` parameter instead of message array
See: examples/core.md for generate with images, suffix, raw mode
Pattern 4: Streaming
Set `stream: true` to get an `AsyncGenerator`. Iterate with `for await`.
```typescript
import ollama from "ollama";

const stream = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Explain async/await." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
console.log(); // newline
```
Why good: Progressive output for better UX, memory-efficient for long responses
```typescript
// BAD: Not consuming the stream
const stream = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});
// Stream never consumed -- response is lost
```
Why bad: Stream must be consumed via iteration, otherwise the response is silently lost
See: examples/core.md for generate streaming, abort, thinking mode streaming
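When a streamed turn is part of a longer conversation, the per-chunk deltas must be joined back into one assistant message before the next request (per the accumulate-streamed-fields rule above). A minimal sketch, assuming the `Message` type is exported by the `ollama` package:

```typescript
import ollama from "ollama";
import type { Message } from "ollama";

const messages: Message[] = [
  { role: "user", content: "Explain async/await." },
];

const stream = await ollama.chat({ model: "llama3.1", messages, stream: true });

let content = "";
for await (const chunk of stream) {
  content += chunk.message.content; // accumulate every delta
  process.stdout.write(chunk.message.content);
}

// Push the full assistant turn so the next request sees it
messages.push({ role: "assistant", content });
```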
Pattern 5: Structured Output with Zod
Use `format` with a JSON schema to constrain model output. Use `zodToJsonSchema()` from `zod-to-json-schema` to convert Zod schemas.
```typescript
import ollama from "ollama";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const Country = z.object({
  name: z.string(),
  capital: z.string(),
  languages: z.array(z.string()),
});
type Country = z.infer<typeof Country>;

const response = await ollama.chat({
  model: "llama3.1",
  messages: [
    {
      role: "user",
      content: "Tell me about France. Respond in JSON.",
    },
  ],
  format: zodToJsonSchema(Country),
});

const country: Country = Country.parse(JSON.parse(response.message.content));
console.log(country.capital); // "Paris"
```
Why good: Type-safe output via Zod, JSON schema constrains model output, parse validates response
```typescript
// BAD: Using format: 'json' without a schema
const response = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Tell me about France as JSON" }],
  format: "json",
});
// No schema enforcement -- model can return any JSON shape
```
Why bad: `format: 'json'` only ensures valid JSON syntax, not structure -- use a JSON schema for reliable extraction
See: examples/structured-output.md for vision extraction, complex schemas
Pattern 6: Tool Calling
Define tools the model can request. Handle `tool_calls` in responses and feed results back.
```typescript
import ollama from "ollama";

const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        required: ["city"],
        properties: {
          city: { type: "string", description: "City name" },
        },
      },
    },
  },
];

const response = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "What is the weather in Tokyo?" }],
  tools,
});

if (response.message.tool_calls?.length) {
  for (const toolCall of response.message.tool_calls) {
    console.log(`Call: ${toolCall.function.name}`);
    console.log(`Args:`, toolCall.function.arguments);
  }
}
```
Why good: Standard tool schema format, checks for tool_calls before processing, arguments already parsed (not stringified JSON)
See: examples/tools.md for multi-turn tool loops, streaming tool calls, parallel tools
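To actually feed results back, one round of the tool loop might look like the hedged sketch below; `getWeather` is a hypothetical local function, and the exact tool-message shape may vary by package version:

```typescript
import ollama from "ollama";
import type { Message } from "ollama";

// Hypothetical local tool implementation (not part of the Ollama API)
const getWeather = async (city: string) => ({ city, tempC: 21 });

const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        required: ["city"],
        properties: { city: { type: "string" } },
      },
    },
  },
];

const messages: Message[] = [
  { role: "user", content: "What is the weather in Tokyo?" },
];

const first = await ollama.chat({ model: "llama3.1", messages, tools });

if (first.message.tool_calls?.length) {
  // Keep the assistant turn (with its tool_calls) in the history
  messages.push(first.message);

  for (const call of first.message.tool_calls) {
    const result = await getWeather(call.function.arguments.city as string);
    // Feed the result back as a tool message (stringify non-string results)
    messages.push({ role: "tool", content: JSON.stringify(result) });
  }

  // Second request lets the model answer from the tool results
  const final = await ollama.chat({ model: "llama3.1", messages });
  console.log(final.message.content);
}
```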
Pattern 7: Embeddings
Use `ollama.embed()` for text embeddings. Supports single or batch inputs.
```typescript
import ollama from "ollama";

const EMBEDDING_MODEL = "nomic-embed-text";

const response = await ollama.embed({
  model: EMBEDDING_MODEL,
  input: [
    "TypeScript is a typed superset of JavaScript.",
    "Rust is a systems programming language.",
  ],
});

console.log(`Vectors: ${response.embeddings.length}`);
console.log(`Dimensions: ${response.embeddings[0].length}`);
```
Why good: Batch multiple inputs in one call, named constant for model, returns array of number arrays
See: examples/embeddings-vision.md for semantic search, cosine similarity
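As a quick illustration of the semantic-search use case, a minimal sketch of cosine-similarity ranking over `ollama.embed()` vectors (plain TypeScript, no assumptions beyond the `ollama` package):

```typescript
import ollama from "ollama";

const EMBEDDING_MODEL = "nomic-embed-text";

// Cosine similarity between two equal-length vectors
const cosine = (a: number[], b: number[]): number => {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

const docs = ["TypeScript adds static types.", "Rust manages memory safely."];
const { embeddings } = await ollama.embed({ model: EMBEDDING_MODEL, input: docs });
const { embeddings: [query] } = await ollama.embed({
  model: EMBEDDING_MODEL,
  input: "Which language has a type system?",
});

// Rank documents by similarity to the query
const ranked = docs
  .map((doc, i) => ({ doc, score: cosine(query, embeddings[i]) }))
  .sort((a, b) => b.score - a.score);
console.log(ranked[0].doc);
```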
Pattern 8: Model Management
Pull, list, show, and delete models programmatically.
```typescript
import ollama from "ollama";

// List available models
const models = await ollama.list();
for (const model of models.models) {
  console.log(`${model.name} (${model.size} bytes)`);
}

// Pull a model with progress streaming
const stream = await ollama.pull({ model: "llama3.1", stream: true });
for await (const progress of stream) {
  console.log(
    `${progress.status}: ${progress.completed ?? 0}/${progress.total ?? 0}`,
  );
}

// Show model details
const info = await ollama.show({ model: "llama3.1" });
console.log(`Parameters: ${info.details.parameter_size}`);
console.log(`Quantization: ${info.details.quantization_level}`);

// Delete a model
await ollama.delete({ model: "old-model" });
```
Why good: Streaming progress for large downloads, programmatic model lifecycle, detailed model metadata
See: examples/model-management.md for copy, create, running models (ps)
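Tying this to the never-assume-a-model-is-pulled rule, a hedged sketch of a pull-if-missing helper; the `ensureModel` name is illustrative, not part of the Ollama API:

```typescript
import ollama from "ollama";

const ensureModel = async (model: string): Promise<void> => {
  // Names may carry tags (e.g. "llama3.1:latest"), so match both forms
  const { models } = await ollama.list();
  const installed = models.some(
    (m) => m.name === model || m.name.startsWith(`${model}:`),
  );
  if (installed) return;

  // Pull with progress so long downloads are visible
  const stream = await ollama.pull({ model, stream: true });
  for await (const progress of stream) {
    console.log(`${progress.status}: ${progress.completed ?? 0}/${progress.total ?? 0}`);
  }
};

await ensureModel("llama3.1");
```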
Pattern 9: OpenAI-Compatible Endpoint
Ollama exposes `/v1/chat/completions` and `/v1/embeddings` that work with the OpenAI SDK.
```typescript
import OpenAI from "openai";

const OLLAMA_BASE_URL = "http://localhost:11434/v1";

const client = new OpenAI({
  baseURL: OLLAMA_BASE_URL,
  apiKey: "ollama", // Required by SDK but unused by Ollama
});

const completion = await client.chat.completions.create({
  model: "llama3.1",
  messages: [{ role: "user", content: "Why is the sky blue?" }],
});

console.log(completion.choices[0].message.content);
```
Why good: Drop-in replacement for OpenAI SDK code, named constant for URL, easy to switch between local and cloud
When to use: When migrating existing OpenAI SDK code to local models, or when you want to use OpenAI SDK tooling (structured outputs, streaming helpers) with local models
When not to use: For new Ollama-native code, prefer the `ollama` package directly -- it exposes Ollama-specific features (model management, `keep_alive`, thinking mode) that the OpenAI compat layer does not
See: reference.md for supported and unsupported OpenAI features
</patterns>

<decision_framework>
Decision Framework
`ollama.chat()` vs `ollama.generate()`

```
Need multi-turn conversation history?
+-- YES -> ollama.chat() (messages array with roles)
+-- NO -> Is it a single prompt completion?
    +-- YES -> ollama.generate() (simpler API)
    +-- NO -> ollama.chat() (default choice for most use cases)
```
Native Ollama SDK vs OpenAI-Compatible Endpoint
```
Do you have existing OpenAI SDK code to migrate?
+-- YES -> Use OpenAI SDK with baseURL: 'http://localhost:11434/v1'
+-- NO -> Do you need Ollama-specific features?
    +-- YES -> Use ollama package (model management, keep_alive, thinking, abort)
    +-- NO -> Either works, prefer ollama package for new code
```
Structured Output vs Plain Text
```
Do you need structured data from the model?
+-- YES -> Use format parameter with zodToJsonSchema()
|   +-- Always include "respond in JSON" in the prompt
|   +-- Always parse and validate with Zod after receiving response
+-- NO -> Omit format parameter (plain text response)
```
Model Selection Guidance
```
What is your task?
+-- General chat / coding -> llama3.1 (8B for speed, 70B for quality)
+-- Fast + small -> phi4-mini, gemma3 (smaller memory footprint)
+-- Code generation -> qwen2.5-coder, deepseek-coder-v2
+-- Vision/multimodal -> llama3.2-vision, gemma3
+-- Embeddings -> nomic-embed-text, all-minilm
+-- Tool calling -> llama3.1, qwen3, mistral
+-- Reasoning/thinking -> qwen3 (with think: true), deepseek-r1
```
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Not handling model loading time -- first request after model load can take 30+ seconds on CPU; show a loading indicator or set `keep_alive` to keep models warm
- Using `format: 'json'` instead of a JSON schema -- only ensures valid JSON syntax, not structure; always use `zodToJsonSchema()` for reliable extraction
- Not accumulating streamed fields in multi-turn conversations -- you must collect `thinking`, `content`, and `tool_calls` from all chunks to maintain history
- Assuming all models support all features -- tool calling, vision, and structured output depend on the model; check model capabilities first
Medium Priority Issues:
- Not setting `keep_alive` for latency-sensitive applications -- models unload after 5 minutes of inactivity by default; set `keep_alive: '30m'` or `keep_alive: -1` (indefinite) for persistent sessions (see the sketch after this list)
- Creating new `Ollama()` instances per request instead of reusing a singleton client
- Not handling `AbortError` when using `ollama.abort()` -- in-flight stream iterations throw when cancelled
- Ignoring model size vs available memory -- loading a 70B model on 8GB RAM will fail or swap heavily
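A minimal sketch of the `keep_alive` fix mentioned above -- pass it on any chat or generate request to keep the model resident:

```typescript
import ollama from "ollama";

const response = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "ping" }],
  keep_alive: "30m", // model stays loaded for 30 minutes after this request
});
```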
Common Mistakes:
- Confusing `ollama.chat()` and `ollama.generate()` parameters -- `chat` uses `messages[]`, `generate` uses `prompt` and `system`
- Using `ollama.embeddings()` (deprecated) instead of `ollama.embed()` -- the newer `embed()` method supports batch inputs
- Passing image URLs to the `images` parameter -- Ollama expects `Uint8Array` or base64-encoded strings, not URLs (see the sketch after this list)
- Using tool calling with models that do not support it -- not all models handle tools; use Llama 3.1+, Qwen 3, or Mistral for reliable tool calling
- Forgetting to `JSON.parse()` the response content when using structured output -- Ollama returns JSON as a string in `message.content`, not a parsed object
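For the `images` gotcha above, a hedged sketch of passing local image bytes instead of a URL (assumes a vision-capable model such as `llama3.2-vision` is installed):

```typescript
import { readFile } from "node:fs/promises";
import ollama from "ollama";

// Node Buffer is a Uint8Array, so the raw bytes can be passed directly
const image = await readFile("./chart.png");

const response = await ollama.chat({
  model: "llama3.2-vision",
  messages: [
    { role: "user", content: "Describe this image.", images: [image] },
  ],
});
console.log(response.message.content);
```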
Gotchas & Edge Cases:
- Ollama returns tool call arguments as already-parsed objects (not JSON strings like OpenAI) -- `toolCall.function.arguments` is an object, not a string
- The `keep_alive` parameter accepts both duration strings (`'5m'`, `'1h'`) and numbers (seconds) -- `0` unloads immediately, `-1` keeps loaded indefinitely
- `ollama.abort()` cancels ALL active streams for that client instance, not individual requests
- Model names can include tags (`llama3.1:8b`, `llama3.1:70b`) -- omitting the tag uses the default (usually smallest)
- The `think` parameter enables extended reasoning but only works with models that support it (Qwen 3, DeepSeek R1) -- it adds a `thinking` field to the response alongside `content`
- Browser usage requires importing from `ollama/browser` instead of `ollama` -- the default import uses Node.js-specific APIs
- Cloud API access (ollama.com) requires an API key via `Authorization: Bearer` header and setting `host: 'https://ollama.com'`
- Response includes performance metrics: `total_duration`, `eval_count`, `eval_duration` (durations in nanoseconds) -- calculate tokens/second with `eval_count / eval_duration * 1e9` (see the snippet after this list)
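The tokens/second formula from the last gotcha, as a tiny snippet:

```typescript
import ollama from "ollama";

const response = await ollama.generate({
  model: "llama3.1",
  prompt: "Write a haiku about TypeScript.",
});

// eval_duration is in nanoseconds, so scale by 1e9 to get seconds
const tokensPerSecond = (response.eval_count / response.eval_duration) * 1e9;
console.log(`~${tokensPerSecond.toFixed(1)} tokens/s`);
```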
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST use `ollama.chat()` for conversations and `ollama.generate()` for single-prompt completions -- they have different parameter shapes)
(You MUST handle model loading delays -- the first request after a model is loaded takes significantly longer due to model initialization)
(You MUST use `zodToJsonSchema()` from `zod-to-json-schema` for structured outputs -- do NOT manually construct JSON schemas)
(You MUST accumulate streamed `thinking`, `content`, and `tool_calls` fields to maintain conversation history in multi-turn interactions)
(You MUST never assume a model is already pulled -- check with `ollama.list()` or handle errors from missing models gracefully)
Failure to follow these rules will produce unreliable, poorly-typed, or broken local LLM integrations.
</critical_reminders>