Skilllibrary ollama
Configure, serve, and manage local LLMs with Ollama — write Modelfiles, pull/push models, set GPU layers and context windows, call chat/generate/embeddings API endpoints, and troubleshoot serving issues. Use when a task involves ollama serve, ollama run, Modelfile customization, or local inference API integration. Do not use for cloud-hosted LLM APIs, vLLM/TGI serving, or GGUF quantization without Ollama.
git clone https://github.com/merceralex397-collab/skilllibrary
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/ollama" ~/.claude/skills/merceralex397-collab-skilllibrary-ollama && rm -rf "$T"
11-ai-llm-runtime-and-integration/ollama/SKILL.md
Purpose
Use this skill to manage the full lifecycle of local LLM serving with Ollama — from authoring Modelfiles and pulling models through configuring GPU offload and context windows to integrating with the HTTP API for chat, completion, and embedding workloads.
When to use this skill
Use this skill when:
- writing or editing a `Modelfile` (FROM, PARAMETER, SYSTEM, TEMPLATE directives)
- pulling, pushing, copying, or listing models via `ollama pull`, `ollama push`, `ollama cp`, `ollama list`
- configuring `ollama serve` startup flags and environment variables (`OLLAMA_HOST`, `OLLAMA_NUM_PARALLEL`, `OLLAMA_MAX_LOADED_MODELS`)
- tuning GPU layer offload (`num_gpu`), context window (`num_ctx`), or batch size (`num_batch`)
- calling the Ollama REST API (`/api/chat`, `/api/generate`, `/api/embeddings`, `/api/tags`)
- integrating Ollama with application code via the official Python or JS client libraries
- debugging model load failures, OOM crashes, or slow first-token latency
Do not use this skill when
- the task targets cloud-hosted LLM APIs (OpenAI, Anthropic, Bedrock, Vertex AI)
- the task involves vLLM, TGI, or llama.cpp serving without Ollama wrapping
- the work is pure GGUF quantization or model conversion without Ollama deployment
- a narrower active skill already owns the specific problem
Operating procedure
- Identify the target model and hardware. Confirm the model name/tag (e.g., `llama3:8b-instruct-q4_K_M`), available VRAM, and RAM. Run `ollama list` to see which models are already downloaded locally.
- Author or update the Modelfile. Set `FROM` to the base model or GGUF path. Add `PARAMETER num_gpu <layers>` for GPU offload. Set `PARAMETER num_ctx <tokens>` for the context window. Define the `SYSTEM` prompt and `TEMPLATE` using Go template syntax with `{{ .System }}`, `{{ .Prompt }}`, and `{{ .Response }}` placeholders.
- Build and test the custom model. Run `ollama create <name> -f Modelfile`. Verify with `ollama run <name>` using a representative prompt. Check `ollama show <name> --modelfile` to confirm parameters persisted. (An end-to-end sketch of these first three steps follows this list.)
- Configure the server for production use. Set `OLLAMA_HOST=0.0.0.0:11434` for network access. Set `OLLAMA_NUM_PARALLEL` for concurrent request slots. Set `OLLAMA_MAX_LOADED_MODELS` based on available memory. Start with `ollama serve` or the systemd unit.
- Wire API integration. Use `POST /api/chat` with `{"model": "<name>", "messages": [...]}` for multi-turn chat. Use `POST /api/generate` for single-shot completions. Use `POST /api/embeddings` with `{"model": "<name>", "prompt": "<text>"}` for vector embeddings. Set `"stream": false` when you need the full response in one JSON object.
- Validate performance and resource usage. Measure time-to-first-token and tokens-per-second. Monitor VRAM usage with `nvidia-smi` or `ollama ps`. If OOM occurs, reduce `num_gpu` layers or switch to a smaller quantization.
- Document the configuration. Record the final Modelfile, environment variables, hardware specs, and measured throughput in the project README or deployment doc.
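The sketch below strings the first three steps together: it writes a minimal Modelfile, builds a named model, and runs a single smoke-test prompt. The base tag, layer count, and system prompt are illustrative assumptions rather than recommendations; substitute values that match your model and hardware.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values: pin an explicit quantization tag and pick your own name.
BASE_MODEL="llama3:8b-instruct-q4_K_M"   # assumed base tag
CUSTOM_NAME="support-assistant"          # hypothetical custom model name

# Minimal Modelfile: base model, context window, GPU offload, system prompt.
cat > Modelfile <<EOF
FROM ${BASE_MODEL}
PARAMETER num_ctx 4096
PARAMETER num_gpu 33
PARAMETER temperature 0.2
SYSTEM You are a concise assistant for internal support questions.
EOF

ollama pull "${BASE_MODEL}"                  # fetch the pinned base model
ollama create "${CUSTOM_NAME}" -f Modelfile  # build the custom model
ollama show "${CUSTOM_NAME}" --modelfile     # confirm parameters persisted
ollama run "${CUSTOM_NAME}" "Summarize what you can help with in one sentence."
```

If the smoke test answers, `ollama ps` should show the model resident and report how it is split between GPU and CPU.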
Decision rules
- Default to `num_ctx 4096` unless the use case requires longer context; KV-cache memory grows linearly with the context window, so large values are costly.
- Prefer `q4_K_M` quantization for balanced quality/speed; use `q5_K_M` when quality is critical and VRAM allows.
- Always pin model tags with explicit quantization suffixes; never rely on `latest`.
- Use `stream: true` for user-facing chat; use `stream: false` for programmatic extraction pipelines (see the request sketch after this list).
- If the model must fit in CPU-only RAM, set `num_gpu 0` and expect 5-10x slower inference.
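To make the streaming and context rules concrete, here is a minimal `curl` sketch against `/api/chat`. It assumes a model named `support-assistant` (a placeholder) is available on the default port; the `options` values are per-request overrides and can equally be set in the Modelfile.

```bash
# Non-streaming chat request: the full reply arrives as one JSON object.
curl -s http://localhost:11434/api/chat -d '{
  "model": "support-assistant",
  "messages": [
    {"role": "user", "content": "List three causes of slow first-token latency."}
  ],
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 256
  }
}'
```

With `"stream": true` the same endpoint instead returns newline-delimited JSON chunks, which is what a user-facing chat UI would consume token by token.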
Output requirements
- `Modelfile`: complete file with FROM, PARAMETER, SYSTEM, and TEMPLATE directives
- Server Configuration: environment variables and startup command
- API Integration Code: client code with endpoint, model name, and key parameters
- Performance Baseline: measured tokens/sec, VRAM usage, and context window tested
References
Read these only when relevant:
- references/modelfile-syntax.md
- references/ollama-api-endpoints.md
- references/gpu-memory-estimation.md
Related skills
llm-integration, local-llm, llama-cpp, vllm-serving
Anti-patterns
- Using `ollama run` in production scripts instead of the HTTP API: it breaks on concurrent requests.
- Setting `num_ctx` to the model's maximum without checking available VRAM: this causes silent OOM and fallback to CPU.
- Omitting the SYSTEM directive in a Modelfile and relying on per-request system prompts: this leads to inconsistent behavior across clients.
- Pulling models by bare name (`ollama pull llama3`) without specifying a quantization tag: this creates non-reproducible deployments (see the before/after sketch below).
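A short before/after sketch of the last two anti-patterns, reusing the same illustrative tag as above:

```bash
# Anti-pattern: floating tag plus the interactive CLI inside a script.
#   ollama pull llama3
#   echo "Summarize this deployment note." | ollama run llama3

# Preferred: pin the quantization tag and call the HTTP API instead.
ollama pull llama3:8b-instruct-q4_K_M
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3:8b-instruct-q4_K_M",
  "prompt": "Summarize this deployment note in one sentence.",
  "stream": false
}'
```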
Failure handling
- If `ollama serve` fails to start, check for port conflicts on 11434 and verify the `OLLAMA_MODELS` directory has write permissions (diagnostic commands below).
- If a model fails to load with OOM, reduce `num_gpu` by half and retry; if it still fails, switch to a smaller quantization variant.
- If `/api/chat` returns empty or truncated responses, verify `num_ctx` is large enough for the prompt and check that `num_predict` is not set to zero.
- If embedding requests return 404, confirm the model supports embeddings; not all chat models expose the embedding endpoint.
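When a serve or load failure is reported, a few read-only checks usually narrow it down. The sketch below assumes a Linux host where Ollama was installed with the standard systemd service, which may not match your environment.

```bash
# Is something already bound to the default port?
ss -ltnp | grep 11434 || echo "port 11434 is free"

# Is the models directory writable by the serving user? (default path assumed)
ls -ld "${OLLAMA_MODELS:-$HOME/.ollama/models}"

# What is loaded right now, and how is it split across GPU and CPU?
ollama ps
nvidia-smi --query-gpu=memory.used,memory.total --format=csv 2>/dev/null || true

# Recent server logs (systemd installs only).
journalctl -u ollama --no-pager -n 50 2>/dev/null || true
```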