Skilllibrary ollama
Configure, serve, and manage local LLMs with Ollama — write Modelfiles, pull/push models, set GPU layers and context windows, call chat/generate/embeddings API endpoints, and troubleshoot serving issues. Use when a task involves ollama serve, ollama run, Modelfile customization, or local inference API integration. Do not use for cloud-hosted LLM APIs, vLLM/TGI serving, or GGUF quantization without Ollama.
git clone https://github.com/merceralex397-collab/skilllibrary
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/ollama" ~/.claude/skills/merceralex397-collab-skilllibrary-ollama && rm -rf "$T"
11-ai-llm-runtime-and-integration/ollama/SKILL.md
Purpose
Use this skill to manage the full lifecycle of local LLM serving with Ollama — from authoring Modelfiles and pulling models through configuring GPU offload and context windows to integrating with the HTTP API for chat, completion, and embedding workloads.
When to use this skill
Use this skill when:
- writing or editing a `Modelfile` (FROM, PARAMETER, SYSTEM, TEMPLATE directives)
- pulling, pushing, copying, or listing models via `ollama pull`, `ollama push`, `ollama cp`, `ollama list`
- configuring `ollama serve` startup flags and environment variables (`OLLAMA_HOST`, `OLLAMA_NUM_PARALLEL`, `OLLAMA_MAX_LOADED_MODELS`)
- tuning GPU layer offload (`num_gpu`), context window (`num_ctx`), or batch size (`num_batch`)
- calling the Ollama REST API (`/api/chat`, `/api/generate`, `/api/embeddings`, `/api/tags`)
- integrating Ollama with application code via the official Python or JS client libraries
- debugging model load failures, OOM crashes, or slow first-token latency
Do not use this skill when
- the task targets cloud-hosted LLM APIs (OpenAI, Anthropic, Bedrock, Vertex AI)
- the task involves vLLM, TGI, or llama.cpp serving without Ollama wrapping
- the work is pure GGUF quantization or model conversion without Ollama deployment
- a narrower active skill already owns the specific problem
Operating procedure
- Identify the target model and hardware. Confirm the model name/tag (e.g., `llama3:8b-instruct-q4_K_M`), available VRAM, and RAM. Run `ollama list` to see which models are already downloaded locally.
- Author or update the Modelfile. Set `FROM` to the base model or GGUF path. Add `PARAMETER num_gpu <layers>` for GPU offload. Set `PARAMETER num_ctx <tokens>` for the context window. Define the `SYSTEM` prompt and `TEMPLATE` using Go template syntax with `{{ .System }}`, `{{ .Prompt }}`, and `{{ .Response }}` placeholders.
- Build and test the custom model. Run `ollama create <name> -f Modelfile`. Verify with `ollama run <name>` using a representative prompt. Check `ollama show <name> --modelfile` to confirm parameters persisted. (An end-to-end sketch of these first three steps follows this list.)
- Configure the server for production use. Set `OLLAMA_HOST=0.0.0.0:11434` for network access. Set `OLLAMA_NUM_PARALLEL` for concurrent request slots. Set `OLLAMA_MAX_LOADED_MODELS` based on available memory. Start with `ollama serve` or the systemd unit.
- Wire API integration. Use `POST /api/chat` with `{"model": "<name>", "messages": [...]}` for multi-turn chat. Use `POST /api/generate` for single-shot completions. Use `POST /api/embeddings` with `{"model": "<name>", "prompt": "<text>"}` for vector embeddings. Set `"stream": false` when you need the full response in one JSON object.
- Validate performance and resource usage. Measure time-to-first-token and tokens-per-second. Monitor VRAM usage with `nvidia-smi` or `ollama ps`. If OOM occurs, reduce `num_gpu` layers or switch to a smaller quantization.
- Document the configuration. Record the final Modelfile, environment variables, hardware specs, and measured throughput in the project README or deployment doc.
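The sketch below strings the first three steps together: it writes a minimal Modelfile, builds a named model, and runs a single smoke-test prompt. The base tag, layer count, and system prompt are illustrative assumptions rather than recommendations; substitute values that match your model and hardware.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values: pin an explicit quantization tag and pick your own name.
BASE_MODEL="llama3:8b-instruct-q4_K_M"   # assumed base tag
CUSTOM_NAME="support-assistant"          # hypothetical custom model name

# Minimal Modelfile: base model, context window, GPU offload, system prompt.
cat > Modelfile <<EOF
FROM ${BASE_MODEL}
PARAMETER num_ctx 4096
PARAMETER num_gpu 33
PARAMETER temperature 0.2
SYSTEM You are a concise assistant for internal support questions.
EOF

ollama pull "${BASE_MODEL}"                  # fetch the pinned base model
ollama create "${CUSTOM_NAME}" -f Modelfile  # build the custom model
ollama show "${CUSTOM_NAME}" --modelfile     # confirm parameters persisted
ollama run "${CUSTOM_NAME}" "Summarize what you can help with in one sentence."
```

If the smoke test answers, `ollama ps` should show the model resident and report how it is split between GPU and CPU.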
Decision rules
- Default to `num_ctx 4096` unless the use case requires longer context; KV-cache memory grows linearly with the context window, so large values are costly.
- Prefer `q4_K_M` quantization for balanced quality/speed; use `q5_K_M` when quality is critical and VRAM allows.
- Always pin model tags with explicit quantization suffixes; never rely on `latest`.
- Use `stream: true` for user-facing chat; use `stream: false` for programmatic extraction pipelines (see the request sketch after this list).
- If the model must fit in CPU-only RAM, set `num_gpu 0` and expect 5-10x slower inference.
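To make the streaming and context rules concrete, here is a minimal `curl` sketch against `/api/chat`. It assumes a model named `support-assistant` (a placeholder) is available on the default port; the `options` values are per-request overrides and can equally be set in the Modelfile.

```bash
# Non-streaming chat request: the full reply arrives as one JSON object.
curl -s http://localhost:11434/api/chat -d '{
  "model": "support-assistant",
  "messages": [
    {"role": "user", "content": "List three causes of slow first-token latency."}
  ],
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 256
  }
}'
```

With `"stream": true` the same endpoint instead returns newline-delimited JSON chunks, which is what a user-facing chat UI would consume token by token.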
Output requirements
- `Modelfile`: complete file with FROM, PARAMETER, SYSTEM, and TEMPLATE directives
- Server Configuration: environment variables and startup command
- API Integration Code: client code with endpoint, model name, and key parameters
- Performance Baseline: measured tokens/sec, VRAM usage, and context window tested
References
Read these only when relevant:
- references/modelfile-syntax.md
- references/ollama-api-endpoints.md
- references/gpu-memory-estimation.md
Related skills
llm-integration, local-llm, llama-cpp, vllm-serving
Anti-patterns
- Using `ollama run` in production scripts instead of the HTTP API: it breaks on concurrent requests.
- Setting `num_ctx` to the model's maximum without checking available VRAM: this causes silent OOM and fallback to CPU.
- Omitting the SYSTEM directive in a Modelfile and relying on per-request system prompts: this leads to inconsistent behavior across clients.
- Pulling models by bare name (`ollama pull llama3`) without specifying a quantization tag: this creates non-reproducible deployments (see the before/after sketch below).
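A short before/after sketch of the last two anti-patterns, reusing the same illustrative tag as above:

```bash
# Anti-pattern: floating tag plus the interactive CLI inside a script.
#   ollama pull llama3
#   echo "Summarize this deployment note." | ollama run llama3

# Preferred: pin the quantization tag and call the HTTP API instead.
ollama pull llama3:8b-instruct-q4_K_M
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3:8b-instruct-q4_K_M",
  "prompt": "Summarize this deployment note in one sentence.",
  "stream": false
}'
```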
Failure handling
- If `ollama serve` fails to start, check for port conflicts on 11434 and verify the `OLLAMA_MODELS` directory has write permissions (diagnostic commands below).
- If a model fails to load with OOM, reduce `num_gpu` by half and retry; if it still fails, switch to a smaller quantization variant.
- If `/api/chat` returns empty or truncated responses, verify `num_ctx` is large enough for the prompt and check that `num_predict` is not set to zero.
- If embedding requests return 404, confirm the model supports embeddings; not all chat models expose the embedding endpoint.
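When a serve or load failure is reported, a few read-only checks usually narrow it down. The sketch below assumes a Linux host where Ollama was installed with the standard systemd service, which may not match your environment.

```bash
# Is something already bound to the default port?
ss -ltnp | grep 11434 || echo "port 11434 is free"

# Is the models directory writable by the serving user? (default path assumed)
ls -ld "${OLLAMA_MODELS:-$HOME/.ollama/models}"

# What is loaded right now, and how is it split across GPU and CPU?
ollama ps
nvidia-smi --query-gpu=memory.used,memory.total --format=csv 2>/dev/null || true

# Recent server logs (systemd installs only).
journalctl -u ollama --no-pager -n 50 2>/dev/null || true
```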