skilllibrary · context-management-memory

Install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/context-management-memory" ~/.claude/skills/merceralex397-collab-skilllibrary-context-management-memory && rm -rf "$T"
manifest: 11-ai-llm-runtime-and-integration/context-management-memory/SKILL.md
Source content

Purpose

Manage LLM context windows: token counting, budget allocation, message pruning, and context compression techniques.

When to use this skill

  • hitting context length limits and need to fit more content
  • implementing token counting for prompt budget management
  • designing message truncation or sliding window strategies
  • compressing context with summarization or extraction

Do not use this skill when

  • designing agent memory stores — prefer agent-memory
  • setting up vector databases — prefer embeddings-indexing
  • choosing which model to use — prefer model-selection

Procedure

  1. Count tokens accurately — use tiktoken for OpenAI models, model-specific tokenizers for others. Never estimate by word count.
  2. Set budget allocation — divide context: system prompt (fixed), retrieved context (variable), conversation history (sliding), generation headroom (reserved).
  3. Implement sliding window — keep last N messages. When over budget, remove oldest user/assistant pairs (keep system prompt).
  4. Summarize overflow — when pruning messages, summarize removed content into a single system message. Preserve key facts.
  5. Compress retrieved context — extract relevant sentences from documents instead of including full text. Use LLM extraction if needed; a cheap extraction sketch appears after the decision rules.
  6. Use structured prompts — JSON/YAML-structured prompts are more token-efficient than verbose natural language for data-heavy contexts.
  7. Monitor usage — log prompt_tokens and completion_tokens from API responses. Alert when consistently above 80% of the limit (see the sketch after this list).
  8. Test boundary cases — verify behavior at exactly max context length. Ensure graceful degradation, not crashes.
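
Step 7 needs no extra infrastructure, because the usage counts come back on every API response. The sketch below is a minimal example assuming the official OpenAI Python client (openai v1.x); the 128k limit, the 80% threshold, and the chat_with_usage_logging helper are illustrative names and values, not part of the library.

import logging

from openai import OpenAI

CONTEXT_LIMIT = 128_000   # gpt-4o context window
WARN_RATIO = 0.8          # alert when the prompt passes 80% of the limit

logger = logging.getLogger("token_usage")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat_with_usage_logging(messages, model="gpt-4o"):
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    logger.info("prompt_tokens=%d completion_tokens=%d total_tokens=%d",
                usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
    if usage.prompt_tokens > WARN_RATIO * CONTEXT_LIMIT:
        logger.warning("prompt is using %.0f%% of the context window",
                       100 * usage.prompt_tokens / CONTEXT_LIMIT)
    return response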

Token budget template

Model: gpt-4o (128k context)
  System prompt:         ~2,000 tokens (fixed)
  Retrieved documents:  ~10,000 tokens (variable)
  Conversation history: ~10,000 tokens (sliding window)
  User message:          ~1,000 tokens (current turn)
  Generation headroom:   ~4,000 tokens (reserved for response)
  Safety margin:         ~1,000 tokens (buffer)
  ---
  Total budget:         ~28,000 tokens used of 128k
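
The same allocation can live in code so that trimming and retrieval logic share one set of numbers. A minimal sketch of the template above as a frozen dataclass; the field names and the check method are illustrative, and the figures simply mirror the table rather than any model requirement.

from dataclasses import dataclass

@dataclass(frozen=True)
class TokenBudget:
    context_limit: int = 128_000   # gpt-4o
    system_prompt: int = 2_000     # fixed
    retrieved_docs: int = 10_000   # variable
    history: int = 10_000          # sliding window
    user_message: int = 1_000      # current turn
    generation: int = 4_000        # reserved for the response
    safety_margin: int = 1_000     # buffer

    @property
    def total(self) -> int:
        return (self.system_prompt + self.retrieved_docs + self.history
                + self.user_message + self.generation + self.safety_margin)

    def check(self) -> None:
        assert self.total <= self.context_limit, "budget exceeds context window"

BUDGET = TokenBudget()
BUDGET.check()  # 28,000 of 128,000 tokens allocated

Calling check() at startup catches the common failure where someone grows the system prompt or the retrieval allowance and quietly eats the generation headroom.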

Key patterns

import tiktoken

def count_tokens(messages, model="gpt-4o"):
    """Approximate token count for a chat message list."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # rough per-message framing overhead (role + delimiters)
        total += len(enc.encode(msg["content"]))
    return total + 2  # reply priming

def trim_to_budget(messages, max_tokens, model="gpt-4o"):
    """Drop oldest non-system messages until the list fits the budget."""
    while count_tokens(messages, model) > max_tokens and len(messages) > 2:
        messages.pop(1)  # remove oldest non-system message; messages[0] is the system prompt
    return messages
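
trim_to_budget above drops one message at a time, but steps 3-4 and the decision rules below call for pruning user/assistant pairs together and summarizing before discarding. A sketch of that combination, reusing count_tokens; the summarize argument is a placeholder callable for whatever summarizer you prefer (an LLM call or a cheaper extractor), not a specific API.

def prune_with_summary(messages, max_tokens, summarize, model="gpt-4o"):
    """Remove oldest user/assistant pairs until under budget, then fold
    the removed turns into a single system-level summary message."""
    removed = []
    # messages[0] is assumed to be the system prompt; keep it plus the latest turns
    while count_tokens(messages, model) > max_tokens and len(messages) > 3:
        removed.extend([messages.pop(1), messages.pop(1)])  # oldest user+assistant pair
    if removed:
        summary = summarize(removed)  # e.g. "Earlier, we discussed X, Y, Z"
        messages.insert(1, {"role": "system", "content": f"Summary of earlier turns: {summary}"})
    return messages

The summary message costs tokens of its own, so re-check the budget after inserting it when you are close to the cap.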

Decision rules

  • Reserve at least 10% of the context window for generation — starved output causes truncation.
  • Count tokens with the model's actual tokenizer — GPT-4 and Claude tokenize differently.
  • Prune user/assistant pairs together — orphaned messages confuse the model.
  • Summarize before discarding — "Earlier, we discussed X, Y, Z" preserves continuity.
  • Prefer shorter system prompts — they consume budget on every request.
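
Step 5 of the procedure, compressing retrieved context, can start well short of an LLM call. The sketch below is a crude baseline under stated assumptions: it scores sentences by word overlap with the query and keeps the highest-scoring ones under a token cap; production systems typically swap the scoring for an embedding model or LLM extraction.

import re

import tiktoken

def extract_relevant_sentences(query, document, max_tokens=1_000, model="gpt-4o"):
    """Keep the sentences that share the most words with the query,
    up to a token cap. A crude stand-in for embedding or LLM extraction."""
    enc = tiktoken.encoding_for_model(model)
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document)
    scored = sorted(sentences, reverse=True,
                    key=lambda s: len(query_words & set(re.findall(r"\w+", s.lower()))))
    kept, used = [], 0
    for sentence in scored:
        cost = len(enc.encode(sentence))
        if used + cost > max_tokens:
            break
        kept.append(sentence)
        used += cost
    kept.sort(key=sentences.index)  # restore document order so the excerpt reads coherently
    return " ".join(kept)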

References

Related skills

  • agent-memory — agent memory architecture
  • embeddings-indexing — retrieving relevant context from vector stores
  • model-selection — choosing models with appropriate context windows