Skilllibrary context-management-memory
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/context-management-memory" ~/.claude/skills/merceralex397-collab-skilllibrary-context-management-memory && rm -rf "$T"
manifest:
11-ai-llm-runtime-and-integration/context-management-memory/SKILL.md
Purpose
Manage LLM context windows: token counting, budget allocation, message pruning, and context compression techniques.
When to use this skill
- hitting context length limits and need to fit more content
- implementing token counting for prompt budget management
- designing message truncation or sliding window strategies
- compressing context with summarization or extraction
Do not use this skill when
- designing agent memory stores — prefer agent-memory
- setting up vector databases — prefer embeddings-indexing
- choosing which model to use — prefer model-selection
Procedure
- Count tokens accurately — use tiktoken for OpenAI models, model-specific tokenizers for others. Never estimate by word count.
- Set budget allocation — divide context: system prompt (fixed), retrieved context (variable), conversation history (sliding), generation headroom (reserved).
- Implement sliding window — keep last N messages. When over budget, remove oldest user/assistant pairs (keep system prompt).
- Summarize overflow — when pruning messages, summarize removed content into a single system message. Preserve key facts (see the sketch after this list).
- Compress retrieved context — extract relevant sentences from documents instead of including full text. Use LLM extraction if needed.
- Use structured prompts — JSON/YAML-structured prompts are more token-efficient than verbose natural language for data-heavy contexts.
- Monitor usage — log prompt_tokens and completion_tokens from API responses. Alert when consistently above 80% of limit.
- Test boundary cases — verify behavior at exactly max context length. Ensure graceful degradation, not crashes.
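A minimal sketch of the summarize-overflow step, assuming the openai Python SDK (v1+) and that the first message is the system prompt; the function name, prompt wording, and gpt-4o-mini choice are illustrative, not prescribed by this skill:

import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_and_prune(messages, n_oldest, model="gpt-4o-mini"):
    # Replace the n_oldest non-system messages with a single summary system message.
    system, history = messages[:1], messages[1:]
    pruned, kept = history[:n_oldest], history[n_oldest:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in pruned)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Summarize the key facts and decisions in this "
                              "conversation excerpt in under 150 tokens:\n" + transcript}],
    )
    summary = resp.choices[0].message.content
    return system + [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + kept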
Token budget template
Model: gpt-4o (128k context)
System prompt: ~2,000 tokens (fixed)
Retrieved documents: ~10,000 tokens (variable)
Conversation history: ~10,000 tokens (sliding window)
User message: ~1,000 tokens (current turn)
Generation headroom: ~4,000 tokens (reserved for response)
Safety margin: ~1,000 tokens (buffer)
---
Total budget: ~28,000 tokens used of 128k
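The same allocation can be expressed as a small configuration check run at startup; the BUDGET name and the figures below simply mirror the template and are illustrative:

CONTEXT_WINDOW = 128_000  # gpt-4o

BUDGET = {
    "system_prompt": 2_000,          # fixed
    "retrieved_documents": 10_000,   # variable
    "conversation_history": 10_000,  # sliding window
    "user_message": 1_000,           # current turn
    "generation_headroom": 4_000,    # reserved for the response
    "safety_margin": 1_000,          # buffer
}

assert sum(BUDGET.values()) <= CONTEXT_WINDOW, "budget exceeds the context window"
print(f"{sum(BUDGET.values()):,} of {CONTEXT_WINDOW:,} tokens allocated")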
Key patterns
import tiktoken

def count_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # message overhead
        total += len(enc.encode(msg["content"]))
    return total + 2  # reply priming

def trim_to_budget(messages, max_tokens, model="gpt-4o"):
    while count_tokens(messages, model) > max_tokens and len(messages) > 2:
        messages.pop(1)  # remove oldest non-system message
    return messages
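Example usage of the helpers above; the messages and the 10,000-token limit are illustrative (the limit matches the conversation-history line in the budget template):

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is a sliding window?"},
    {"role": "assistant", "content": "A fixed-size view over the most recent messages."},
    {"role": "user", "content": "How do I apply it to chat history?"},
]

print(count_tokens(messages))                           # tokens this request would consume
messages = trim_to_budget(messages, max_tokens=10_000)  # stay within the history budget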
Decision rules
- Reserve at least 10% of context window for generation — starved output causes truncation.
- Count tokens with the model's actual tokenizer — GPT-4 and Claude tokenize differently.
- Prune user/assistant pairs together — orphaned messages confuse the model (see the sketch after this list).
- Summarize before discarding — "Earlier, we discussed X, Y, Z" preserves continuity.
- Prefer shorter system prompts — they consume budget on every request.
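A hedged variant of trim_to_budget from Key patterns that applies the pair-pruning rule; trim_pairs_to_budget is an illustrative name, and it assumes the history alternates user/assistant after the system prompt:

def trim_pairs_to_budget(messages, max_tokens, model="gpt-4o"):
    # Drop the oldest user/assistant pair at a time; index 0 is the system prompt,
    # and the most recent user message is always kept.
    while count_tokens(messages, model) > max_tokens and len(messages) > 3:
        del messages[1:3]
    return messages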
References
Related skills
agent-memory — agent memory architecture
embeddings-indexing — retrieving relevant context from vector stores
model-selection — choosing models with appropriate context windows