Awesome-omni-skill context-engineering

Manages context window for cost, performance, and quality in Claude API workflows and agent sessions. Use when building multi-step agent pipelines, designing system prompts for repeated API calls, managing long coding sessions, or debugging degraded agent performance. Covers KV cache optimization, token budgeting, subagent delegation, compaction, attention mechanics, and the 200K pricing cliff.

install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/context-engineering" ~/.claude/skills/diegosouzapw-awesome-omni-skill-context-engineering-dee170 && rm -rf "$T"

manifest: skills/development/context-engineering/SKILL.md

source content

Context Engineering

Purpose

Every token sent to an LLM costs money, increases latency, and past ~32K tokens degrades model performance. Context engineering is the discipline of delivering the smallest possible set of high-signal tokens that achieves the desired outcome — across both individual sessions and production API pipelines.

The difference between a $0.50 query and a $5.00 query is context management, not prompt engineering.

When to Use

Agent pipeline design: Before building any multi-step Claude API workflow
System prompt architecture: When structuring prompts for repeated calls
Long coding sessions: Context approaching limits (70%+ utilization)
Degraded performance: Agent losing track mid-session
Cost review: When agent costs exceed expectations
Scale prep: Before moving from prototype to production volume

Core Concept

Context is everything available to the model at inference time: system prompts, tool definitions, retrieved documents, message history, and tool outputs (often 80%+ of total tokens).

The context window is not just a size limit; it is an attention budget that depletes as context grows. Models exhibit the "lost-in-the-middle" phenomenon: information at the beginning and end of the context gets strong attention, while information in the middle is more likely to be overlooked.

The Triple Penalty

Every unnecessary token hits you three ways:

Cost — Opus 4.6: $15/M input, $75/M output. Cached input: $1.875/M (8x cheaper)
Latency — More tokens = slower responses
Quality — Past 32K tokens, sharp performance degradation (context rot)

Output tokens cost 5x input tokens. This is the single most important cost fact.

Part 1: Prompt Architecture

Stable Prefixes for KV Cache Hits

The most important optimization for production agents.

LLMs process prompts token by token. If your prompt starts identically to a previous request, the model reuses cached key-value computations for that prefix. Cached tokens cost 8-10x less than uncached.

Prompt Layering (Always Follow This Order)

┌──────────────────────────────────────────────────────┐
│  LAYER 1: STATIC (cached across ALL calls)           │
│  - System instructions                               │
│  - Tool definitions (never add/remove dynamically)   │
│  - Skills (anti-slop, analysis frameworks)           │
│  - Few-shot examples                                 │
├──────────────────────────────────────────────────────┤
│  LAYER 2: SEMI-STATIC (cached per workspace)         │
│  - Workspace context (ICP, competitors)              │
│  - Content pillars, voice calibration                │
│  - Client-specific configuration                     │
├──────────────────────────────────────────────────────┤
│  LAYER 3: DYNAMIC (changes per call)                 │
│  - The actual content to analyze/process             │
│  - Current user request                              │
│  - Timestamps (date only, never seconds)             │
└──────────────────────────────────────────────────────┘

Cache Killers (Never Do These)

✗ Timestamps with seconds/milliseconds at prompt start
✗ Dynamic content before static content
✗ Randomly ordered tool definitions
✗ Non-deterministic JSON serialization (use sort_keys or consistent key ordering)
✗ Conditionally adding/removing tools between calls
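One way to satisfy the deterministic-serialization rule is to sort object keys recursively before serializing. The following is a minimal sketch, not a library API: stableStringify and sortKeys are hypothetical helper names, and the code assumes plain JSON-compatible values (no Dates, Maps, or cycles).

```typescript
// Hypothetical helper: serialize with object keys sorted at every level,
// so logically equal objects always produce byte-identical prompt text.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function sortKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map(sortKeys); // array order is preserved
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeys(value[key]);
    }
    return sorted;
  }
  return value;
}

function stableStringify(value: Json): string {
  return JSON.stringify(sortKeys(value));
}

// Same tool definition, different insertion order, identical output.
const a = stableStringify({ name: "search", params: { limit: 10, query: "x" } });
const b = stableStringify({ params: { query: "x", limit: 10 }, name: "search" });
console.log(a === b); // true
```

Serializing tool definitions and workspace context through a helper like this keeps the cached prefix identical even when upstream code builds the objects in a different order.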

Implementation

// GOOD: Stable prefix, dynamic content last.
// The Anthropic Messages API takes the system prompt as a top-level `system`
// array of text blocks, not as a message with role "system".
const goodRequest = {
  system: [
    { type: "text", text: STATIC_INSTRUCTIONS, cache_control: { type: "ephemeral" } },
    { type: "text", text: workspaceContext, cache_control: { type: "ephemeral" } }
  ],
  messages: [
    { role: "user", content: dynamicContent } // No cache control: this changes per call
  ]
};

// BAD: Timestamp at start changes the prefix on every call and kills all caching
const badRequest = {
  system: `Current time: ${new Date().toISOString()}\n\n${STATIC_INSTRUCTIONS}`,
  messages: [{ role: "user", content: dynamicContent }]
};

Cache TTL is 5 minutes (Anthropic) to 10 minutes (OpenAI). Including date is fine; including seconds guarantees zero cache hits. Route requests by session ID to maximize warm cache hits.
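The two rules in the paragraph above (date-only timestamps, routing by session ID) can be sketched as follows. This assumes you control request routing across a pool of workers; dateStamp and workerForSession are hypothetical names, and the FNV-1a hash is just one reasonable choice of stable hash.

```typescript
// Date-only stamp: stable for a whole day, so it does not change the prefix
// between calls the way a seconds-level timestamp would.
function dateStamp(now: Date): string {
  return now.toISOString().slice(0, 10); // "YYYY-MM-DD" (UTC)
}

// Hypothetical router: hash the session ID (FNV-1a, 32-bit) so every request
// from one session lands on the same worker and can reuse its warm cache.
function workerForSession(sessionId: string, workerCount: number): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < sessionId.length; i++) {
    hash ^= sessionId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash % workerCount;
}

const stamp = dateStamp(new Date("2025-01-15T10:30:45.123Z"));
console.log(stamp); // "2025-01-15"
console.log(workerForSession("session-42", 4) === workerForSession("session-42", 4)); // true
```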

Attention-Favored Positions

Place critical information where attention is strongest:

Content	Position	Why
System instructions	Beginning	Highest attention
Current task/goal	Beginning	Sets frame
Critical constraints	Beginning or end	Avoid middle dead zone
Historical context	Middle	Acceptable loss
Supporting documents	Middle	Lower priority
Recent conversation	End	Recency bias
Current objectives	End	Combats drift

The Todo Pattern

For multi-step agent workflows, maintain a current_objectives string appended to the end of context. This "recites" what the agent should be doing, preventing drift across long tool-call chains.

const contextSuffix = `
## Current Objectives
- Analyzing newsletter batch (3/8 complete)
- Priority: identify content opportunities for AI pillar
- Next: process remaining 5 items, then compile digest
`;
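In practice the suffix is usually regenerated from structured progress state on every turn rather than edited by hand. A minimal sketch, assuming a hypothetical Progress shape and helper names (renderObjectives, withObjectives) that are not part of any SDK:

```typescript
// Hypothetical progress state; field names are illustrative.
interface Progress {
  task: string;
  done: number;
  total: number;
  priority: string;
}

// Rebuild the suffix from state on every turn, so the stable prefix is
// untouched and the objectives always sit at the very end of the context.
function renderObjectives(p: Progress): string {
  const remaining = p.total - p.done;
  const next = remaining > 0
    ? `process remaining ${remaining} items, then compile digest`
    : "compile digest";
  return [
    "## Current Objectives",
    `- ${p.task} (${p.done}/${p.total} complete)`,
    `- Priority: ${p.priority}`,
    `- Next: ${next}`,
  ].join("\n");
}

function withObjectives(sections: string[], p: Progress): string {
  return [...sections, renderObjectives(p)].join("\n\n");
}

const context = withObjectives(
  ["<system + tools>", "<tool results so far>"],
  { task: "Analyzing newsletter batch", done: 3, total: 8, priority: "identify content opportunities for AI pillar" }
);
console.log(context.endsWith("- Next: process remaining 5 items, then compile digest")); // true
```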

Part 2: Token Reduction

Progressive Disclosure

Load information only when needed, not upfront. This mirrors human cognition — we don't memorize entire codebases, we use indexes to retrieve relevant information on demand.

Bad: Load all skill content at session start
Good: Load skill names/descriptions; full content loads on activation

Bad: Dump entire file contents into context
Good: Load file metadata first, then specific sections as needed
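The skill-loading case above can be sketched as a small index that exposes only names and descriptions up front and loads full content on first activation. SkillIndex and SkillEntry are hypothetical names for illustration; the load callback stands in for reading a SKILL.md file from disk.

```typescript
// Hypothetical skill registry: only name + description enter the prompt up
// front; the full body is loaded the first time a skill is activated.
interface SkillEntry {
  name: string;
  description: string;
  load: () => string; // stands in for reading the skill file from disk
}

class SkillIndex {
  private loaded = new Map<string, string>();
  constructor(private skills: SkillEntry[]) {}

  // Cheap listing that goes into the prompt at session start.
  summary(): string {
    return this.skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
  }

  // Full content, fetched on demand and memoized.
  activate(skillName: string): string {
    const cached = this.loaded.get(skillName);
    if (cached !== undefined) return cached;
    const skill = this.skills.find((s) => s.name === skillName);
    if (!skill) throw new Error(`Unknown skill: ${skillName}`);
    const body = skill.load();
    this.loaded.set(skillName, body);
    return body;
  }
}

let loads = 0;
const index = new SkillIndex([
  { name: "anti-slop", description: "Removes filler phrasing", load: () => { loads++; return "FULL ANTI-SLOP INSTRUCTIONS"; } },
]);
console.log(index.summary()); // "- anti-slop: Removes filler phrasing"
index.activate("anti-slop");
index.activate("anti-slop");
console.log(loads); // 1
```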

Store Tool Outputs in Filesystem, Not Context

Cursor's A/B testing showed this reduces total agent tokens by 46.9%.

Agents don't need complete information in the conversation. They need the ability to access information on demand.

import { writeFile } from "node:fs/promises";

const result = await searchDatabase(query);

// BAD: Stuffing tool output into conversation
messages.push({ role: "tool", content: JSON.stringify(result) }); // 5,000 tokens

// GOOD: Write to file, pass reference
const path = `/tmp/search-${id}.json`;
await writeFile(path, JSON.stringify(result));
messages.push({ role: "tool", content: `Results written to ${path}. ${result.length} items found. Use read_file to inspect specific items.` }); // 50 tokens

Apply to: shell command outputs, search results, API responses, newsletter content, enrichment data. Store full content, pass summary + file path.

Design Two-Phase Tools

A vague tool returns everything. A precise tool returns exactly what the agent needs.

// Phase 1: Lightweight search returns metadata only
async function searchContacts(query: string, filters: ContactFilters) {
  // Returns: [{ id, name, company, title, relevance_score }]
}

// Phase 2: Agent decides which contacts deserve full data
async function getContactDetail(contactId: string) {
  // Returns: Full enrichment, email history, research notes
}

Tool	Phase 1 (cheap)	Phase 2 (on demand)
Intel search	Titles + snippets + scores	Full article content
Contact lookup	Name + company + title	Full enrichment + research
Email search	Subject + sender + date	Full email body
Company research	Summary + signals	Full analysis report

Each filter parameter on Phase 1 reduces returned tokens by up to 10x.

Clean Data Before It Enters Context

Garbage tokens are still tokens. Preprocess aggressively before any LLM call.

function cleanForAnalysis(html: string): string {
  return pipeline(html, [
    stripUnsubscribeLinks,
    stripTrackingPixels,
    stripNavigationChrome,
    convertHtmlToMarkdown,  // Markdown uses ~50% fewer tokens than HTML
    collapseWhitespace,
    truncateToMaxWords(3000)
  ]);
}

Content type	Raw tokens	After cleaning	Savings
HTML email	~10,000	~2,000	80%
Web page	~25,000	~3,000	88%
SEC filing	~50,000	~5,000	90%
API response	~3,000	~500	83%

Rule: Convert HTML to Markdown before any LLM call. Always.
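The cleanForAnalysis snippet above relies on helpers it does not define. A minimal sketch of three of them (pipeline, collapseWhitespace, truncateToMaxWords) might look like this; the implementations are assumptions for illustration, and estimateTokens is a hypothetical rough heuristic (about 4 characters per token for English text), not a real tokenizer.

```typescript
type Step = (text: string) => string;

// Apply each step in order to the text.
function pipeline(input: string, steps: Step[]): string {
  return steps.reduce((text, step) => step(text), input);
}

// Collapse runs of spaces/tabs and squeeze repeated blank lines to one.
const collapseWhitespace: Step = (text) =>
  text.replace(/[ \t]+/g, " ").replace(/\n\s*\n+/g, "\n\n").trim();

// Keep at most maxWords whitespace-separated words.
const truncateToMaxWords = (maxWords: number): Step => (text) => {
  const words = text.split(/\s+/).filter(Boolean);
  return words.length <= maxWords ? text : words.slice(0, maxWords).join(" ");
};

// Very rough token estimate; use a real tokenizer when accuracy matters.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

const raw = "Hello    world\n\n\n\nthis   is   a    test";
const cleaned = pipeline(raw, [collapseWhitespace, truncateToMaxWords(4)]);
console.log(cleaned); // "Hello world this is"
console.log(estimateTokens(raw) > estimateTokens(cleaned)); // true
```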

Templates Over Regeneration

Output tokens cost 5x input tokens. Stop regenerating the same patterns from scratch.

// BAD: Regenerate from scratch every time ($0.50 per email)
const fullEmail = await claude.generate(`Write a cold outreach email to ${prospect.name}...`);

// GOOD: Load template, fill personalization ($0.05 per email)
const template = await loadTemplate("cold-outreach-v2");
const hooks = await claude.generate(
  `Given this research on ${prospect.name}, generate personalization hooks.
   Return JSON only: { hooks: string[], opener: string, relevance: string }`,
  { max_tokens: 200 }
);
const email = applyTemplate(template, hooks);

CAIO applications: outreach sequences, intel analysis (structured JSON, not prose), content drafts (pillar framework + unique parts only), client reports.
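The applyTemplate call in the example above is not defined in the source. One plausible shape, assuming {{placeholder}} syntax and the hooks JSON requested in the prompt, is sketched below; the Hooks interface and fillPlaceholders helper are hypothetical.

```typescript
// Hypothetical shape; field names follow the JSON requested in the prompt.
interface Hooks {
  hooks: string[];
  opener: string;
  relevance: string;
}

// Replace {{name}} placeholders with values. Unknown placeholders are left
// intact so a missing field is visible instead of silently dropped.
function fillPlaceholders(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    key in values ? values[key] : match
  );
}

function applyTemplate(template: string, h: Hooks): string {
  return fillPlaceholders(template, {
    opener: h.opener,
    relevance: h.relevance,
    hooks: h.hooks.join(" "),
  });
}

const outreachTemplate = "{{opener}}\n\n{{relevance}} {{hooks}}\n\nBest,\n{{sender}}";
const filledEmail = applyTemplate(outreachTemplate, {
  hooks: ["Saw your talk on agent costs."],
  opener: "Hi Dana,",
  relevance: "We help teams cut LLM spend.",
});
console.log(filledEmail);
```

The model only generates the short personalization fields; the fixed template text never passes through output tokens.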

Part 3: Model & Cost Optimization

Subagent Delegation Matrix

Not every task needs your most expensive model. The orchestrator should see condensed results, not raw context.

Task	Model	Why
Data extraction	Haiku	Structured, low-judgment
Classification (reply type, intent)	Haiku	Binary/categorical decisions
Summarization	Sonnet	Needs comprehension, not creativity
Email personalization	Sonnet	Needs quality, not genius
ICP scoring	Sonnet	Structured rubric, moderate judgment
Newsletter analysis	Sonnet	Comprehension + extraction
Content drafting	Opus	Needs voice, judgment, creativity
Strategic analysis	Opus	Needs reasoning, synthesis
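The matrix above can be encoded as a typed lookup so that model choice is a reviewed configuration decision rather than something scattered across call sites. A minimal sketch; the task keys and the pickModel name are assumptions chosen to mirror the table rows.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

type Task =
  | "data_extraction"
  | "classification"
  | "summarization"
  | "email_personalization"
  | "icp_scoring"
  | "newsletter_analysis"
  | "content_drafting"
  | "strategic_analysis";

// One entry per row of the delegation matrix.
const MODEL_FOR_TASK: Record<Task, Tier> = {
  data_extraction: "haiku",
  classification: "haiku",
  summarization: "sonnet",
  email_personalization: "sonnet",
  icp_scoring: "sonnet",
  newsletter_analysis: "sonnet",
  content_drafting: "opus",
  strategic_analysis: "opus",
};

function pickModel(task: Task): Tier {
  return MODEL_FOR_TASK[task];
}

console.log(pickModel("classification"));     // "haiku"
console.log(pickModel("strategic_analysis")); // "opus"
```

Because Record<Task, Tier> must cover every task, adding a new task type without assigning a tier becomes a compile-time error.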

Context Isolation Pattern

// BAD: One agent accumulates everything
for (const newsletter of newsletters) {
  agent.addContext(newsletter.fullText); // Context grows with every item
}

// GOOD: Subagents with isolated context
async function analyzeNewsletter(newsletter: Newsletter) {
  return new Agent({
    model: "sonnet",
    systemPrompt: ANALYSIS_INSTRUCTIONS, // Cached across calls
    context: newsletter.fullText          // Isolated per call
  }).analyze(); // Returns condensed JSON
}

const analyses = await Promise.all(newsletters.map(analyzeNewsletter));

Design subagent tasks for single-turn completion. More iterations = more context accumulation.

Output Token Budgeting

Set task-appropriate max_tokens. Don't leave it unlimited.

const TASK_LIMITS: Record<string, number> = {
  classification: 50,
  extraction: 200,
  personalization: 300,
  short_answer: 500,
  analysis: 2000,
  content_draft: 3000,
  code_generation: 4000,
};

Default to JSON output for agent-to-agent communication. Prose for human-facing content only.

// PROSE (expensive): "The company's revenue was 94.5 billion dollars,
// representing a 12.3 percent increase year-over-year..."

// JSON (cheap): {"revenue": 94.5, "unit": "B", "yoy_change": 12.3}

The 200K Pricing Cliff

Crossing 200K input tokens doubles the input rate and raises the output rate by 50%. This is a cliff, not a gradient.

Model	Under 200K	Over 200K
Opus input	$15/M	$30/M
Opus output	$75/M	$112.50/M
Sonnet input	$3/M	$6/M
Sonnet output	$15/M	$22.50/M

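class ContextBudget {
  private cumulativeTokens = 0;
  private readonly CLIFF = 180_000; // Buffer before 200K

  // compactContext is supplied by the caller. It summarizes the conversation
  // and resolves to the token count of the compacted context.
  constructor(private compactContext: () => Promise<number>) {}

  async addToolResult(tokenCount: number) {
    this.cumulativeTokens += tokenCount;
    if (this.cumulativeTokens > this.CLIFF) {
      // Reset to the compacted size so the trigger does not fire on every later call
      this.cumulativeTokens = await this.compactContext();
    }
  }
}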

The cost of a compression step is far less than paying the over-200K rate for the rest of the conversation.
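To see the size of the cliff, here is a worked example using the Sonnet rates from the table above. It is a sketch under one stated assumption: once input exceeds 200K tokens, the higher rate applies to the whole request. requestCost is a hypothetical helper name.

```typescript
// Rates in USD per million tokens, copied from the Sonnet rows of the table.
interface Rates {
  input: number;
  output: number;
}

const SONNET: { under: Rates; over: Rates } = {
  under: { input: 3, output: 15 },
  over: { input: 6, output: 22.5 },
};
const CLIFF_TOKENS = 200_000;

// Assumption: crossing the threshold reprices every token in the request.
function requestCost(inputTokens: number, outputTokens: number): number {
  const rates = inputTokens > CLIFF_TOKENS ? SONNET.over : SONNET.under;
  return (inputTokens * rates.input + outputTokens * rates.output) / 1_000_000;
}

console.log(requestCost(199_000, 2_000).toFixed(3)); // "0.627"
console.log(requestCost(201_000, 2_000).toFixed(3)); // "1.251"
```

Under that assumption, 2,000 extra input tokens roughly double the cost of the call, which is why the budget class above compacts at 180K rather than waiting for 200K.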

Parallel Tool Calls

Every sequential tool call re-sends full conversation context. Parallel calls reduce round trips.

// BAD: 3 sequential calls = 3x context transmission
const company = await getCompany(id);
const contacts = await getContacts(id);
const deals = await getDeals(id);

// GOOD: Model requests multiple tool calls in single response
// Design tools so independent operations can be batched

Fewer round trips = less context accumulation = cheaper and faster.
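On the application side, supporting parallel calls means executing every tool call from a single model response concurrently and returning all results in one turn. The sketch below uses hypothetical ToolCall and ToolResult shapes that are loosely modeled on tool-use blocks and not tied to any specific SDK; the three tools are stubs.

```typescript
// Hypothetical shapes for illustration only.
interface ToolCall { id: string; name: string; input: { id: string } }
interface ToolResult { toolCallId: string; content: string }
type Tool = (input: { id: string }) => Promise<string>;

// Run every tool call from one model response concurrently and return the
// results in request order, so they can be sent back in a single turn.
async function runToolCalls(calls: ToolCall[], tools: Record<string, Tool>): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call) => {
      const tool = tools[call.name];
      if (!tool) return { toolCallId: call.id, content: `Unknown tool: ${call.name}` };
      return { toolCallId: call.id, content: await tool(call.input) };
    })
  );
}

// Stub tools standing in for getCompany / getContacts / getDeals.
const stubTools: Record<string, Tool> = {
  getCompany: async ({ id }) => `company:${id}`,
  getContacts: async ({ id }) => `contacts:${id}`,
  getDeals: async ({ id }) => `deals:${id}`,
};

const pendingResults = runToolCalls(
  [
    { id: "t1", name: "getCompany", input: { id: "acme" } },
    { id: "t2", name: "getContacts", input: { id: "acme" } },
    { id: "t3", name: "getDeals", input: { id: "acme" } },
  ],
  stubTools
);
pendingResults.then((r) => console.log(r.map((x) => x.content).join(", ")));
// company:acme, contacts:acme, deals:acme
```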

Application-Level Response Caching

The cheapest token is the one you never send to the API.

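class AnalysisCache {
  // In-memory Map for illustration; swap in Redis or similar for production
  private cache = new Map<string, { value: Analysis; storedAt: number }>();
  constructor(private maxAgeMs = 7 * 24 * 60 * 60 * 1000) {} // Assumed default: refresh weekly

  private isStale(entry: { storedAt: number }) {
    return Date.now() - entry.storedAt > this.maxAgeMs;
  }

  async getOrAnalyze(contentHash: string, analyzer: () => Promise<Analysis>) {
    const cached = this.cache.get(contentHash);
    if (cached && !this.isStale(cached)) return cached.value; // $0.00
    const result = await analyzer();
    this.cache.set(contentHash, { value: result, storedAt: Date.now() });
    return result;
  }
}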

Good cache candidates: newsletter analysis, company research (refresh weekly), ICP scores (refresh on config change), template outputs.

Bad cache candidates: content drafts (should vary), personalization hooks (context-dependent), reply classification (unique per reply).

Part 4: Session Management

Context Budget Allocation

For long-running agent sessions, design with explicit budgets:

Component	Typical %	Notes
System prompt	5-10%	Stable, loads once
Tool definitions	10-15%	Stable across session
Retrieved docs	20-30%	Dynamic, load on demand
Message history	30-40%	Grows over session
Tool outputs	Variable	Can dominate if unchecked
Reserved buffer	10%	Always keep headroom

Compaction Triggers

Monitor and act when:

70% utilization: Consider compaction
80% utilization: Actively compress
90% utilization: Critical — summarize aggressively
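The three thresholds above map directly onto a small decision function. A minimal sketch, with compactionAction as a hypothetical name and the window size supplied by the caller:

```typescript
type CompactionAction = "none" | "consider" | "compress" | "critical";

// Thresholds follow the 70/80/90% rule. windowTokens is the model's context
// window size and is supplied by the caller.
function compactionAction(usedTokens: number, windowTokens: number): CompactionAction {
  const utilization = usedTokens / windowTokens;
  if (utilization >= 0.9) return "critical";
  if (utilization >= 0.8) return "compress";
  if (utilization >= 0.7) return "consider";
  return "none";
}

console.log(compactionAction(100_000, 200_000)); // "none"
console.log(compactionAction(150_000, 200_000)); // "consider"
console.log(compactionAction(165_000, 200_000)); // "compress"
console.log(compactionAction(185_000, 200_000)); // "critical"
```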

Compaction Priority Order

When approaching limits, compress in this priority:

Tool outputs — Summarize findings/metrics, remove raw data
Old conversation turns — Distill to decisions and commitments
Retrieved documents — Extract key facts only
NEVER compress — System prompts, active task context

Observation Masking

Tool outputs served their purpose once the decision was made. Replace verbose outputs with compact references:

[Obs:ref_123 elided. Key: 15 files found, 3 with errors]

Never mask: critical observations, recent turns (last 2-3), active reasoning chains. Always mask: repeated outputs, boilerplate headers/footers, already-summarized content.
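A minimal sketch of the masking step, assuming a hypothetical Turn shape in which each tool output carries a reference ID and a short summary recorded when it was first processed. The maskObservations name and the default of keeping the last 3 turns are illustrative choices based on the rule above.

```typescript
// Hypothetical message shape; `summary` is a short digest recorded when the
// tool output was first processed.
interface Turn {
  role: "user" | "assistant" | "tool";
  content: string;
  ref?: string;
  summary?: string;
}

// Replace tool outputs older than the last `keepRecent` turns with a compact
// reference. Recent turns and non-tool turns are left untouched.
function maskObservations(turns: Turn[], keepRecent = 3): Turn[] {
  const cutoff = turns.length - keepRecent;
  return turns.map((turn, i) => {
    if (i >= cutoff || turn.role !== "tool" || !turn.ref) return turn;
    return { ...turn, content: `[Obs:${turn.ref} elided. Key: ${turn.summary ?? "n/a"}]` };
  });
}

const sessionTurns: Turn[] = [
  { role: "tool", content: "(long raw listing of 15 files)", ref: "ref_123", summary: "15 files found, 3 with errors" },
  { role: "assistant", content: "Three files have errors; fixing them." },
  { role: "user", content: "Go ahead." },
  { role: "assistant", content: "Fixed the first file." },
];
const masked = maskObservations(sessionTurns);
console.log(masked[0].content); // "[Obs:ref_123 elided. Key: 15 files found, 3 with errors]"
```

The function returns a new array and leaves the original turns intact, so the full outputs can still be written to the filesystem before they are dropped from context.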

Server-Side Compaction

For long-running sessions (50+ tool calls), use Anthropic's server-side compaction:

// NOTE: context_window_strategy and its fields are illustrative. Verify the
// exact parameter names against Anthropic's current context-management documentation.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 4096,
  context_window_strategy: {
    type: "summarize",
    trigger_tokens: 150_000, // Before 200K cliff
    preserve_instructions: "Preserve all numerical data, company names, contact details, and analytical conclusions. Discard raw tool outputs and intermediate reasoning."
  },
  messages: conversationHistory
});

Stack with prompt caching: add cache breakpoints on system prompt so it stays cached when compaction occurs.

Part 5: Integration with Harness

The harness implements several context-engineering patterns:

Skills system: Progressive disclosure of capability knowledge
Phase-based development: Limits scope to manageable chunks
Ticket system: Focused task context per work item
Build log: Offloads historical context to filesystem
Subagent definitions: Isolated context per agent type

Filesystem as Extended Memory

progress/
  build-log.md      # Session history
  decisions/        # Key decisions made
  context-cache/    # Temporary context storage

Before summarizing away context, write it to a file. Reference the file path instead of keeping content in context.

Cost Estimation Worksheet

Use when speccing new agent features:

Feature: [name]
Calls per task: [N]
Model: [opus/sonnet/haiku]

Input tokens per call:
  System prompt (cached):     _____ tokens × $1.875/M = $_____
  Workspace context (cached): _____ tokens × $1.875/M = $_____
  Dynamic content (uncached): _____ tokens × $15/M    = $_____

Output tokens per call:
  Expected output:            _____ tokens × $75/M    = $_____

Per-task cost: $_____ × [N calls] = $_____
Monthly volume: _____ tasks × $_____ = $_____

Optimization targets:
  Cache hit rate:              ____%
  Subagent delegation savings: ____%
  Template savings:            ____%
  Optimized monthly cost:      $_____
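The worksheet arithmetic can be automated with a few lines. This sketch uses the Opus rates exactly as they appear in the worksheet; the CallProfile shape, the function names, and the example feature (8K cached prefix, 2K dynamic input, 500 output tokens, 3 calls per task, 200 tasks per month) are hypothetical.

```typescript
// Opus rates in USD per million tokens, as used in the worksheet.
const RATE = { cachedInput: 1.875, input: 15, output: 75 };

interface CallProfile {
  cachedInputTokens: number;  // system prompt + workspace context
  dynamicInputTokens: number; // uncached content
  outputTokens: number;
}

function costPerCall(p: CallProfile): number {
  return (
    p.cachedInputTokens * RATE.cachedInput +
    p.dynamicInputTokens * RATE.input +
    p.outputTokens * RATE.output
  ) / 1_000_000;
}

function monthlyCost(p: CallProfile, callsPerTask: number, tasksPerMonth: number): number {
  return costPerCall(p) * callsPerTask * tasksPerMonth;
}

// Hypothetical feature used only to exercise the arithmetic.
const profile: CallProfile = { cachedInputTokens: 8_000, dynamicInputTokens: 2_000, outputTokens: 500 };
console.log(costPerCall(profile).toFixed(4));         // "0.0825"
console.log(monthlyCost(profile, 3, 200).toFixed(2)); // "49.50"
```

In this example the 500 output tokens cost more than the 10,000 input tokens combined, which is the practical consequence of the 5x output multiplier.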

CAIO Reference Costs (Optimized)

Workflow	Calls	Est. Cost/task	Monthly (CAIO)
Newsletter analysis	1	$0.02	$1.60 (80/mo)
Content draft	1	$0.05	$2.00 (40/mo)
Weekly digest	1	$0.10	$0.40 (4/mo)
Prospect research	3	$0.08	$16.00 (200/mo)
Email personalization	1	$0.03	$6.00 (200/mo)
Reply classification	1	$0.01	$2.00 (200/mo)

Pre-Build Checklist

Before implementing any agent workflow:

□ System prompt uses stable prefix pattern (static → semi-static → dynamic)
□ Tool definitions are static (no conditional add/remove)
□ JSON serialization is deterministic
□ Tool outputs go to filesystem, not conversation context
□ Tools use two-phase pattern (search → retrieve)
□ Each subtask assigned to cheapest capable model
□ Output token limits set per task type
□ Data cleaned before entering context (HTML → Markdown minimum)
□ Application-level caching for repeated queries
□ Context budget tracks cumulative tokens vs 200K cliff
□ Independent tool calls designed for parallel execution
□ Critical information placed at beginning or end, never middle
□ Compaction triggers set at 70/80/90% utilization
□ Cost estimate completed with monthly projections

Guidelines Summary

Treat context as finite with diminishing returns
Stable prefixes first — this is the #1 cost lever
Output tokens cost 5x input — budget accordingly
Use progressive disclosure — defer loading until needed
Store tool outputs in filesystem, not conversation
Delegate to cheapest capable model
Monitor utilization and trigger compaction at 70-80%
Place critical information at beginning and end, never middle
Clean all data (HTML → Markdown) before it enters context
The 200K cliff doubles your input rate and raises output by 50%; stay under it
Cache at the application level — cheapest token is one you never send
Design for graceful degradation, not avoidance

References

Fintool: The Context Tax — Production cost optimization patterns
Manus: Context Engineering — KV cache and stable prefix architecture
Agent-Skills-for-Context-Engineering
Anthropic's context window research
"Lost in the Middle" attention mechanics studies