install
source · Clone the upstream repo
git clone https://github.com/Intense-Visions/harness-engineering
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/codex/otel-sampling-strategies" ~/.claude/skills/intense-visions-harness-engineering-otel-sampling-strategies-4472f3 && rm -rf "$T"
manifest:
agents/skills/codex/otel-sampling-strategies/SKILL.md
OpenTelemetry Sampling Strategies
Control trace volume and costs with head sampling, tail sampling, and priority-based strategies
When to Use
- Trace volume exceeds backend capacity or budget
- High-traffic services generating millions of traces per hour
- Needing to keep 100% of error traces while sampling normal traffic
- Balancing observability coverage with storage costs
Instructions
- Start with AlwaysOn in development and staging. Add sampling only when volume is a problem.
- Head sampling (SDK-level): decide whether to record at trace start. Simple but cannot consider outcome.
- Tail sampling (Collector-level): decide whether to export after the trace is complete. Can keep all errors and sample successes.
- Use TraceIdRatioBasedSampler for probabilistic head sampling.
- Use ParentBasedSampler to respect upstream sampling decisions (if the parent was sampled, the child should be too).
- For tail sampling, deploy the OpenTelemetry Collector with the tail_sampling processor.
    // Head sampling: SDK-level
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import {
      TraceIdRatioBasedSampler,
      ParentBasedSampler,
      AlwaysOnSampler,
    } from '@opentelemetry/sdk-trace-base';

    // Sample 10% of traces, but respect parent decisions
    const sampler = new ParentBasedSampler({
      root: new TraceIdRatioBasedSampler(0.1), // 10% of root spans
      // If parent was sampled, always sample child
      // If parent was not sampled, never sample child
    });

    const sdk = new NodeSDK({
      sampler,
      // ...
    });
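Ratio-based head sampling stays consistent across services because the decision can be a pure function of the trace ID: every service that sees the same ID reaches the same answer without coordination. The sketch below illustrates that idea with a hypothetical `sampleByTraceId` helper. It is a simplified illustration, not the SDK's actual algorithm, and the trace ID is an arbitrary example value.

```typescript
// Simplified illustration: derive a stable keep/drop decision from a trace ID.
// `sampleByTraceId` is a hypothetical helper, not part of any OpenTelemetry package.
function sampleByTraceId(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace when its bucket falls below ratio * 2^32.
  return bucket < ratio * 0x100000000;
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
// Same ID and same ratio always give the same decision, on any service.
const first = sampleByTraceId(exampleTraceId, 0.1);
const second = sampleByTraceId(exampleTraceId, 0.1);
console.log(first === second); // true
```

Because the decision depends only on the ID and the ratio, two services configured with the same ratio keep or drop the same traces.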
    // Custom head sampler: drop health checks, always keep payment routes,
    // and ratio-sample everything else. A head sampler runs at span start,
    // so it cannot see errors or latency; use tail sampling for those.
    import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
    import {
      Sampler,
      SamplingDecision,
      SamplingResult,
      TraceIdRatioBasedSampler,
    } from '@opentelemetry/sdk-trace-base';

    class PrioritySampler implements Sampler {
      // Typed as Sampler so the full argument list can be forwarded.
      private ratioSampler: Sampler = new TraceIdRatioBasedSampler(0.1);

      shouldSample(
        context: Context,
        traceId: string,
        spanName: string,
        spanKind: SpanKind,
        attributes: Attributes,
        links: Link[]
      ): SamplingResult {
        // Always drop health checks
        if (spanName.includes('/health')) {
          return { decision: SamplingDecision.NOT_RECORD };
        }
        // Always keep specific routes
        if (spanName.includes('/api/payments')) {
          return { decision: SamplingDecision.RECORD_AND_SAMPLED };
        }
        // Default: ratio-based
        return this.ratioSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
      }

      toString(): string {
        return 'PrioritySampler';
      }
    }
    # Tail sampling in the OpenTelemetry Collector
    processors:
      tail_sampling:
        decision_wait: 10s    # Wait for all spans in a trace
        num_traces: 100000    # Max traces held in memory
        policies:
          # Always keep error traces
          - name: errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          # Always keep slow traces (> 2s)
          - name: slow-traces
            type: latency
            latency:
              threshold_ms: 2000
          # Sample 10% of everything else
          - name: default
            type: probabilistic
            probabilistic:
              sampling_percentage: 10
          # Always keep payment-related traces
          - name: payments
            type: string_attribute
            string_attribute:
              key: http.route
              values: ['/api/payments.*']
              enabled_regex_matching: true  # needed for values to be treated as regexes

    service:
      pipelines:
        traces:
          # receivers and exporters omitted
          processors: [tail_sampling, batch]
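To make the effect of the Collector policies above easier to reason about, here is a small dependency-free model of the decision. It assumes a trace is kept when at least one policy votes to sample it, which is how the configuration above is meant to behave. The `CompletedTrace` shape, `buildPolicies`, and `keepReasons` are hypothetical names for illustration, and the exact comparison operators are assumptions rather than the processor's implementation.

```typescript
// Hypothetical model of a completed trace; field names are illustrative only.
interface CompletedTrace {
  hasError: boolean;
  durationMs: number;
  route: string;
}

type Policy = { name: string; matches: (t: CompletedTrace) => boolean };

// Mirrors the intent of the YAML policies. The random source is injected
// so the probabilistic policy can be made deterministic when testing.
function buildPolicies(random: () => number): Policy[] {
  return [
    { name: 'errors', matches: (t) => t.hasError },
    { name: 'slow-traces', matches: (t) => t.durationMs > 2000 },
    { name: 'default', matches: () => random() < 0.1 },
    { name: 'payments', matches: (t) => t.route.startsWith('/api/payments') },
  ];
}

// Returns the names of the policies that voted to keep the trace.
// An empty array means the trace is dropped.
function keepReasons(trace: CompletedTrace, policies: Policy[]): string[] {
  return policies.filter((p) => p.matches(trace)).map((p) => p.name);
}

// A fixed "random" value of 0.5 means the 10% policy never fires here.
const policies = buildPolicies(() => 0.5);
const failed = keepReasons({ hasError: true, durationMs: 120, route: '/api/users' }, policies);
const ordinary = keepReasons({ hasError: false, durationMs: 120, route: '/api/users' }, policies);
console.log(failed);   // [ 'errors' ]
console.log(ordinary); // []
```

Since any single vote keeps the trace, the probabilistic policy effectively sets a floor: errors, slow traces, and payment traces are kept regardless of the 10% draw.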
Details
Head vs tail sampling:
| | Head Sampling | Tail Sampling |
|---|---|---|
| Decision point | Trace start | Trace end |
| Can consider outcome | No | Yes (errors, latency) |
| Resource cost | Low (decide once) | High (buffer all spans) |
| Implementation | SDK sampler | Collector processor |
| Consistency | All spans in trace agree | All spans in trace agree |
Recommended strategy for production:
- Head sampling at 100% for critical services, 10-25% for high-volume services
- Tail sampling in the Collector: keep 100% of errors, keep 100% of slow traces, probabilistic sample the rest
- Always sample out health checks and readiness probes
ParentBasedSampler is critical: Without it, a sampled parent trace can have unsampled children, creating broken traces. Always wrap your root sampler with ParentBasedSampler.
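The rule behind parent-based sampling can be sketched without any SDK dependency: a span with a parent inherits the parent's decision, and only a span with no parent consults the root sampler. `ParentInfo` and `parentBasedDecision` below are hypothetical names used for illustration, not SDK APIs, and the model ignores details such as remote versus local parents.

```typescript
// Hypothetical, dependency-free model of parent-based sampling.
interface ParentInfo {
  sampled: boolean; // the sampled flag carried by the parent's trace context
}

function parentBasedDecision(
  parent: ParentInfo | undefined,
  rootDecision: () => boolean,
): boolean {
  // No parent: this span starts a new trace, so the root sampler decides.
  if (parent === undefined) return rootDecision();
  // Parent present: inherit its decision so the trace stays whole.
  return parent.sampled;
}

const neverSample = () => false;
const alwaysSample = () => true;

// The child follows the parent even when the local root sampler disagrees.
console.log(parentBasedDecision({ sampled: true }, neverSample));   // true
console.log(parentBasedDecision({ sampled: false }, alwaysSample)); // false
// Without a parent, the root sampler's answer is used.
console.log(parentBasedDecision(undefined, neverSample));           // false
```

The first two calls show why traces stay whole: the local root sampler is never consulted once an upstream decision exists.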
Cost estimation: A typical span is 200-500 bytes. At 1000 requests/second with 10 spans per request, that is 10K spans/second. At 10% sampling, that drops to 1K spans/second: roughly 200-500 KB/s, or about 0.5-1.3 TB/month.
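The arithmetic behind this estimate can be written out as a small calculator. The sketch below uses only the figures from the text; `TraceVolume` and `estimateVolume` are hypothetical names, and it assumes a 30-day month and decimal terabytes (1 TB = 10^12 bytes).

```typescript
// Worked version of the cost estimate above.
interface TraceVolume {
  requestsPerSecond: number;
  spansPerRequest: number;
  bytesPerSpan: number;
  samplingRatio: number; // fraction of traces kept, 0..1
}

interface VolumeEstimate {
  spansPerSecond: number;
  bytesPerSecond: number;
  terabytesPerMonth: number;
}

const SECONDS_PER_MONTH = 30 * 24 * 60 * 60; // assumes a 30-day month

function estimateVolume(v: TraceVolume): VolumeEstimate {
  const spansPerSecond = v.requestsPerSecond * v.spansPerRequest * v.samplingRatio;
  const bytesPerSecond = spansPerSecond * v.bytesPerSpan;
  const terabytesPerMonth = (bytesPerSecond * SECONDS_PER_MONTH) / 1e12; // decimal TB
  return { spansPerSecond, bytesPerSecond, terabytesPerMonth };
}

const base = { requestsPerSecond: 1000, spansPerRequest: 10, samplingRatio: 0.1 };
const high = estimateVolume({ ...base, bytesPerSpan: 500 });
const low = estimateVolume({ ...base, bytesPerSpan: 200 });

console.log(high.terabytesPerMonth.toFixed(2)); // 1.30
console.log(low.terabytesPerMonth.toFixed(2));  // 0.52
```

The 500-byte case reproduces the 1.3 TB/month figure; with 200-byte spans the same traffic comes to roughly 0.5 TB/month, so measuring real span sizes is worth doing before budgeting.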
Environment variable control:
    OTEL_TRACES_SAMPLER=parentbased_traceidratio
    OTEL_TRACES_SAMPLER_ARG=0.1  # 10%
Source
https://opentelemetry.io/docs/concepts/sampling/
Process
- Read the instructions and examples in this document.
- Apply the patterns to your implementation, adapting to your specific context.
- Verify your implementation against the details and edge cases listed above.
Harness Integration
- Type: knowledge — this skill is a reference document, not a procedural workflow.
- No tools or state — consumed as context by other skills and agents.
Success Criteria
- The patterns described in this document are applied correctly in the implementation.
- Edge cases and anti-patterns listed in this document are avoided.