Claude-code-plugins-plus-skills palantir-incident-runbook

install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/palantir-pack/skills/palantir-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-palantir-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/palantir-pack/skills/palantir-incident-runbook/SKILL.md
source content

Palantir Incident Runbook

Overview

Rapid incident response for Foundry-related outages: API failures, transform build failures, authentication issues, and data pipeline stalls.

Prerequisites

  • Access to application logs and Foundry build history
  • Foundry service user credentials for health checks
  • On-call escalation path defined

Instructions

Step 1: Triage (First 5 Minutes)

set -euo pipefail
echo "=== Foundry Incident Triage ==="
echo "Time: $(date -u)"

# 1. Check if Foundry itself is down
curl -s -o /dev/null -w "Foundry API: HTTP %{http_code}\n" \
  -H "Authorization: Bearer $FOUNDRY_TOKEN" \
  "https://$FOUNDRY_HOSTNAME/api/v2/ontologies" || echo "FOUNDRY UNREACHABLE"

# 2. Check our app health
curl -s http://localhost:8080/health | python -m json.tool

# 3. Check recent error logs
grep -c "ApiError\|status_code.*[45][0-9][0-9]" /var/log/app/app.log | tail -1

Step 2: Classify Severity

SeverityCriteriaResponse Time
P1 CriticalFoundry API completely unreachable, all operations failingImmediate
P2 HighIntermittent 429/5xx errors, degraded performance15 minutes
P3 MediumSingle transform failing, non-critical pipeline stalled1 hour
P4 LowDeprecation warnings, performance degradationNext business day

Step 3: Common Incident Playbooks

Playbook A: Authentication Failure (401/403)

# 1. Verify token is set
echo "Token set: ${FOUNDRY_TOKEN:+yes}"
echo "Token length: ${#FOUNDRY_TOKEN}"

# 2. Test with a fresh token
python -c "
import os, foundry
client = foundry.FoundryClient(
    auth=foundry.UserTokenAuth(
        hostname=os.environ['FOUNDRY_HOSTNAME'],
        token=os.environ['FOUNDRY_TOKEN'],
    ),
    hostname=os.environ['FOUNDRY_HOSTNAME'],
)
print('Auth OK:', list(client.ontologies.Ontology.list())[0].api_name)
"
# 3. If still failing: regenerate credentials in Developer Console

Playbook B: Rate Limiting (429)

# 1. Check rate limit headers from last response
# 2. Enable request throttling
# 3. Review batch operations for unnecessary API calls
# See palantir-rate-limits for detailed implementation

Playbook C: Transform Build Failure

1. Open Foundry > Pipeline Builder > failed build
2. Check the "Errors" tab for stack trace
3. Common causes:
   - OutOfMemoryError → add @configure(profile=["DRIVER_MEMORY_LARGE"])
   - AnalysisException → column name mismatch (case-sensitive)
   - Input dataset empty → check upstream pipeline
4. Fix code, commit, trigger rebuild

Step 4: Escalation

Level 1: On-call engineer (your team)
  → Check logs, verify credentials, restart service

Level 2: Platform team
  → Foundry enrollment issues, networking, VPN

Level 3: Palantir support
  → Create ticket with debug bundle (palantir-debug-bundle)
  → Include: error codes, timestamps, request IDs

Step 5: Postmortem Template

## Incident: [Title]
**Duration:** [start] to [end] ([X] minutes)
**Severity:** P[1-4]
**Impact:** [What was affected]

### Timeline
- HH:MM — Alert fired
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Verified resolution

### Root Cause
[Description]

### Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]

Output

  • Incident triaged and classified within 5 minutes
  • Appropriate playbook executed
  • Escalation if needed with debug bundle
  • Postmortem documented with action items

Error Handling

Incident TypeFirst ActionEscalation Trigger
API unreachableCheck Foundry statusIf Foundry is up but we cannot connect
Auth failureTest with fresh tokenIf new token also fails
Rate limitingEnable throttlingIf throttling does not resolve
Build failureCheck error logsIf error is infrastructure-related

Resources

Next Steps

For proactive monitoring, see

palantir-observability
.