Claude-code-plugins-plus-skills palantir-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/palantir-pack/skills/palantir-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-palantir-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/palantir-pack/skills/palantir-incident-runbook/SKILL.md
Palantir Incident Runbook
Overview
Rapid incident response for Foundry-related outages: API failures, transform build failures, authentication issues, and data pipeline stalls.
Prerequisites
- Access to application logs and Foundry build history
- Foundry service user credentials for health checks
- On-call escalation path defined
Instructions
Step 1: Triage (First 5 Minutes)
    set -euo pipefail
    echo "=== Foundry Incident Triage ==="
    echo "Time: $(date -u)"
    # 1. Check if Foundry itself is down
    curl -s -o /dev/null -w "Foundry API: HTTP %{http_code}\n" \
      -H "Authorization: Bearer $FOUNDRY_TOKEN" \
      "https://$FOUNDRY_HOSTNAME/api/v2/ontologies" || echo "FOUNDRY UNREACHABLE"
    # 2. Check our app health (keep going if the app is down, so step 3 still runs)
    curl -s http://localhost:8080/health | python -m json.tool || echo "APP HEALTH CHECK FAILED"
    # 3. Count error lines in the app log
    #    (grep -c exits 1 on zero matches, which would abort the script under set -e)
    grep -c "ApiError\|status_code.*[45][0-9][0-9]" /var/log/app/app.log || true
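The grep in step 3 returns a single count, which does not distinguish client errors from server errors. The sketch below shows one way to bucket the same matches by status class before classifying severity. The function name `count_errors` and the sample log line format are assumptions for illustration; the actual layout of `/var/log/app/app.log` is not specified here.

```python
import re
from collections import Counter

# Mirrors the two alternatives in the triage grep: the literal "ApiError",
# or "status_code" followed somewhere later by a 4xx/5xx code.
ERROR_PATTERN = re.compile(r"ApiError|status_code.*?([45])[0-9]{2}")


def count_errors(lines):
    """Count matching log lines, bucketed as '4xx', '5xx', or 'ApiError'."""
    counts = Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match is None:
            continue
        # group(1) is populated only when the status_code branch matched.
        bucket = f"{match.group(1)}xx" if match.group(1) else "ApiError"
        counts[bucket] += 1
    return counts


def count_errors_in_file(path="/var/log/app/app.log"):
    """Convenience wrapper for the log path used in the triage script."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_errors(handle)
```

Like the grep pattern, this matches any three-digit number starting with 4 or 5 after `status_code`, so a line such as `status_code=200 took 450ms` would be a false positive. A mostly-4xx result points toward Playbook A or B; a mostly-5xx result suggests checking Foundry itself first.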
Step 2: Classify Severity
| Severity | Criteria | Response Time |
|---|---|---|
| P1 Critical | Foundry API completely unreachable, all operations failing | Immediate |
| P2 High | Intermittent 429/5xx errors, degraded performance | 15 minutes |
| P3 Medium | Single transform failing, non-critical pipeline stalled | 1 hour |
| P4 Low | Deprecation warnings, minor performance degradation | Next business day |
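The table above can be encoded as a small decision function so that on-call tooling assigns a severity consistently. This is a minimal sketch: the `Signals` fields and the thresholds (for example, treating any nonzero 429/5xx rate as P2) are assumptions, not part of the runbook, and should be tuned to your service.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    foundry_reachable: bool   # did the triage curl reach the Foundry API?
    error_rate: float         # fraction of recent requests returning 429/5xx
    failing_transforms: int   # number of transforms or pipelines currently failing


def classify_severity(s: Signals) -> str:
    """Map observed signals to a severity label from the table."""
    # P1: API unreachable, or every operation failing.
    if not s.foundry_reachable or s.error_rate >= 1.0:
        return "P1"
    # P2: some, but not all, requests hit 429/5xx.
    if s.error_rate > 0:
        return "P2"
    # P3: API healthy, but at least one transform or pipeline is failing.
    if s.failing_transforms > 0:
        return "P3"
    # P4: nothing failing; only warnings or minor slowness remain.
    return "P4"
```

The checks run from most to least severe, so an incident that meets several criteria is always assigned the highest applicable level.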
Step 3: Common Incident Playbooks
Playbook A: Authentication Failure (401/403)
    # 1. Verify token is set
    echo "Token set: ${FOUNDRY_TOKEN:+yes}"
    echo "Token length: ${#FOUNDRY_TOKEN}"
    # 2. Test with a fresh token
    python -c "
    import os, foundry
    client = foundry.FoundryClient(
        auth=foundry.UserTokenAuth(
            hostname=os.environ['FOUNDRY_HOSTNAME'],
            token=os.environ['FOUNDRY_TOKEN'],
        ),
        hostname=os.environ['FOUNDRY_HOSTNAME'],
    )
    print('Auth OK:', list(client.ontologies.Ontology.list())[0].api_name)
    "
    # 3. If still failing: regenerate credentials in Developer Console
Playbook B: Rate Limiting (429)
    # 1. Check rate limit headers from last response
    # 2. Enable request throttling
    # 3. Review batch operations for unnecessary API calls
    # See palantir-rate-limits for detailed implementation
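As a stopgap while the detailed throttling from palantir-rate-limits is put in place, a retry wrapper with exponential backoff can take pressure off during a 429 incident. The sketch below is not the implementation from that skill. It assumes the response object exposes `.status_code` and `.headers` (as a `requests.Response` does) and that the server may send the standard HTTP `Retry-After` header; whether Foundry sends it is an assumption. The names `backoff_delay` and `call_with_retry` are hypothetical.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        try:
            # Honor a numeric Retry-After value, clamped to [0, cap].
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff.
    # Exponential backoff with jitter so clients do not retry in lockstep.
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)


def call_with_retry(request, max_attempts=5, sleep=time.sleep):
    """Call `request()` until it returns a non-429 response or attempts run out.

    `request` is any zero-argument callable returning an object with
    `.status_code` and `.headers`.
    """
    for attempt in range(max_attempts):
        response = request()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    return response  # still 429 after the final attempt; caller decides what next
```

If the final response is still a 429, that corresponds to the "throttling does not resolve" escalation trigger in the Error Handling table.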
Playbook C: Transform Build Failure
    1. Open Foundry > Pipeline Builder > failed build
    2. Check the "Errors" tab for stack trace
    3. Common causes:
       - OutOfMemoryError → add @configure(profile=["DRIVER_MEMORY_LARGE"])
       - AnalysisException → column name mismatch (case-sensitive)
       - Input dataset empty → check upstream pipeline
    4. Fix code, commit, trigger rebuild
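The common causes listed in step 3 can be turned into a quick lookup that suggests a first fix from the text of the "Errors" tab. This is a sketch only: `suggest_fix` and its `input_row_count` parameter are hypothetical helpers, and substring matching on the stack trace is a simplification.

```python
# Marker substring -> suggested first fix, taken from the common causes list.
KNOWN_CAUSES = [
    ("OutOfMemoryError",
     'add @configure(profile=["DRIVER_MEMORY_LARGE"]) to the transform'),
    ("AnalysisException",
     "check for a column name mismatch (names are case-sensitive)"),
]


def suggest_fix(error_text, input_row_count=None):
    """Return a suggested first fix for a failed build.

    `error_text` is the stack trace copied from the build's Errors tab;
    `input_row_count` is the row count of the input dataset, if known.
    """
    for marker, fix in KNOWN_CAUSES:
        if marker in error_text:
            return fix
    if input_row_count == 0:
        return "input dataset is empty; check the upstream pipeline"
    return "no known cause matched; read the full stack trace"
```

New markers can be appended to `KNOWN_CAUSES` as postmortems identify recurring failures.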
Step 4: Escalation
    Level 1: On-call engineer (your team)
      → Check logs, verify credentials, restart service
    Level 2: Platform team
      → Foundry enrollment issues, networking, VPN
    Level 3: Palantir support
      → Create ticket with debug bundle (palantir-debug-bundle)
      → Include: error codes, timestamps, request IDs
Step 5: Postmortem Template
    ## Incident: [Title]
    **Duration:** [start] to [end] ([X] minutes)
    **Severity:** P[1-4]
    **Impact:** [What was affected]
    ### Timeline
    - HH:MM - Alert fired
    - HH:MM - Investigation started
    - HH:MM - Root cause identified
    - HH:MM - Fix deployed
    - HH:MM - Verified resolution
    ### Root Cause
    [Description]
    ### Action Items
    - [ ] [Preventive measure 1]
    - [ ] [Preventive measure 2]
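Filling in the header by hand invites arithmetic slips in the duration. The sketch below renders the top of the template from start and end timestamps and computes the minutes itself. The function name `render_postmortem` and the timestamp format are assumptions; the timeline, root cause, and action items are left as empty sections to complete manually.

```python
from datetime import datetime


def render_postmortem(title, start, end, severity, impact):
    """Return the postmortem header and empty sections as Markdown text."""
    if severity not in (1, 2, 3, 4):
        raise ValueError("severity must be an integer from 1 to 4")
    if end < start:
        raise ValueError("end must not precede start")
    # Whole minutes between the two timestamps.
    minutes = int((end - start).total_seconds() // 60)
    stamp = "%Y-%m-%d %H:%M"
    return "\n".join([
        f"## Incident: {title}",
        f"**Duration:** {start.strftime(stamp)} to {end.strftime(stamp)} "
        f"({minutes} minutes)",
        f"**Severity:** P{severity}",
        f"**Impact:** {impact}",
        "### Timeline",
        "### Root Cause",
        "### Action Items",
    ])
```

Passing both timestamps in UTC keeps the record consistent with the `date -u` output captured during triage.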
Output
- Incident triaged and classified within 5 minutes
- Appropriate playbook executed
- Escalation if needed with debug bundle
- Postmortem documented with action items
Error Handling
| Incident Type | First Action | Escalation Trigger |
|---|---|---|
| API unreachable | Check Foundry status | If Foundry is up but we cannot connect |
| Auth failure | Test with fresh token | If new token also fails |
| Rate limiting | Enable throttling | If throttling does not resolve |
| Build failure | Check error logs | If error is infrastructure-related |
Next Steps
For proactive monitoring, see palantir-observability.