opus-mind
Use when the user edits CLAUDE.md / AGENTS.md / .cursorrules / GEMINI.md / **/SKILL.md or any chatbot system prompt (LINT), OR when the user wants to tighten a vague one-shot prompt before sending it to an LLM (BOOST). Fires on audit/score/review/fix requests, on symptoms like refuse-relent, narration leak, rule conflict, and on "help me improve this prompt" messages.
git clone https://github.com/Hybirdss/opus-mind
T=$(mktemp -d) && git clone --depth=1 https://github.com/Hybirdss/opus-mind "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/opus-mind" ~/.claude/skills/hybirdss-opus-mind-opus-mind && rm -rf "$T"
skills/opus-mind/SKILL.md
Two products, one skill. LINT audits production system prompts against 11 structural invariants reverse-engineered from the leaked Claude Opus 4.7 system prompt. BOOST coaches a user's single request against 10 slots — 7 for specification quality and 3 for reasoning quality (chain-of-thought, verification, decomposition).
Python helpers are deterministic (regex, counts, string templates). Synthesis — composing rewrites, applying semantic review, judging domain context — is done by you, the Claude running this session. No API key, no extra cost, no shell-out.
When to use
- User edits or audits: CLAUDE.md, AGENTS.md, .cursorrules, GEMINI.md, **/SKILL.md, system-prompt*.md, or a chatbot system prompt → LINT.
- User pastes a vague one-shot request meant for Claude / ChatGPT / Cursor and wants it concrete → BOOST.
- User describes a symptom — refuse-relent, narration leak, rule conflict, adjective drift, jailbreak, injection, tool-call drift → LINT Debug.
When NOT to use
- The target file has fewer than 3 directives or fewer than 10 lines. Run audit anyway and quote the THIN verdict back at the user verbatim; do not invent coverage.
- The user wants a generic "make this better" with no file, no pasted text, and no repo context. Ask for the concrete prompt first.
- The request is about Claude's own safety policy (e.g. "why did Claude refuse X?"). Point to Anthropic docs. This skill is about prompt structure, not safety interpretation.
Routing — first-match-wins, stop at match
- Input is a file path ending in .md / .cursorrules, or an inline system prompt (contains {role}, Tier N, refuse, decline, or directive-heavy content) → Flow A: LINT.
- Input is a short natural-language request ("write me X", "help with Y", a one-liner the user plans to send to an assistant) → Flow B: BOOST.
- Input is a symptom only — no file, no prompt to rewrite, just "my bot keeps doing X" → Flow C: Debug.
Do not mix flows in a single turn. Pick one, finish it, stop.
Data contracts (JSON you will parse)
Every Python helper supports --json. The skill treats these schemas as the source of truth — keys are stable, additions are safe, removals or renames require a schema version bump.
audit.py --json <path>
{ "schema_version": "1.0", "path": "CLAUDE.md", "line_count": 220, "score": "8/11", "structural_health": "8/11", "verdict": "BORDERLINE", "thin_reason": null, "placeholder_count": 0, "pass": { "I1_reduce_interpretation": true, "I2_no_rule_conflicts": false, "..." : "..." }, "metrics": { "hedges": 2, "directives": 26, "...": "..." }, "findings": [ { "invariant": "I2", "line": 0, "snippet": "", "issue": "26 directives, 0 ladders", "fix_pointer": "references/primitives/02-decision-ladders.md" } ] }
Key fields to read: verdict (THIN / POOR / BORDERLINE / GOOD), pass (per-invariant boolean map), findings (per-violation details with line refs), placeholder_count (skeleton markers left unfilled).
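A minimal sketch of consuming this contract, assuming you shell out and parse: the keys come from the schema above; the relative script path and the run_audit helper name are illustrative.

```python
import json
import subprocess

def run_audit(path: str) -> dict:
    """Run the deterministic linter and parse its --json payload."""
    out = subprocess.run(
        ["python3", "scripts/audit.py", "--json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

audit = run_audit("CLAUDE.md")
if audit["verdict"] == "THIN":
    # Too thin to audit: quote thin_reason verbatim, do not invent coverage.
    print(audit["thin_reason"])
else:
    print("Failing invariants:", [inv for inv, ok in audit["pass"].items() if not ok])
    for f in audit["findings"]:
        print(f'{f["invariant"]} @ L{f["line"]}: {f["issue"]} -> {f["fix_pointer"]}')
```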
plan.py --json <path>
{ "path": "CLAUDE.md", "score": "8/11", "domain": { "has_tools": true, "has_refusals": true, "is_long": true }, "required_invariants": ["I1_...", "I2_...", "..."], "missing_required": ["I2_no_rule_conflicts", "I9_self_check"], "passing_required": ["I1_...", "..."], "primitive_detections": { "01": "high", "02": "absent", "..." : "..." } }
missing_required is what you rank for improvement.
boost.py check --json <prompt>
{ "source": "<inline>", "coverage": "1/10", "filled_count": 1, "slots": { "B1": { "label": "task", "filled": true, "evidence": ["write a"] }, "B2": { "label": "format", "filled": false, "evidence": [] }, "B3": { "label": "length", "filled": false, "evidence": [] }, "B4": { "label": "context", "filled": false, "evidence": [] }, "B5": { "label": "few_shot", "filled": false, "evidence": [] }, "B6": { "label": "constraints", "filled": false, "evidence": [] }, "B7": { "label": "clarify", "filled": false, "evidence": [] }, "B8": { "label": "reasoning", "filled": false, "evidence": [] }, "B9": { "label": "verification", "filled": false, "evidence": [] }, "B10": { "label": "decomposition", "filled": false, "evidence": [] } } }
Slots split into two layers:
- Specification (B1-B7): what Claude should produce and for whom. Grounded in Anthropic public prompt-engineering docs.
- Reasoning (B8-B10): how Claude should think. Grounded in evidence/smart-prompting-refs.md (Wei 2022 CoT, Shinn 2023 Reflexion, Zhou 2022 Least-to-most, Anthropic "Let Claude think").
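A small sketch of splitting that payload by layer before deciding what to ask: the slot IDs and field names follow the schema above, while the grouping itself is illustrative.

```python
import json

SPEC = {"B1", "B2", "B3", "B4", "B5", "B6", "B7"}    # specification layer
REASONING = {"B8", "B9", "B10"}                       # reasoning layer

def empty_by_layer(check: dict) -> dict:
    """Group unfilled slots by layer so the coaching questions stay targeted."""
    empty = [sid for sid, slot in check["slots"].items() if not slot["filled"]]
    key = lambda s: int(s[1:])                        # keep B2 before B10
    return {
        "coverage": check["coverage"],
        "specification": sorted((s for s in empty if s in SPEC), key=key),
        "reasoning": sorted((s for s in empty if s in REASONING), key=key),
    }

with open("boost_check.json") as fh:                  # saved output of boost.py check --json
    print(empty_by_layer(json.load(fh)))
```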
decode.py --json <path> and symptom_search.py --json <query>
Both emit detection lists you quote by line range and confidence.
Flow A — LINT
Phase 1: Gather
Run the deterministic pass:
python3 "$SKILL_DIR/scripts/audit.py" --json "<path>" python3 "$SKILL_DIR/scripts/plan.py" --json "<path>"
If the file lives inside a repo, use Read/Grep to skim sibling context (README.md, package.json, AGENTS.md) — only enough to infer the project's role (agent, chatbot, code assistant, support bot). You are not auditing those files.
Phase 2: Synthesize
- Parse both JSON payloads.
- If verdict == "THIN", stop and tell the user the file is too thin to audit — quote the thin_reason field.
- Rank the top findings (a ranking sketch follows this list) by:
  - Required for this domain (present in plan.missing_required)
  - Severity (I1 hedge_density > 0.25, I4 narration > 0, I6 consequences < directives/10 are heavier than soft gaps)
  - Fixability via fix --add first, manual rewording second
- Pick the top 3 failing invariants. For each:
  - Read the primitive doc at references/primitives/NN-*.md or the technique doc at references/techniques/NN-*.md
  - Extract the ## TL;DR section (≤ 2 sentences)
- Note any placeholder_count > 0 — the author injected skeletons but did not fill them.
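A sketch of that ranking: the field names come from the audit and plan contracts above, while the severity and fixable sets are illustrative stand-ins, not values audit.py emits.

```python
def rank_findings(audit: dict, plan: dict, top_n: int = 3) -> list:
    """Order findings: domain-required first, heavy signals next, mechanical fixes before rewording."""
    missing = plan["missing_required"]                # e.g. ["I2_no_rule_conflicts", ...]
    heavy = {"I1", "I4", "I6"}                        # hedge density, narration, missing consequences
    fixable = {"I2", "I7", "I9"}                      # skeletons fix.py --add can inject (assumed)

    def weight(finding):
        inv = finding["invariant"]                    # short ID, e.g. "I2"
        return (
            0 if any(m.startswith(inv + "_") for m in missing) else 1,
            0 if inv in heavy else 1,
            0 if inv in fixable else 1,
        )

    return sorted(audit["findings"], key=weight)[:top_n]
```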
Phase 3: Respond in prose
Lead with the verdict, then 3 findings with line refs, then one concrete next action. Do not dump raw JSON. Example shape:
Your CLAUDE.md scores 6/11 (BORDERLINE). Three things move the verdict:
1. I2 decision-ladders — 26 directives, no "Step N → ..." ladder. Primitive 02: routing written as ordered steps with first-match-wins, not as an unordered list. Fix: `opus-mind lint fix CLAUDE.md --add ladder`.
2. I6 consequences at L42 — ...
3. I1 hedges at L88 — ...
Next: run the fix above, then `opus-mind lint report CLAUDE.md` to re-verify. Expected lift: BORDERLINE → GOOD.
Phase 4: Offer the fix, wait for consent
If the user agrees, run:
python3 "$SKILL_DIR/scripts/fix.py" "<path>" --add "<keys>" --apply
Then re-run Phase 1-3 so the user sees the score delta in the same reply. Warn them:
fix --add injects skeletons with <FIXME> markers. They need to fill the markers with domain-specific wording before commit, or the verdict stays below GOOD (placeholder penalty).
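A sketch of the fix-then-reverify loop, reusing the run_audit helper from the audit sketch above; the --add key shown is only an example.

```python
import subprocess

def fix_and_reverify(path: str, keys: str) -> tuple:
    """Apply skeleton fixes, re-audit, and surface the placeholder warning with the delta."""
    before = run_audit(path)
    subprocess.run(
        ["python3", "scripts/fix.py", path, "--add", keys, "--apply"],
        check=True,
    )
    after = run_audit(path)
    if after["placeholder_count"] > 0:
        print("Fill the <FIXME> markers before commit, or the verdict stays below GOOD.")
    return before["score"], after["score"]

print(fix_and_reverify("CLAUDE.md", "ladder"))        # e.g. ('6/11', '8/11')
```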
Phase 5 — Crosscheck (on request)
When the user asks for a semantic review beyond regex:
python3 "$SKILL_DIR/scripts/audit.py" --crosscheck "<path>"
The script prints a structured review prompt. Read it. Apply it as a second reviewer in your next reply: list false positives the regex caught wrongly, additional findings (rule conflicts, consequence mismatches) regex missed, per-invariant severity deltas. No API call. You are the second reviewer.
Flow B — BOOST
Phase 1: Check
python3 "$SKILL_DIR/scripts/boost.py" check --json "<prompt>"
Parse the coverage (filled_count / 10) and the empty slots. The JSON payload also carries:
- task_type: one of code / analyze / research / write / short / unknown — inferred from the prompt's verbs and nouns.
- impact_order: the 10 slots pre-ranked for this task type. Code tasks surface B10 first; analysis surfaces B8; creative writing surfaces B4; short one-offs surface B2 and skip the reasoning layer entirely.
Phase 2: Ask — one question at a time, in impact_order
Do not dump all empty slots as a list. Walk impact_order from the JSON, pick the FIRST empty slot, ask ONE question using AskUserQuestion (Claude Code) / request_user_input (Codex) / ask_user (Gemini). Wait for the answer. Re-run check (or merge the answer mentally), then walk impact_order again for the next empty slot. Stop when coverage ≥ 7/10 or the user signals done.
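A sketch of that walk, assuming the check payload shape above; the 7/10 stop rule is the one stated in this phase.

```python
from typing import Optional

def next_slot_to_ask(check: dict) -> Optional[str]:
    """Return the one slot ID to ask about next, or None when it is time to stop."""
    if check["filled_count"] >= 7:                    # coverage >= 7/10: stop asking
        return None
    for slot_id in check["impact_order"]:             # pre-ranked for this task_type
        if not check["slots"][slot_id]["filled"]:
            return slot_id                            # ask ONE question, then re-check
    return None
```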
Task-type examples for reference (the JSON already hands you the right order; this table is a sanity-check):
| task_type | top-3 slots to ask about first |
|---|---|
| code | B10 decomposition → B8 reasoning → B9 verification |
| analyze | B8 reasoning → B9 verification → B4 context |
| research | B9 verification → B8 reasoning → B4 context |
| write | B4 context → B6 constraints → B3 length |
| short | B2 format → B3 length → B6 constraints |
| unknown | B3 length → B4 context → B2 format |
For B8-B10, suggest yes by default on complex, multi-step, reasoning-heavy tasks (code, analysis, research). Skip them on short one-offs (a tweet, a quick rename, a format conversion) — reasoning overhead hurts there and the JSON ranking already pushes them to the end for task_type == "short".
Phase 2b — Non-English prompt adaptation
The Python regex layer is English-centric. If the user's prompt is in a non-English language (Korean, Japanese, Spanish, etc.), the deterministic filled flags WILL underreport — a Korean prompt that says "단계별로 생각해봐" ("think it through step by step") is real chain-of-thought framing, but B8's English regex will not catch it.
In that case, YOU (the Claude driving this session) override the regex with your own language judgment:
- Read the user's prompt in their native language.
- For each slot, ask yourself the slot's underlying question (see QUESTION_TEMPLATES in boost.py) in the user's language.
- Mark filled yourself when the prompt genuinely answers that question, regardless of what the regex said.
- In your reply, note which slots you judged filled beyond the regex — transparency matters.
Example — Korean prompt: "AI 안전에 대한 500단어 블로그 글 써줘. ML 엔지니어 대상. 단계별로 생각하고 각 주장을 검증해줘." ("Write a 500-word blog post on AI safety. Aimed at ML engineers. Think step by step and verify each claim.")
Regex likely marks B1, B3 filled (English-friendly tokens leaked through) but misses B4 ("ML 엔지니어 대상"), B8 ("단계별로 생각"), B9 ("각 주장을 검증"). You mark all five filled and ask remaining questions in Korean.
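A sketch of recording that override: the check payload shape is the one above, while the helper and the upgraded-slot bookkeeping are illustrative.

```python
def overlay_language_judgment(check: dict, judged_filled: set) -> dict:
    """Merge regex detections with the model's own reading of a non-English prompt."""
    merged, upgraded = {}, []
    for slot_id, slot in check["slots"].items():
        filled = slot["filled"] or slot_id in judged_filled
        if filled and not slot["filled"]:
            upgraded.append(slot_id)                  # report these in the reply, for transparency
        merged[slot_id] = filled
    return {"filled": merged, "upgraded_beyond_regex": upgraded}

# Korean example above: regex caught B1/B3; the model also reads B4, B8, B9 as filled.
# overlay_language_judgment(check, {"B4", "B8", "B9"})
```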
Phase 3: Compose
python3 "$SKILL_DIR/scripts/boost.py" expand "<prompt>" \ --length "<answer>" --format "<answer>" --context "<answer>" ...
The script prints a composition prompt (NOT an API response). Read the template. Compose the rewritten user prompt as your next reply, following the rules in the emitted template: imperative verb + object, fold each answer in once, no hedging, no preamble.
Phase 4: Show the diff
After emitting the rewrite, summarize:
Original (9 words, 1/7): "write me a blog post"
Rewritten (67 words, 7/7): [your composition]
Added: length, audience, format, tone, constraints
Offer one more iteration if the user wants to tune a slot.
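A sketch of assembling that summary from a before/after pair of check payloads; the word counts and slot labels come from the contract, the formatting is illustrative.

```python
def diff_summary(original: str, rewritten: str, before: dict, after: dict) -> str:
    """Summarize the rewrite: word counts, coverage, and which slots were added."""
    added = [
        after["slots"][sid]["label"]
        for sid in after["slots"]
        if after["slots"][sid]["filled"] and not before["slots"][sid]["filled"]
    ]
    return (
        f'Original ({len(original.split())} words, {before["coverage"]}): "{original}"\n'
        f"Rewritten ({len(rewritten.split())} words, {after['coverage']}): [your composition]\n"
        f'Added: {", ".join(added)}'
    )
```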
Flow C — Debug by symptom
Phase 1: Match
python3 "$SKILL_DIR/scripts/symptom_search.py" "<symptom>" --json
Phase 2: Teach
Read the matched primitive or technique doc. Quote the TL;DR. Explain in 2 sentences: what the failure mode is, why it happens, which primitive prevents it.
Phase 3: Bridge to LINT
If the user has a file where the symptom is firing, offer to run Flow A against it.
Flow D — Evaluate (two-stage, subagent-powered)
Turns opus-mind from "regex linter" into "measured linter." No API key needed — uses the Claude Code Agent tool to dispatch subagents. v0.2 splits role-play and grading across two subagents to remove self-grading bias.
Phase 1: Prepare
opus-mind eval audit-corpus     # audit.py --json per corpus prompt
opus-mind eval prepare-tasks    # render roleplay task files
Phase 2: Role-play (Haiku, parallel)
For each task, spawn one Haiku subagent. Haiku's lower safety floor lets system-prompt structure show through more than Sonnet would.
Agent(
  subagent_type="general-purpose",
  model="haiku",
  prompt="Read evals/tasks_roleplay/<task_id>.md. Follow its instructions. Write JSON to evals/responses/<task_id>.json."
)
Each subagent writes a responses array to disk. No grading at this stage.
Phase 3: Blind grade (Sonnet, parallel)
Render grade-task prompts from the responses, then dispatch Sonnet graders. Critically, the grader receives only the response + ideal_behavior + rubric. It does NOT see the system prompt that produced the response. This is what makes grading blind.
for f in "$SKILL_DIR/evals/responses"/*.json; do tid=$(basename "$f" .json) opus-mind eval render-grade "$tid" "$f" \ > "$SKILL_DIR/evals/tasks_grade/${tid}.md" done
Agent(
  subagent_type="general-purpose",
  model="sonnet",
  prompt="Read evals/tasks_grade/<task_id>.md. Write JSON to evals/grades/<task_id>.json."
)
Phase 4: Aggregate
opus-mind eval aggregate
Joins responses/ + grades/ → results/, then produces evals/REPORT.md with:
- per-prompt audit score vs behavior score
- per-category behavior means
- per-invariant correlation (mean(pass) − mean(fail))
Δ < 0.3 on a 1-5 scale = invariant is not load-bearing on this corpus. The v0.2 run found every invariant Δ ≤ 0.06, i.e. zero invariants were load-bearing on the v0.2 corpus. That is a real finding, not a regression.
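A sketch of that correlation, assuming the joined records in results/ carry the grader's behavior_score alongside the prompt's audit pass map; those field names are illustrative.

```python
from statistics import mean

def invariant_delta(results: list, invariant: str) -> float:
    """mean(behavior score where the invariant passed) - mean(where it failed), on the 1-5 scale."""
    passed = [r["behavior_score"] for r in results if r["audit"]["pass"][invariant]]
    failed = [r["behavior_score"] for r in results if not r["audit"]["pass"][invariant]]
    if not passed or not failed:
        return 0.0                                    # no contrast on this corpus
    return mean(passed) - mean(failed)

# Delta < 0.3 on the 1-5 scale: the invariant is not load-bearing on this corpus.
```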
Phase 5: Act on the delta
- Drop invariants whose Δ is near 0 across categories.
- Boost invariants with Δ > 0.5 in the README (none qualify yet).
- Add harder cases (multi-turn drift, plausible business social engineering) — that is the most likely place structure will start to matter.
- Add more corpus prompts from real CL4R1T4S-style leaks.
Platform adaptation
- Blocking question tool: AskUserQuestion (Claude Code) / request_user_input (Codex) / ask_user (Gemini)
- Content search: Grep + Glob (Claude Code) / rg (Codex) / search_files (Gemini)
- File read / edit: platform-native — never shell out for these
- LLM synthesis: never shell out. The surrounding platform already runs a model.
Common mistakes
- Dumping raw audit --json output. The skill's value is prose synthesis with line refs. JSON is input, not output.
- Asking all empty boost slots at once. One question at a time, ranked by impact. Users abandon 6-question menus.
- Requesting an ANTHROPIC_API_KEY. You are the LLM. Compose in-chat. The architecture block forbids external calls.
- Scoring SKILL.md with audit.py --self. Wrong genre. audit.py targets system prompts; SKILL.md is instructions for Claude. Different ruleset, different evaluator.
- Starting with fix --add before showing the report. Users need to see why a change is recommended before approving it.
How to invoke
Natural-language triggers that fire this skill:
- "audit my CLAUDE.md" → Flow A
- "score this SKILL.md" → Flow A
- "is this
any good?" → Flow A.cursorrules - "my bot refuses then gives in two turns later" → Flow C
- "help me turn 'write a blog post' into a real prompt" → Flow B
- "I want this better — here's my prompt: [...]" → Flow B
- "why does Claude keep narrating tool calls in my chatbot?" → Flow C → Flow A
The 11 structural invariants (LINT)
Source refs are line numbers in the leaked Opus 4.7 system prompt (CL4R1T4S mirror). Every check is regex + count with an explicit threshold — no vibe grading.
| ID | Primitive | Signal | Source |
|---|---|---|---|
| I1 | 03 hard-numbers | hedge_density ≤ 0.25, number_density ≥ 0.10 | L664, L620 |
| I2 | 02 decision-ladders | Step N tokens + stop-at-first-match | L515–L537 |
| I3 | 09 reframe-as-signal | reframe clause when refusal content present | L33 |
| I4 | 08 anti-narration | zero forbidden preambles | L536, L560 |
| I5 | 06 example + rationale | every example carries a rationale | L710–L750 |
| I6 | technique 04 | consequences ≥ directives / 10 | L753–L759 |
| I7 | 01 namespace-blocks | every opened block has a matching close tag | structural |
| I8 | 04 default + exception | default + (unless/except/only-when) cooccur | L25, L57–68 |
| I9 | 07 self-check | self-check block when prompt is long | L698–L707 |
| I10 | pattern: tier-labels | ALLCAPS multi-word markers for high-stakes | L640, L657 |
| I11 | 12 hierarchical-override | Tier N / X > Y > Z / "takes precedence" | L657 |
The plan.py domain inference (has_tools, has_refusals, is_long, has_examples, has_conflicts) decides which invariants are required for a given file. Always required: I1, I2, I4, I6, I7, I8.
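A sketch of one check in this regex-plus-count style, simplified to the I1 hedge threshold and the I6 consequence ratio; the word lists below are illustrative stand-ins, and the real patterns live in audit.py.

```python
import re

HEDGE = re.compile(r"\b(try to|if possible|maybe|generally|usually|consider)\b", re.I)
DIRECTIVE = re.compile(r"\b(must|never|always|do not|don't)\b", re.I)
CONSEQUENCE = re.compile(r"\b(otherwise|or else|will be rejected|is a failure)\b", re.I)

def check_i1_i6(text: str) -> dict:
    """I1: hedge density stays under threshold. I6: consequences keep pace with directives."""
    lines = text.splitlines()
    directives = sum(1 for line in lines if DIRECTIVE.search(line))
    hedges = sum(len(HEDGE.findall(line)) for line in lines)
    consequences = sum(1 for line in lines if CONSEQUENCE.search(line))
    return {
        "I1_hedge_density_ok": directives > 0 and hedges / directives <= 0.25,
        "I6_consequences_ok": consequences >= directives / 10,
    }
```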
The 10 BOOST slots
Specification layer — grounded in Anthropic public prompt-engineering docs:
| ID | Slot | Answers the question |
|---|---|---|
| B1 | task | What to produce (imperative verb + object) |
| B2 | format | Output shape (JSON / markdown / bullets / prose) |
| B3 | length | Numeric budget (words / tokens / lines / bullets) |
| B4 | context | Audience + background |
| B5 | few_shot | Example of desired output |
| B6 | constraints | Tone / style / avoid-list |
| B7 | clarify | Ambiguity policy (ask vs assume + flag) |
Reasoning layer — grounded in evidence/smart-prompting-refs.md:
| ID | Slot | Technique | Source |
|---|---|---|---|
| B8 | reasoning | ask for step-by-step / outline-first thinking | Wei 2022 (CoT) |
| B9 | verification | ask for self-check / flag uncertain claims | Shinn 2023 (Reflexion) |
| B10 | decomposition | ask for plan-before-execute / break into subtasks | Zhou 2022 (Least-to-most) |
None of these overlap with the system prompt. They are the slots only the user can fill — specification shapes what Claude produces, reasoning shapes how Claude thinks.
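A sketch of how a single slot detection might look, using B3 as the example; the pattern below is an assumption for illustration, not boost.py's actual regex.

```python
import re

LENGTH_BUDGET = re.compile(r"\b(\d+)\s*(words?|tokens?|lines?|bullets?|sentences?)\b", re.I)

def detect_b3(prompt: str) -> dict:
    """B3 length: filled when the prompt names a numeric budget."""
    hits = LENGTH_BUDGET.findall(prompt)
    return {"label": "length", "filled": bool(hits), "evidence": [" ".join(h) for h in hits]}

print(detect_b3("write a 500 word blog post on AI safety"))
# {'label': 'length', 'filled': True, 'evidence': ['500 word']}
```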
Evidence and attribution
Every recommendation anchors to source/opus-4.7.txt:L### or a primitive/technique file in references/. The source is not hosted here — see source/README.md for the CL4R1T4S pointer. This skill is independent third-party analysis, not endorsed by Anthropic.