# Agent-almanac argumentation

```bash
git clone https://github.com/pjt222/agent-almanac
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/caveman-ultra/skills/argumentation" ~/.claude/skills/pjt222-agent-almanac-argumentation-6fff56 && rm -rf "$T"
```

`i18n/caveman-ultra/skills/argumentation/SKILL.md`

## Construct Arguments
Build rigorous arguments: hypothesis → reasoning → concrete evidence. Triad: hypothesis = what you believe, argument = why it holds, examples = evidence it holds. Apply to code reviews, design decisions, research writing, any ctx w/ claims needing justification.
### Use When
- Writing/reviewing PR desc proposing technical change
- Justifying design decision in ADR
- Code review feedback beyond "I don't like this"
- Research argument or technical proposal
- Challenge/defend approach in technical discussion
### In
- Required: Claim or position needing justification
- Required: Context (code review, design decision, research, doc)
- Optional: Audience (peer devs, reviewers, stakeholders, researchers)
- Optional: Counterarguments or alternative positions
- Optional: Evidence or data available
### Do
#### Step 1: Formulate Hypothesis
Clear, falsifiable claim. Not opinion or preference — specific assertion testable vs evidence.
- Claim in 1 sentence
- Falsifiability test: can someone prove wrong w/ evidence?
- Scope narrowly: specific ctx, codebase, domain
- Distinguish from opinions → testable criteria
Falsifiable vs unfalsifiable:
| Unfalsifiable (opinion) | Falsifiable (hypothesis) |
|---|---|
| "This code is bad" | "This function has O(n^2) complexity where O(n) is achievable" |
| "We should use TypeScript" | "TypeScript's type system will catch the class of null-reference bugs that caused 4 of our last 6 production incidents" |
| "The API design is cleaner" | "Replacing the 5 endpoint variants with a single parameterized endpoint reduces the public API surface by 60%" |
| "This research approach is better" | "Method A achieves higher precision than Method B on dataset X at the 95% confidence level" |
→ 1-sentence hypothesis specific + scoped + falsifiable. Reader immediately imagines evidence confirming/refuting.
If err: Vague → "how would I disprove?" test. Can't imagine counter-evidence → opinion not hypothesis. Narrow scope or add measurable criteria.
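Minimal sketch of the first falsifiable claim in the table, w/ hypothetical `hasDuplicate` helpers; anyone can refute the claim by benchmark or inspection:

```typescript
// Falsifiable: "this function has O(n^2) complexity where O(n) is achievable".
// Refutable by pointing at the linear version below or by benchmarking both.
function hasDuplicateQuadratic(items: string[]): boolean {
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (items[i] === items[j]) return true; // O(n^2) pairwise comparisons
    }
  }
  return false;
}

function hasDuplicateLinear(items: string[]): boolean {
  const seen = new Set<string>();
  for (const item of items) {
    if (seen.has(item)) return true; // single pass, O(n) expected w/ hash set
    seen.add(item);
  }
  return false;
}
```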
#### Step 2: ID Argument Type
Logical structure for hypothesis. Diff claims need diff reasoning strategies.
- Review 4 types:
| Type | Structure | Best for |
|---|---|---|
| Deductive | If A then B; A true; therefore B | Formal proofs, type safety |
| Inductive | Observed pattern N cases; therefore likely | Perf data, test results |
| Analogical | X similar to Y relevant ways; Y has P; therefore X likely P | Design decisions, tech choices |
| Evidential | E more likely under H1 than H2; therefore H1 supported | Research findings, A/B results |
- Match hypothesis → strongest type:
  - must be true → deductive
  - tends to be true via observations → inductive
  - will likely work via similar prior cases → analogical
  - one explanation fits data better → evidential
- Combine types for stronger arguments (analogical + inductive evidence)
→ Chosen type (or combo) + clear rationale why fits hypothesis.
If err: No single type fits cleanly → split hypothesis into sub-claims. Each w/ natural argument structure.
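Evidential support quantifiable as likelihood ratio; minimal worked sketch, numbers hypothetical:

```latex
\mathrm{LR} = \frac{P(E \mid H_1)}{P(E \mid H_2)} = \frac{0.8}{0.2} = 4
\quad\Rightarrow\quad E \text{ supports } H_1 \text{ over } H_2
```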
#### Step 3: Construct Argument
Logical chain connecting hypothesis → justification.
- State premises (facts/assumptions starting from)
- Show logical connection (premises → conclusion)
- Steelman strongest counterargument — state best opposing before refuting
- Address counterargument directly w/ evidence or reasoning
Worked example — Code Review (deductive + inductive):
Hypothesis: "Extracting validation logic into shared module will reduce bug duplication across 3 API handlers."
Premises:
- 3 handlers (`createUser`, `updateUser`, `deleteUser`) in `src/handlers/` impl same input valid. w/ slight variations (observed)
- Last 6 months, 3/5 valid. bugs fixed in 1 handler not propagated (issues #42, #57, #61)
- Shared modules enforce single src of truth (deductive: if one impl, then one place to fix)
Logical chain: Because 3 handlers duplicate same valid. (premise 1), bugs fixed in 1 missed in others (premise 2, inductive from 3/5). Shared module → fixes apply once to all callers (deductive from shared-module semantics). Therefore extraction reduces bug duplication.
Counterargument (steelmanned): "Shared modules introduce coupling — change to valid. for 1 handler could break others."
Rebuttal: Handlers already share identical valid. intent; coupling implicit + harder to maintain. Making explicit via shared module w/ parameterized options (`validate(input, { requireEmail: true })`) makes coupling visible + testable. Current implicit duplication riskier — hides dependency.
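Minimal sketch of the shared module the rebuttal describes, assuming the `validate(input, { requireEmail: true })` shape above; option names + file layout are hypothetical, not from the repo:

```typescript
// Hypothetical shared validation module (src/validation.ts): single source of truth
// for the input checks previously duplicated across createUser/updateUser/deleteUser.
export interface ValidateOptions {
  requireEmail?: boolean; // per-handler variation made explicit (assumed option)
}

export interface ValidationResult {
  ok: boolean;
  errors: string[];
}

export function validate(
  input: { email?: string; id?: string },
  opts: ValidateOptions = {},
): ValidationResult {
  const errors: string[] = [];
  if (opts.requireEmail && !input.email?.includes("@")) {
    errors.push("email: required, must contain '@'"); // fix once → applies to all callers
  }
  if (input.id !== undefined && !/^\d+$/.test(input.id)) {
    errors.push("id: must be numeric");
  }
  return { ok: errors.length === 0, errors };
}

// Callers: createUser → validate(input, { requireEmail: true });
// updateUser / deleteUser → validate(input)
```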
Worked example — Research (evidential):
Hypothesis: "Pre-training on domain-specific corpora improves downstream task perf more than increasing general corpus size for biomedical NER."
Premises:
- BioBERT pre-trained on PubMed (4.5B words) outperforms BERT-Large on general English (16B words) on 6/6 biomedical NER benchmarks (Lee et al., 2020)
- SciBERT pre-trained on Semantic Scholar (3.1B words) outperforms BERT-Base on SciERC + JNLPBA despite smaller pre-training corpus
- General-domain scaling (BERT-Base → BERT-Large, 3x params) yields smaller gains on biomedical NER than domain adaptation (BERT-Base → BioBERT, same params)
Logical chain: Evidence consistently shows domain corpus selection outweighs scale for biomedical NER (evidential: these results more likely if domain specificity > scale). 3 independent comparisons same direction → strengthens inductive case.
Counterargument (steelmanned): "Results may not generalize beyond biomedical NER — biomedicine has unusually specialized vocab inflating domain-adaptation advantage."
Rebuttal: Valid limitation. Hypothesis scoped biomedical NER specifically. Similar gains appear legal NLP (Legal-BERT) + financial NLP (FinBERT) → pattern may generalize to other specialized domains, separate claim needing its own evidence.
→ Complete argument chain + premises + logical connection + steelmanned counterargument + rebuttal. Reader follows step by step.
If err: Argument weak → check premises. Weak args usually stem from unsupported premises, not faulty logic. Find evidence per premise or acknowledge as assumption. Counterargument stronger than rebuttal → hypothesis needs revision.
#### Step 4: Concrete Examples
Support w/ independently verifiable evidence. Not illustrations — empirical foundation making argument testable.
- ≥1 positive example confirming hypothesis
- ≥1 edge case/boundary example testing limits
- Each independently verifiable: another person can reproduce or check w/o relying on your interpretation
- Code claims → specific files, line nums, commits
- Research claims → specific papers, datasets, experimental results
Example selection criteria:
| Criterion | Good example | Bad example |
|---|---|---|
| Independently verifiable | "Issue #42 shows the bug was fixed in handler A but not B" | "We've seen this kind of bug before" |
| Specific | "`createUser` at line 47 re-implements the same regex as `updateUser` at line 23" | "There's duplication in the codebase" |
| Representative | "3 of 5 validation bugs in the last 6 months followed this pattern" | "I once saw a bug like this" |
| Includes edge cases | "This pattern holds for string inputs but not for file upload validation, which has handler-specific constraints" | (no limitations mentioned) |
→ Concrete examples reader can verify independently. ≥1 positive + 1 edge case. Each refs specific artifact (file, line, issue, paper, dataset).
If err: Examples hard to find → hypothesis too broad or not grounded in observable reality. Narrow scope → what you can actually point to. Absence = signal, not gap to paper over.
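Evidence can also be pinned mechanically. Hypothetical regression test (vitest-style; handler names from Step 3, import path assumed) making the consistency claim checkable by anyone:

```typescript
// Hypothetical regression test: all 3 handlers must reject the same malformed
// input the same way (the issue #42 failure mode: fix landed in 1 handler only).
import { describe, it, expect } from "vitest";
import { createUser, updateUser, deleteUser } from "../src/handlers"; // assumed path

const malformed = { email: "not-an-email", id: "abc" };

describe("validation consistency across handlers", () => {
  for (const handler of [createUser, updateUser, deleteUser]) {
    it(`${handler.name} rejects malformed input`, async () => {
      await expect(handler(malformed)).rejects.toThrow(/validation/i);
    });
  }
});
```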
#### Step 5: Assemble Complete Argument
Combine hypothesis + argument + examples → appropriate format for ctx.
- Code reviews — structure (filled example at end of this step):

  ```
  [S] <one-line summary of the suggestion>
  **Hypothesis**: <what you believe should change and why>
  **Argument**: <the logical case, with premises>
  **Evidence**: <specific files, lines, issues, or metrics>
  **Suggestion**: <concrete code change or approach>
  ```

- PR descriptions — body:

  ```
  ## Why
  <Hypothesis: what problem this solves and the specific improvement claim>

  ## Approach
  <Argument: why this approach was chosen over alternatives>

  ## Evidence
  <Examples: benchmarks, bug references, before/after comparisons>
  ```

- ADRs (Architecture Decision Records) — std ADR format + triad → Context (hypothesis), Decision (argument), Consequences (examples/evidence of expected outcomes)
- Research writing — std structure: Intro = hypothesis, Methods/Results = argument + examples, Discussion = counterarguments
- Review assembled for:
  - Logical gaps (conclusion follow from premises?)
  - Missing evidence (unsupported premises?)
  - Unaddressed counterarguments (strongest objection answered?)
  - Scope creep (stays within hypothesis bounds?)
→ Complete formatted argument appropriate for ctx. Reader can eval hypothesis, follow reasoning, check evidence, consider counterarguments — all in 1 coherent structure.
If err: Assembled feels disjointed → hypothesis too broad. Split into focused sub-arguments, each w/ own triad. 2 tight > 1 sprawling.
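Code-review structure filled w/ the Step 3 worked example, illustrative only:

```markdown
[S] Extract duplicated input validation into a shared module
**Hypothesis**: Extracting validation logic into a shared module will reduce bug duplication across the 3 API handlers.
**Argument**: createUser, updateUser, deleteUser implement the same validation w/ slight variations; 3 of the last 5 validation bugs were fixed in 1 handler but not propagated. Shared module → 1 fix applies to all callers.
**Evidence**: src/handlers/; issues #42, #57, #61.
**Suggestion**: Extract validate(input, options); replace per-handler checks w/ calls to it.
```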
### Check
- Hypothesis falsifiable (could disprove w/ evidence)
- Hypothesis scoped specific ctx, not universal
- Argument type ID'd + appropriate for claim
- Premises stated explicitly, not assumed shared knowledge
- Logical chain connects premises → conclusion no gaps
- Strongest counterargument steelmanned + addressed
- ≥1 positive example supports
- ≥1 edge case or limitation acknowledged
- All examples independently verifiable (refs provided)
- Out format matches ctx (code review, PR, ADR, research)
- No logical fallacies (appeal to authority, false dichotomy, strawman)
### Traps
- State opinions as hypotheses: "This code is messy" = preference not hypothesis. Rewrite testable: "This module has 4 responsibilities that should be separated per single-responsibility principle, as evidenced by 6 public methods spanning 3 unrelated domains."
- Skip counterargument: Unaddressed objections weaken even if reader never voices. Always steelman — state strongest opposing in best form before rebutting.
- Vague examples: "We've seen this pattern" not evidence. Point to specific issues, commits, lines, papers, datasets. Can't find concrete → hypothesis not well-grounded.
- Argument from authority: "Senior engineer said so" or "Google does it this way" not logical argument. Authority motivates investigation; argument must stand on evidence + reasoning.
- Scope creep in conclusions: Drawing broader than evidence supports. Examples cover 3 API handlers → don't conclude about entire codebase. Match conclusion scope → evidence scope.
- Conflate argument types: Inductive lang ("tends to") for deductive ("must be") or vice versa. Precise about strength — deductive = certainty, inductive = probability.
→ Related skills:
- `review-pull-request` — apply argumentation → structured code review feedback
- `review-research` — evidence-based arguments in research
- `review-software-architecture` — justify architectural decisions via triad
- `create-skill` — skills = structured arguments for how to accomplish task
- `write-claude-md` — doc conventions + decisions benefiting from clear justification
### Composition: Argumentation + Advocatus Diaboli
High-stakes decisions → compose w/ `advocatus-diaboli` agent → pre-decision review loop:
- Structure via argumentation — build triad
- Stress-test via advocatus-diaboli — steelman proposal, challenge each assumption w/ specific questions. Severity: Critical (redesign/abandon), Medium (adjust), Low (note + proceed)
- Revise per findings — critical → redesign; medium → adjustment; low → noted
When compose vs alone:
- Argumentation alone → constructing proposal, PR desc, design justification
- Advocatus-diaboli alone → reviewing someone else's existing argument
- Compose both → you're both proposer + need adversarial self-review pre-committing
Example — PR response refinement: Argumentation structured response (hypothesis: combining PRs better, argument w/ evidence, collaboration offer). Advocatus-diaboli caught 2 critical issues: claim about proxy proc ID speculative not factual (would've been embarrassing on security PR), "I have tested this in practice" unverifiable. Both removed. Final 40-50% shorter — overexplaining signals insecurity.
Example — System design triage: Argumentation (via Plan agent) designed full 500-line triage pipeline. Advocatus-diaboli killed: at 9 items, premature + system itself would become maintenance burden (recursive trap). Final: 25 lines added to existing script.