install
Source · Clone the upstream repo:

```sh
git clone https://github.com/jmagly/aiwg
```

Claude Code · Install into ~/.claude/skills/:

```sh
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/jmagly/aiwg "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.agents/skills/research-quality" ~/.claude/skills/jmagly-aiwg-research-quality \
  && rm -rf "$T"
```
Manifest: .agents/skills/research-quality/SKILL.md
Research Quality Command
Perform systematic GRADE evidence quality assessment on research sources.
Instructions
When invoked, perform rigorous quality assessment:
1. Load Source
   - Accept REF-XXX identifier or file path
   - Load PDF and finding document
   - Extract frontmatter metadata
   - Determine source type and baseline quality
2. Apply GRADE Framework

   Baseline Quality (by source type):
   - Systematic review / Meta-analysis → HIGH
   - Randomized controlled trial → HIGH
   - Cohort study → MODERATE
   - Case-control study → MODERATE
   - Case series → LOW
   - Expert opinion → LOW
   - Preprint / Blog post → VERY LOW

   Downgrade Factors (each -1 level):
   - Risk of bias (study design flaws, conflicts of interest)
   - Inconsistency (conflicting results across studies)
   - Indirectness (different population, indirect comparisons)
   - Imprecision (small sample size, wide confidence intervals)
   - Publication bias (selective reporting)

   Upgrade Factors (each +1 level):
   - Large effect size (strong effect, dose-response relationship)
   - Confounding works against the finding (makes the result conservative)
   - Dose-response gradient present
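The baseline table above amounts to a lookup from source type to starting quality. A minimal sketch (the key names are illustrative identifiers, not the skill's actual source-type vocabulary):

```python
# Baseline GRADE quality by source type, mirroring the table above.
# Key names are illustrative, not the skill's actual vocabulary.
BASELINE_QUALITY = {
    "systematic_review": "HIGH",
    "meta_analysis": "HIGH",
    "randomized_controlled_trial": "HIGH",
    "cohort_study": "MODERATE",
    "case_control_study": "MODERATE",
    "case_series": "LOW",
    "expert_opinion": "LOW",
    "preprint": "VERY LOW",
    "blog_post": "VERY LOW",
}
```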
3. Calculate Final GRADE

   Final GRADE = Baseline + Upgrades - Downgrades

   - HIGH: Strong confidence, unlikely to change with new evidence
   - MODERATE: Moderate confidence, may change with new evidence
   - LOW: Limited confidence, likely to change with new evidence
   - VERY LOW: Very uncertain, any estimate is very uncertain
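The Baseline + Upgrades - Downgrades arithmetic above clamps to the four GRADE levels. A minimal sketch, illustrative rather than the skill's actual implementation:

```python
# The four GRADE levels, ordered from weakest to strongest.
LEVELS = ["VERY LOW", "LOW", "MODERATE", "HIGH"]

def final_grade(baseline: str, downgrades: int, upgrades: int) -> str:
    """Apply baseline + upgrades - downgrades, clamped to the level scale."""
    idx = LEVELS.index(baseline) + upgrades - downgrades
    return LEVELS[max(0, min(idx, len(LEVELS) - 1))]
```

Applied to the REF-022 worked example in this document, `final_grade("MODERATE", downgrades=1, upgrades=0)` yields "LOW".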
4. Generate Assessment Report
   - Document baseline quality
   - List all downgrade/upgrade factors with justification
   - Calculate final GRADE level
   - Provide hedging language recommendations
   - Assess AIWG applicability
5. Save Assessment
   - Save to .aiwg/research/quality-assessments/REF-XXX-assessment.yaml
   - Update frontmatter in finding document if --update-frontmatter
   - Log in quality assessment index
6. Check Existing Citations
   - If --check-citations, scan corpus for citations of this source
   - Flag any violations (overclaiming beyond the GRADE level)
   - Generate remediation suggestions
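The overclaiming check can be as simple as phrase matching. A hypothetical sketch: the "too strong" phrases are taken from the example output in this document, and the function name is illustrative:

```python
# Hypothetical sketch of the --check-citations hedging scan: flag phrasing
# that overclaims relative to a source's GRADE level. Phrases follow the
# hedging guidance in the example output.
TOO_STRONG_FOR_LOW = [
    "research demonstrates",
    "evidence proves",
    "studies conclusively show",
    "studies prove",
]

def find_violations(text: str, grade: str) -> list[str]:
    """Return overclaiming phrases found in text for a LOW/VERY LOW source."""
    if grade not in ("LOW", "VERY LOW"):
        return []
    lowered = text.lower()
    return [p for p in TOO_STRONG_FOR_LOW if p in lowered]
```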
Arguments
- [ref-id or file-path] - Source to assess (required)
- --output [yaml|markdown] - Output format (default: yaml)
- --update-frontmatter - Update finding document frontmatter with assessment
- --check-citations - Scan corpus for citation policy violations
- --interactive - Interactive assessment with prompts for each factor
Examples
```sh
# Basic quality assessment
/research-quality REF-022

# Assessment with frontmatter update
/research-quality REF-022 --update-frontmatter

# Interactive assessment
/research-quality REF-022 --interactive

# Assessment with citation check
/research-quality REF-022 --check-citations --output markdown
```
Expected Output
```
Assessing Quality: REF-022 - AutoGen
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Determining baseline
  Source Type: arXiv preprint (later published in conference)
  Baseline Quality: MODERATE (conference paper)
  Note: Upgraded from VERY LOW due to peer review

Step 2: Applying GRADE framework

Downgrade Factors:
  [✓] Risk of Bias: -0 (no significant bias detected)
      - Study design appropriate
      - No apparent conflicts of interest
      - Methodology clearly described
  [✓] Inconsistency: -0 (single study, no comparison)
      - No conflicting results to evaluate
  [!] Indirectness: -0 (directly applicable)
      - Population: Software development teams
      - Intervention: Multi-agent conversation framework
      - Direct relevance to AIWG agent orchestration
  [!] Imprecision: -1 (limited evaluation scope)
      - Small benchmark dataset
      - Limited real-world validation
      - No confidence intervals reported
  [✓] Publication Bias: -0 (mitigated)
      - Open preprint, full methodology disclosed
      - Negative results discussed

Upgrade Factors:
  [!] Large Effect: +0 (moderate effect size)
      - Improvements shown but not exceptionally large
      - Effect sizes: 10-30% improvement range
  [✓] Dose-Response: +0 (not applicable)
      - No dose-response relationship to evaluate
  [✓] Confounding: +0 (no clear confounding against)

Step 3: Calculating final GRADE
  Baseline:   MODERATE
  Downgrades: -1 (imprecision)
  Upgrades:   +0
  ─────────────────────────
  Final GRADE: LOW

Step 4: Generating assessment report
  ✓ Assessment saved: .aiwg/research/quality-assessments/REF-022-assessment.yaml
  ✓ Frontmatter updated in finding document
  ✓ Quality index updated

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GRADE Assessment: LOW
Confidence: Limited confidence in effect estimates

Appropriate Hedging Language:
  ✓ USE: "Limited evidence suggests...", "Preliminary findings indicate..."
  ✗ AVOID: "Research demonstrates...", "Evidence proves..."

Rationale: While AutoGen shows promising multi-agent collaboration
patterns, the evidence base is limited to a single study with
small-scale evaluation. Real-world effectiveness at scale requires
further investigation.

AIWG Applicability:
  - Patterns are directly applicable to agent orchestration (HIGH)
  - Implementation risk is moderate due to limited production validation
  - Recommend: Pilot implementation with monitoring

Next Steps:
  1. Monitor for follow-up studies strengthening the evidence base
  2. Plan validation studies within AIWG context
  3. Review citations of REF-022 in corpus:
     /research-quality REF-022 --check-citations
```
Assessment YAML Output
```yaml
# .aiwg/research/quality-assessments/REF-022-assessment.yaml
ref_id: REF-022
assessment_date: "2026-02-03"
assessor: "quality-agent"

source_metadata:
  title: "AutoGen: Enabling Next-Gen LLM Applications..."
  source_type: peer_reviewed_conference
  year: 2023

grade_assessment:
  baseline: MODERATE
  baseline_rationale: "Peer-reviewed conference paper"
  downgrade_factors:
    - factor: imprecision
      severity: -1
      rationale: "Limited evaluation scope, small benchmarks"
  upgrade_factors: []
  final_grade: LOW
  confidence_statement: "Limited confidence in effect estimates"

hedging_language:
  appropriate:
    - "Limited evidence suggests"
    - "Preliminary findings indicate"
    - "Initial research shows"
  inappropriate:
    - "Research demonstrates"
    - "Evidence proves"
    - "Studies conclusively show"

aiwg_applicability:
  relevance: HIGH
  implementation_risk: MODERATE
  recommendation: "Pilot implementation with validation"

citation_guidance:
  template: |
    Research provides preliminary evidence for [claim]
    (@.aiwg/research/findings/REF-022-autogen.md), though broader
    validation is needed (GRADE: LOW).
```
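A consumer of this assessment file might sanity-check required fields before trusting the grade. A stdlib-only sketch operating on the already-parsed mapping (field names follow the example above; the validator itself is hypothetical):

```python
# Hypothetical validator for a parsed assessment record. Field names
# mirror the example YAML above.
VALID_GRADES = {"HIGH", "MODERATE", "LOW", "VERY LOW"}

def validate_assessment(a: dict) -> list[str]:
    """Return a list of problems found in a parsed assessment record."""
    problems = []
    if not a.get("ref_id", "").startswith("REF-"):
        problems.append("ref_id must look like REF-XXX")
    ga = a.get("grade_assessment", {})
    for key in ("baseline", "final_grade"):
        if ga.get(key) not in VALID_GRADES:
            problems.append(f"grade_assessment.{key} invalid: {ga.get(key)!r}")
    return problems
```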
Citation Policy Integration
When --check-citations is used:
```
Checking citations of REF-022 across corpus...

Found 8 citations:

✓ COMPLIANT (5):
  - .aiwg/architecture/agent-orchestration-sad.md:142
    "Research suggests flexible conversation patterns..."
    Hedging: APPROPRIATE for LOW quality
  - .aiwg/requirements/UC-174-conversable-agent.md:23
    "Evidence indicates multi-agent collaboration is feasible..."
    Hedging: APPROPRIATE for LOW quality

✗ VIOLATIONS (3):
  - docs/agent-framework.md:78
    "Research demonstrates significant improvements..."
    Hedging: TOO STRONG for LOW quality
    Suggestion: Change to "Limited evidence suggests..."
  - .aiwg/architecture/adr-012-agent-protocol.md:45
    "Studies prove that conversation patterns enable..."
    Hedging: TOO STRONG for LOW quality
    Suggestion: Change to "Preliminary findings indicate..."

Remediation script generated:
  .aiwg/research/quality-assessments/REF-022-violations.sh
```
Interactive Mode
When --interactive is used, prompts for each factor:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GRADE Assessment: REF-022 (Interactive Mode)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Baseline Quality: MODERATE (conference paper)

────────────────────────────────────────────────────────────────────
Factor 1: Risk of Bias
────────────────────────────────────────────────────────────────────
Evaluate study design quality, conflicts of interest, methodology clarity.

Downgrade by 1 level? [y/N]: n
Rationale: Study design appropriate, no COI detected

────────────────────────────────────────────────────────────────────
Factor 2: Inconsistency
────────────────────────────────────────────────────────────────────
Evaluate consistency across studies (if multiple).

Downgrade by 1 level? [y/N]: n
Rationale: Single study, no comparison available

[... continues for all factors ...]
```
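Each y/N prompt above reduces one factor to a numeric delta on the GRADE scale, defaulting to no change. A tiny hypothetical helper showing that mapping (the function name is illustrative):

```python
# Hypothetical helper: map an interactive y/N answer to a GRADE delta.
# The default (empty or "n") is no downgrade, matching the [y/N] prompt.
def downgrade_delta(answer: str) -> int:
    """Return -1 if the user confirmed a downgrade, else 0."""
    return -1 if answer.strip().lower() in ("y", "yes") else 0
```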
References
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/agents/quality-agent.md - Quality Assessment Agent
- @$AIWG_ROOT/src/research/services/quality-service.ts - GRADE implementation
- @.aiwg/research/docs/grade-assessment-guide.md - GRADE methodology
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/research/quality-dimensions.yaml - Quality schema
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/rules/citation-policy.md - Hedging language requirements