Claude-skill-registry document-extraction
Extract requirements from existing documents including PDFs, Word docs, meeting transcripts, specifications, and web content. Identifies requirement candidates, categorizes them, and outputs in pre-canonical format.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-extraction" ~/.claude/skills/majiayu000-claude-skill-registry-document-extraction && rm -rf "$T"
skills/data/document-extraction/SKILL.md
Document Extraction Skill
Extract requirements from existing documentation sources for systematic requirement mining.
When to Use This Skill
Keywords: extract requirements, document mining, PDF requirements, transcript analysis, parse document, existing documentation, legacy requirements, competitive analysis
Invoke this skill when:
- Mining requirements from existing documents
- Processing meeting transcripts for requirements
- Extracting requirements from competitor products
- Analyzing regulatory documents for compliance requirements
- Converting legacy documentation to structured requirements
Supported Document Types
| Type | Extension | Extraction Method |
|---|---|---|
| Markdown | .md | Direct Read |
| Text | .txt | Direct Read |
| PDF | .pdf | Read tool (PDF support) |
| Word | .docx | Read tool |
| Web Page | URL | WebFetch tool |
| Meeting Notes | .md, .txt | Transcript patterns |
| Specification | .md, .docx | Requirement patterns |
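The table above can be read as a lookup from source to extraction method. The sketch below shows one way to encode it in Python. The function name `pick_extraction_method` and the `"unsupported"` fallback are illustrative assumptions, and the `.pdf` entry is inferred from the table's "Read tool (PDF support)" row.

```python
from pathlib import PurePosixPath

# Mirrors the "Supported Document Types" table. The ".pdf" key is inferred
# from the "Read tool (PDF support)" row.
EXTRACTION_METHODS = {
    ".md": "Direct Read",
    ".txt": "Direct Read",
    ".pdf": "Read tool (PDF support)",
    ".docx": "Read tool",
}


def pick_extraction_method(source: str) -> str:
    """Return the extraction method for a file path or URL (hypothetical helper)."""
    if source.lower().startswith(("http://", "https://")):
        return "WebFetch tool"
    suffix = PurePosixPath(source).suffix.lower()
    # "unsupported" is an assumed fallback, not part of the skill definition.
    return EXTRACTION_METHODS.get(suffix, "unsupported")
```

Meeting notes and specifications share extensions with plain Markdown and text, so choosing transcript or requirement patterns would need a content check on top of this lookup.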
Extraction Workflow
Step 1: Document Assessment
Analyze the document to determine extraction strategy:
document_assessment:
  path: "{file path or URL}"
  type: "{detected document type}"
  size: "{approximate size}"
  structure:
    has_sections: true|false
    has_lists: true|false
    has_tables: true|false
  quality:
    formal_language: true|false
    clear_requirements: true|false
    needs_interpretation: true|false
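As a rough illustration of the assessment step, the following Python sketch fills the structure and quality fields for Markdown or plain-text input. Every heuristic here is an assumption made for the example (Markdown heading syntax, pipe-delimited tables, "shall"/"must" as a proxy for formal language); the skill itself does not prescribe how these flags are derived, and the `type` field is left out because detection depends on the source.

```python
import re
from typing import Any, Dict


def assess_document(path: str, text: str) -> Dict[str, Any]:
    """Fill document_assessment fields using simple, assumed heuristics."""
    lines = text.splitlines()
    # Markdown-style headings, e.g. "## Scope"
    has_sections = any(re.match(r"#{1,6}\s", ln) for ln in lines)
    # Bulleted ("- item") or numbered ("1. item") list entries
    has_lists = any(re.match(r"\s*(?:[-*+]|\d+[.)])\s", ln) for ln in lines)
    # Pipe-delimited table rows
    has_tables = any(
        ln.lstrip().startswith("|") and ln.count("|") >= 2 for ln in lines
    )
    # "shall"/"must" as a rough proxy for formal requirement language
    formal = re.search(r"\b(?:shall|must)\b", text, re.IGNORECASE) is not None
    return {
        "path": path,
        "size": f"{len(text)} characters",
        "structure": {
            "has_sections": has_sections,
            "has_lists": has_lists,
            "has_tables": has_tables,
        },
        "quality": {
            "formal_language": formal,
            "clear_requirements": formal,
            "needs_interpretation": not formal,
        },
    }
```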
Step 2: Pattern Matching
Apply requirement detection patterns:
Explicit Requirement Markers:
- "The system shall..."
- "The system must..."
- "Users should be able to..."
- "REQ-XXX:"
- Numbered requirements (1.1, 1.2, etc.)
EARS Patterns:
- "When [trigger], the [system] shall [response]"
- "While [state], the [system] shall [behavior]"
- "Where [feature], the [system] shall [behavior]"
- "If [condition], then the [system] shall [response]"
Implicit Requirement Indicators:
- "It is important that..."
- "We need..."
- "The goal is to..."
- "Users expect..."
- "Performance should..."
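The three pattern groups above translate fairly directly into regular expressions. The sketch below is one possible encoding, not the skill's actual matcher: the regexes, the function name `classify_sentence`, and the precedence (EARS first, then explicit, then implicit) are assumptions chosen for illustration.

```python
import re
from typing import Optional

# EARS: sentence opens with a trigger keyword and later contains "shall".
EARS = re.compile(r"^(?:when|while|where|if)\b.+\bshall\b", re.IGNORECASE)
# Explicit markers: modal verbs, "REQ-XXX:" labels, or dotted numbering.
EXPLICIT = re.compile(
    r"\b(?:shall|must)\b|\bshould be able to\b|^REQ-\w+:|^\d+(?:\.\d+)+\s",
    re.IGNORECASE,
)
# Implicit indicators: softer phrasing that hints at a requirement.
IMPLICIT = re.compile(
    r"\b(?:it is important that|we need|the goal is to|users expect"
    r"|performance should)\b",
    re.IGNORECASE,
)


def classify_sentence(sentence: str) -> Optional[str]:
    """Return "ears", "explicit", "implicit", or None for a single sentence."""
    s = sentence.strip()
    if EARS.search(s):
        return "ears"
    if EXPLICIT.search(s):
        return "explicit"
    if IMPLICIT.search(s):
        return "implicit"
    return None
```

Checking EARS before the explicit markers matters because every EARS sentence also contains "shall" and would otherwise be reported only as an explicit match.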
Step 3: Requirement Extraction
For each identified requirement:
extracted_requirement:
  id: REQ-{sequence}
  text: "{cleaned requirement statement}"
  source: document
  source_file: "{file path}"
  source_location: "{section/page/line}"
  original_text: "{exact text from document}"
  type: functional|non-functional|constraint|assumption
  confidence: high|medium|low
  extraction_method: explicit|pattern|inferred
  needs_review: true|false
  review_notes: "{why review needed}"
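For readers who want to hold these records in code, a minimal Python sketch of the same shape follows. The dataclass and the `make_requirement` helper are hypothetical. The three-digit zero padding of the ID follows the `REQ-001` style used elsewhere in this document, and flagging every non-high-confidence item for review is an assumed policy, not something the skill mandates.

```python
from dataclasses import dataclass


@dataclass
class ExtractedRequirement:
    id: str
    text: str
    source_file: str
    source_location: str
    original_text: str
    type: str               # functional | non-functional | constraint | assumption
    confidence: str         # high | medium | low
    extraction_method: str  # explicit | pattern | inferred
    needs_review: bool = False
    review_notes: str = ""
    source: str = "document"


def make_requirement(sequence: int, original_text: str, source_file: str,
                     source_location: str, type: str, confidence: str,
                     extraction_method: str) -> ExtractedRequirement:
    """Build a record, keeping the exact source text alongside a cleaned copy."""
    cleaned = " ".join(original_text.split())  # collapse whitespace and line breaks
    needs_review = confidence != "high"        # assumed review policy
    return ExtractedRequirement(
        id=f"REQ-{sequence:03d}",
        text=cleaned,
        source_file=source_file,
        source_location=source_location,
        original_text=original_text,
        type=type,
        confidence=confidence,
        extraction_method=extraction_method,
        needs_review=needs_review,
        review_notes="" if not needs_review else f"{confidence} confidence extraction",
    )
```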
Step 4: Categorization
Categorize extracted requirements:
categories:
  functional:
    - features
    - behaviors
    - interactions
  non_functional:
    - performance
    - security
    - usability
    - reliability
    - scalability
  constraints:
    - technical
    - business
    - regulatory
  assumptions:
    - environmental
    - user_behavior
    - dependencies
Step 5: Deduplication
Identify and merge duplicate requirements:
deduplication:
  strategy: semantic_similarity
  threshold: 0.8
  action: merge|flag_for_review
  merged_requirements:
    - id: REQ-merged-001
      sources: [REQ-001, REQ-015]
      text: "{consolidated requirement}"
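To make the threshold concrete, here is a small Python sketch that flags candidate duplicate pairs. Note the substitution: it uses character-level similarity from the standard library's `difflib.SequenceMatcher` as a stand-in, which is not true semantic similarity. A real implementation would more likely compare embeddings. The function name `find_duplicates` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def find_duplicates(requirements: Dict[str, str],
                    threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Return ID pairs whose texts score at or above the threshold.

    Uses character-level similarity as a stand-in for semantic similarity.
    """
    ids = list(requirements)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            ratio = SequenceMatcher(
                None,
                requirements[first].lower(),
                requirements[second].lower(),
            ).ratio()
            if ratio >= threshold:
                pairs.append((first, second))
    return pairs
```

Each returned pair could then be merged or flagged for review, matching the `action` field above. The pairwise loop is quadratic in the number of requirements, which is acceptable for a single document but would need blocking or indexing for large corpora.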
Document-Specific Strategies
Meeting Transcripts
transcript_extraction:
  focus_on:
    - Action items
    - Decisions made
    - Requirements discussed
    - Concerns raised
  patterns:
    - "We decided that..."
    - "The requirement is..."
    - "Action item:"
    - "TODO:"
    - "Need to..."
  speaker_context:
    - Note who said what
    - Weight by speaker role
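A minimal Python sketch of the transcript scan is shown below. It assumes each transcript line has the form `Speaker: utterance`, which is a simplification; real transcripts vary, and lines without a speaker prefix are skipped here. The function name `scan_transcript` is hypothetical, and weighting by speaker role is left out because the skill does not define the weights.

```python
import re
from typing import Dict, List

# The transcript patterns listed above, matched case-insensitively.
TRANSCRIPT_PATTERNS = re.compile(
    r"we decided that|the requirement is|action item:|todo:|need to",
    re.IGNORECASE,
)


def scan_transcript(transcript: str) -> List[Dict[str, str]]:
    """Collect utterances that match a pattern, keeping track of who said them."""
    hits = []
    for line in transcript.splitlines():
        # Assumed line format: "Speaker: utterance"
        speaker, sep, utterance = line.partition(":")
        if not sep:
            continue  # no speaker prefix, skip the line
        if TRANSCRIPT_PATTERNS.search(utterance):
            hits.append({"speaker": speaker.strip(), "text": utterance.strip()})
    return hits
```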
Regulatory Documents
regulatory_extraction:
  focus_on:
    - Mandatory requirements ("shall", "must")
    - Prohibited actions ("shall not", "must not")
    - Conditional requirements ("if...then")
  compliance_mapping:
    - Reference section numbers
    - Note effective dates
    - Track version/revision
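The three regulatory categories can be distinguished with a few ordered checks, sketched here in Python. The function name `classify_clause` and the `"informational"` fallback label are assumptions for the example. Prohibitions are tested first because "shall not" also contains "shall".

```python
import re


def classify_clause(clause: str) -> str:
    """Label a regulatory clause as prohibited, conditional, mandatory, or informational."""
    c = clause.lower()
    if re.search(r"\b(?:shall|must) not\b", c):
        return "prohibited"
    if re.search(r"\bif\b.*\bthen\b", c):
        return "conditional"
    if re.search(r"\b(?:shall|must)\b", c):
        return "mandatory"
    # Assumed fallback for clauses with no requirement language.
    return "informational"
```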
Competitor Analysis
competitor_extraction:
  focus_on:
    - Feature descriptions
    - User capabilities
    - Unique selling points
  output:
    - Feature requirements
    - Differentiation opportunities
    - Gap identification
  confidence: low  # Based on external observation
Legacy Specifications
legacy_extraction:
  focus_on:
    - Existing requirements
    - System behaviors
    - Integration points
  modernization:
    - Update terminology
    - Convert to EARS format
    - Flag deprecated requirements
Output Format
Per-Document Output
extraction_result:
  source:
    file: "{path or URL}"
    type: "{document type}"
    extraction_date: "{ISO-8601}"
    confidence: high|medium|low
  statistics:
    total_candidates: {number}
    extracted: {number}
    filtered: {number}
    needs_review: {number}
  requirements:
    - id: REQ-{number}
      text: "{requirement}"
      type: functional|non-functional|constraint
      source_location: "{section/page}"
      confidence: high|medium|low
      original_text: "{exact source text}"
  review_items:
    - requirement_id: REQ-{number}
      reason: "{why review needed}"
      suggestion: "{proposed action}"
  metadata:
    sections_processed: {number}
    extraction_patterns_used: ["{pattern names}"]
Autonomy Levels
Guided Mode
guided_behavior:
  document_selection: Human selects
  extraction_strategy: AI suggests, human approves
  each_requirement: AI highlights, human confirms
  categorization: AI suggests, human validates
Semi-Autonomous Mode
semi_auto_behavior:
  document_selection: AI suggests priority, human approves list
  extraction_strategy: AI chooses autonomously
  requirements: AI extracts all, human reviews in batches
  categorization: AI categorizes, human spot-checks
Fully Autonomous Mode
full_auto_behavior:
  document_selection: AI processes all relevant
  extraction_strategy: AI optimizes per document
  requirements: AI extracts, deduplicates, categorizes
  output: Full extraction report for final review
Quality Indicators
High Confidence Extraction
- Explicit requirement markers ("shall", "must")
- EARS-pattern matches
- Numbered requirement lists
- Clear imperative statements
Medium Confidence Extraction
- Implicit indicators ("should", "needs to")
- Context-dependent interpretation
- Partial pattern matches
- Requires domain knowledge
Low Confidence Extraction
- Inferred from descriptions
- Narrative text interpretation
- Competitive analysis
- Assumptions based on context
Delegation
For related tasks, delegate to:
- gap-analysis: Check extracted requirements for completeness
- domain-research: Research unfamiliar terms or concepts
- elicitation-methodology: Route back for technique selection
Output Location
Save extraction results to:
.requirements/{domain}/documents/DOC-{filename}-{timestamp}.yaml
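The path template above leaves the exact form of `{filename}` and `{timestamp}` open. The Python sketch below shows one way to build it, assuming the filename is the source file's stem (name without extension) and the timestamp is a compact UTC stamp such as `20240102T030405Z`. Both choices, and the function name `output_path`, are illustrative assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def output_path(domain: str, source_file: str,
                now: Optional[datetime] = None) -> Path:
    """Build .requirements/{domain}/documents/DOC-{filename}-{timestamp}.yaml."""
    # Assumed timestamp format: compact UTC, safe for use in file names.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    # Assumed {filename}: the source file name without its extension.
    stem = Path(source_file).stem
    return Path(".requirements") / domain / "documents" / f"DOC-{stem}-{stamp}.yaml"
```

Passing `now` explicitly keeps the function deterministic in tests; omitting it uses the current UTC time.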
Related
- elicitation-methodology: Parent hub skill
- gap-analysis: Post-extraction completeness checking
- interview-conducting: Clarify extracted requirements with stakeholders