Qwen Training Data Miner (Prototype)
```shell
# Clone the full repository
git clone https://github.com/diegosouzapw/awesome-omni-skill

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/design/qwen_training_data_miner_prototype" ~/.claude/skills/diegosouzapw-awesome-omni-skill-qwen-training-data-miner-prototype && rm -rf "$T"
```
skills/design/qwen_training_data_miner_prototype/SKILL.md
Metadata (YAML Frontmatter)
```yaml
skill_id: qwen_training_data_miner_v1_prototype
name: qwen_training_data_miner
description: Mine 012.txt for domain-specific training examples (MPS scoring, WSP patterns, decision rationale)
version: 1.0_prototype
author: 0102_design
created: 2025-10-22
agents: [qwen]
primary_agent: qwen
intent_type: GENERATION
promotion_state: prototype
pattern_fidelity_threshold: 0.90
test_status: needs_validation
```
MCP Orchestration
```yaml
mcp_orchestration: true
breadcrumb_logging: true
owning_dae: doc_dae
execution_phase: 1
next_skill: gemma_domain_trainer_v1_prototype
```
Input/Output Contract
```yaml
inputs:
  - source_file: "O:/Foundups-Agent/012.txt (98,400 lines)"
  - domain: "Target knowledge domain (mps_scoring, wsp_application, roadmap_analysis, etc.)"
  - pattern_type: "Type of pattern to extract (numeric_examples, decision_trees, rationale_chains)"
  - min_examples: "Minimum number of examples to extract (default: 50)"
outputs:
  - data/training_datasets/{domain}_training_data.json: "Instruction-tuning dataset"
  - data/training_datasets/{domain}_pattern_summary.json: "Pattern analysis metadata"
  - execution_id: "Unique execution identifier for breadcrumb tracking"
```
Dependencies
```yaml
dependencies:
  data_stores:
    - name: 012_scrapbook
      type: text
      path: O:/Foundups-Agent/012.txt
  mcp_endpoints:
    - endpoint_name: holo_index
      methods: [semantic_search]
  throttles: []
  required_context:
    - domain: "Knowledge domain to mine"
    - pattern_regex: "Regex pattern for extraction"
```
Metrics Configuration
```yaml
metrics:
  pattern_fidelity_scoring:
    enabled: true
    frequency: every_execution
    scorer_agent: gemma
    write_destination: modules/infrastructure/wre_core/recursive_improvement/metrics/qwen_training_data_miner_fidelity.json
  promotion_criteria:
    min_pattern_fidelity: 0.90
    min_outcome_quality: 0.85
    min_execution_count: 100
    required_test_pass_rate: 0.95
```
Qwen Training Data Miner
Purpose: Mine 012.txt (0102's decision history) for domain-specific training examples to train Gemma models
Intent Type: GENERATION
Agent: qwen (1.5B, 32K context - can hold large sections of 012.txt)
Task
You are Qwen, a training data miner. Your job is to read 012.txt (98,400 lines of 0102's decision-making history) and extract high-quality training examples for specific knowledge domains. You create instruction-tuning datasets that Gemma can learn from.
Key Capability: Pattern recognition, example extraction, quality filtering
Domains You Can Mine:
- mps_scoring - WSP 15 scoring examples with numeric calculations
- wsp_application - How WSPs are applied to real problems
- roadmap_analysis - Project planning, completion tracking
- readme_patterns - Documentation structure, best practices
- modlog_updates - Change documentation patterns
- first_principles - Occam's Razor reasoning chains
Instructions (For Qwen Agent)
1. LOAD SOURCE FILE
Rule: Read 012.txt in chunks (32K token window)
Expected Pattern:
source_loaded=True
Steps:
- Open O:/Foundups-Agent/012.txt
- Count total lines (should be ~98,400)
- Calculate chunk size (fit within 32K context)
- Load first chunk for analysis
- Log:
{"pattern": "source_loaded", "value": true, "total_lines": 98400, "chunk_size": 5000}
2. IDENTIFY DOMAIN PATTERNS
Rule: Search for domain-specific patterns using regex and semantic matching
Expected Pattern:
domain_patterns_identified=True
Domain-Specific Patterns:
MPS Scoring Domain
```python
patterns = [
    r"MPS.*Score:?\s*(\d+)",
    r"Complexity.*(\d)\s*,?\s*Importance.*(\d)\s*,?\s*Deferability.*(\d)\s*,?\s*Impact.*(\d)",
    r"Priority:?\s*(P[0-4])",
    r"MPS.*\(C:(\d),\s*I:(\d),\s*D:(\d),\s*P:(\d)\)",
]
```
WSP Application Domain
```python
patterns = [
    r"WSP\s*(\d+).*compliance",
    r"WSP\s*(\d+).*violation",
    r"following\s+WSP\s*(\d+)",
    r"applied\s+WSP\s*(\d+)",
]
```
Roadmap Analysis Domain
```python
patterns = [
    r"roadmap.*complete",
    r"roadmap.*incomplete",
    r"roadmap.*needs.*update",
    r"Phase\s*(\d+).*status",
    r"TODO.*implement",
]
```
Steps:
- For target domain, compile regex patterns
- Scan 012.txt for matching lines
- Extract context (5 lines before/after match)
- Store matches with line numbers
- Log:
{"pattern": "domain_patterns_identified", "value": true, "matches_found": N}
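The scan-with-context step above can be sketched as follows, assuming a chunk is held as a list of strings; `scan_for_patterns` is an illustrative name:

```python
import re

def scan_for_patterns(lines, patterns, context=5):
    """Return regex matches with 1-based line numbers and +/-`context` lines of text."""
    compiled = [re.compile(p) for p in patterns]
    matches = []
    for idx, line in enumerate(lines):
        for rx in compiled:
            m = rx.search(line)
            if m:
                lo, hi = max(0, idx - context), min(len(lines), idx + context + 1)
                matches.append({
                    "line_number": idx + 1,  # 1-based, kept for source traceability
                    "match": m.group(0),
                    "context": lines[lo:hi],
                })
                break  # record each line once even if several patterns hit
    return matches
```

When scanning a chunk rather than the whole file, the chunk's starting line offset would need to be added to `line_number`.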
3. EXTRACT TRAINING EXAMPLES
Rule: Convert matched patterns into instruction-tuning format
Expected Pattern:
examples_extracted=True
Instruction-Tuning Format:
```json
{
  "instruction": "Apply WSP 15 MPS scoring to this task",
  "input": {
    "task_description": "Migrate agent_permissions to SQLite",
    "context": "Database consolidation, 21 tests passing, high priority"
  },
  "output": {
    "complexity": 3,
    "complexity_reason": "Moderate - requires schema design + migration",
    "importance": 5,
    "importance_reason": "Essential - blocks other migrations",
    "deferability": 5,
    "deferability_reason": "Cannot defer - P0 priority",
    "impact": 4,
    "impact_reason": "Major - enables autonomous permission system",
    "mps_total": 17,
    "priority": "P0"
  },
  "source": "012.txt:line_5234",
  "quality_score": 0.95
}
```
Steps:
- For each match, extract:
- Instruction: What task is being performed?
- Input: What context/data is provided?
- Output: What is the correct answer/decision?
- Source: Line number for verification
- Quality filter:
- Complete examples only (has input + output)
- Clear reasoning (not ambiguous)
- Correct format (follows pattern)
- Assign quality score (0.0-1.0)
- Log:
{"pattern": "examples_extracted", "value": true, "total_examples": N, "high_quality": M}
4. QUALITY FILTERING
Rule: Only keep examples with quality_score >= 0.85
Expected Pattern:
quality_filtering_applied=True
Quality Criteria:
- ✅ Complete (has instruction + input + output)
- ✅ Clear reasoning (rationale provided)
- ✅ Correct format (matches instruction-tuning schema)
- ✅ Verifiable (can trace back to source line)
- ✅ Unambiguous (single correct interpretation)
Steps:
- Review each extracted example
- Score on 5 criteria (0.2 per criterion)
- Filter: keep only if score >= 0.85 (at 0.2 per criterion, 4/5 scores only 0.80 and is filtered)
- Remove duplicates (same input/output pattern)
- Log:
{"pattern": "quality_filtering_applied", "value": true, "kept": N, "filtered": M}
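A sketch of this scoring and dedup pass, with 0.2 per criterion as specified; the individual criterion checks here are illustrative placeholders for the agent's actual heuristics:

```python
def quality_score(example):
    """Score an extracted example 0.0-1.0, 0.2 per criterion."""
    criteria = [
        bool(example.get("instruction")),                             # complete: instruction
        bool(example.get("input")) and bool(example.get("output")),   # complete: input + output
        "reason" in str(example.get("output", "")),                   # clear reasoning present
        bool(example.get("source")),                                  # verifiable source line
        example.get("ambiguous") is not True,                         # unambiguous
    ]
    return round(sum(criteria) * 0.2, 2)

def filter_examples(examples, threshold=0.85):
    """Keep high-quality examples and drop exact input/output duplicates."""
    seen, kept = set(), []
    for ex in examples:
        key = (str(ex.get("input")), str(ex.get("output")))  # exact-match dedup
        if quality_score(ex) >= threshold and key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```

Semantic deduplication beyond exact matching is listed later under future improvements.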
5. GENERATE PATTERN SUMMARY
Rule: Analyze extracted examples for meta-patterns
Expected Pattern:
pattern_summary_generated=True
Summary Metadata:
```json
{
  "domain": "mps_scoring",
  "total_examples": 73,
  "high_quality_examples": 58,
  "quality_distribution": {
    "0.95-1.0": 23,
    "0.90-0.94": 20,
    "0.85-0.89": 15
  },
  "common_patterns": [
    "P0 tasks: MPS 16-20 (23 examples)",
    "P1 tasks: MPS 13-15 (19 examples)",
    "Complexity 3-4 most common (database migrations, refactoring)"
  ],
  "coverage_analysis": {
    "p0_examples": 23,
    "p1_examples": 19,
    "p2_examples": 12,
    "p3_examples": 3,
    "p4_examples": 1
  },
  "recommended_use": "Train Gemma on MPS scoring for cleanup tasks, project prioritization"
}
```
Steps:
- Count examples by category/pattern
- Identify common themes
- Assess coverage (are all cases represented?)
- Generate training recommendations
- Log:
{"pattern": "pattern_summary_generated", "value": true}
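The counting in this step can be sketched with `collections.Counter`; the bucket edges follow the quality_distribution keys in the example summary above, and `summarize` is an illustrative name:

```python
from collections import Counter

def summarize(examples):
    """Bucket quality scores and count priority coverage for the summary JSON."""
    dist, coverage = Counter(), Counter()
    for ex in examples:
        q = ex.get("quality_score", 0.0)
        if q >= 0.95:
            dist["0.95-1.0"] += 1
        elif q >= 0.90:
            dist["0.90-0.94"] += 1
        elif q >= 0.85:
            dist["0.85-0.89"] += 1
        p = ex.get("output", {}).get("priority")
        if p:
            coverage[f"{p.lower()}_examples"] += 1
    return {
        "total_examples": len(examples),
        "quality_distribution": dict(dist),
        "coverage_analysis": dict(coverage),
    }
```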
6. WRITE TRAINING DATASET
Rule: Output JSON file with instruction-tuning examples
Expected Pattern:
training_dataset_written=True
Output Format (EXECUTION-READY per First Principles):
```json
{
  "dataset_id": "mps_scoring_training_v1",
  "created": "2025-10-22T02:30:00Z",
  "source": "012.txt (lines 1-98400)",
  "domain": "mps_scoring",
  "total_examples": 58,
  "quality_threshold": 0.85,
  "domain_priority_mps": {
    "complexity": 2,
    "complexity_reason": "Easy - pattern extraction from 012.txt",
    "importance": 4,
    "importance_reason": "Critical - enables autonomous MPS scoring",
    "deferability": 3,
    "deferability_reason": "Moderate - other wardrobes can be trained first",
    "impact": 5,
    "impact_reason": "Critical - foundation for cleanup automation",
    "total": 14,
    "priority": "P1",
    "training_order": 1
  },
  "examples": [
    {
      "example_id": "mps_001",
      "instruction": "...",
      "input": {...},
      "output": {...},
      "source": "012.txt:line_5234",
      "quality_score": 0.95
    },
    ...
  ],
  "metadata": {
    "pattern_summary": {...},
    "coverage_analysis": {...},
    "recommended_use": "..."
  },
  "recommended_wardrobe_config": {
    "wardrobe_id": "gemma_mps_scorer_v1",
    "lora_rank": 8,
    "learning_rate": 0.0002,
    "epochs": 3,
    "expected_accuracy": 0.87,
    "use_cases": [
      "Cleanup task prioritization",
      "Project scoring",
      "Issue triage"
    ]
  },
  "autonomous_execution": {
    "capable": true,
    "agent": "gemma_domain_trainer_v1",
    "confidence": 0.90,
    "estimated_tokens": 200,
    "estimated_time_seconds": 600,
    "requires_0102_approval": false,
    "execution_command": "python -m modules.infrastructure.wsp_orchestrator.src.wsp_orchestrator --skill gemma_domain_trainer --domain mps_scoring --dataset data/training_datasets/mps_scoring_training_data.json"
  },
  "verification": {
    "verify_command": "test -f data/training_datasets/mps_scoring_training_data.json && jq '.total_examples' data/training_datasets/mps_scoring_training_data.json",
    "success_criteria": "File exists + total_examples >= 50 + quality_threshold >= 0.85",
    "validation_script": "python -c \"import json; d=json.load(open('data/training_datasets/mps_scoring_training_data.json')); assert d['total_examples'] >= 50; assert d['quality_threshold'] >= 0.85; print('✓ Dataset validated')\""
  },
  "learning_feedback": {
    "pattern_extraction_stats": {
      "total_patterns_found": 73,
      "high_quality_kept": 58,
      "filter_rate": 0.79,
      "common_filter_reasons": [
        "Incomplete example (missing rationale) - 8 filtered",
        "Ambiguous input - 5 filtered",
        "Duplicate pattern - 2 filtered"
      ]
    },
    "domain_insights": [
      "P0 tasks: MPS 16-20 (23 examples) - database migrations, critical bugs",
      "P1 tasks: MPS 13-15 (19 examples) - feature requests, refactoring",
      "Complexity 3-4 most common - moderate difficulty tasks"
    ],
    "future_improvements": [
      "Add semantic deduplication (beyond exact match)",
      "Extract negative examples (what NOT to do)",
      "Mine multi-step reasoning chains for complex decisions"
    ],
    "store_to": "holo_index/adaptive_learning/training_data_mining_patterns.jsonl"
  }
}
```
Destination:
data/training_datasets/{domain}_training_data.json
Steps:
- Create directory data/training_datasets/ if it does not exist
- Calculate domain_priority_mps (which domain should be trained first?)
- Generate recommended_wardrobe_config (LoRA hyperparameters)
- Write training dataset JSON with all First Principles fields
- Generate autonomous_execution command (can Gemma trainer auto-execute?)
- Create verification script (validate dataset quality)
- Extract learning_feedback (pattern extraction stats + future improvements)
- Log:
{"pattern": "training_dataset_written", "value": true, "file_size_kb": N, "autonomous_ready": true}
First Principles Additions:
- ✅ MPS Scoring: domain_priority_mps determines training order (which wardrobe first?)
- ✅ Agent Mapping: autonomous_execution.agent = gemma_domain_trainer_v1
- ✅ Executable Command: Can pipe to bash to start training automatically
- ✅ Verification: validation_script confirms dataset quality before training
- ✅ Learning Feedback: Stores pattern extraction stats for future mining improvements
- ✅ Recommended Config: Wardrobe hyperparameters (LoRA rank, learning rate, epochs)
Expected Patterns Summary
```json
{
  "execution_id": "exec_qwen_miner_001",
  "skill_id": "qwen_training_data_miner_v1_prototype",
  "patterns": {
    "source_loaded": true,
    "domain_patterns_identified": true,
    "examples_extracted": true,
    "quality_filtering_applied": true,
    "pattern_summary_generated": true,
    "training_dataset_written": true
  },
  "total_examples_extracted": 73,
  "high_quality_examples": 58,
  "execution_time_ms": 3500
}
```
Fidelity Calculation:
fidelity = patterns_executed / 6 (all 6 steps should execute)
Domain Catalog
1. MPS Scoring Domain
Purpose: Train Gemma to apply WSP 15 MPS scoring
Patterns: Numeric scores, priority mapping, rationale
Use Cases: Cleanup prioritization, project planning, issue triage
2. WSP Application Domain
Purpose: Train Gemma to recognize WSP violations and applications
Patterns: WSP references, compliance checks, violation detection
Use Cases: Code review, documentation validation, architecture audits
3. Roadmap Analysis Domain
Purpose: Train Gemma to analyze project roadmaps
Patterns: Phase completion, TODO tracking, update detection
Use Cases: Project status reports, roadmap audits, completion tracking
4. README Patterns Domain
Purpose: Train Gemma to validate README structure
Patterns: Required sections, format consistency, completeness
Use Cases: Documentation quality checks, README generation
5. ModLog Updates Domain
Purpose: Train Gemma to generate ModLog entries
Patterns: Change descriptions, WSP references, rationale
Use Cases: Automated ModLog updates, change tracking
6. First Principles Domain
Purpose: Train Gemma to apply Occam's Razor reasoning
Patterns: Problem simplification, root cause analysis, decision trees
Use Cases: Debugging, architecture design, problem-solving
Benchmark Test Cases
Test Set 1: MPS Scoring Extraction (10 cases)
- Input: "MPS Score: 16" → Expected: Extract as P0 example
- Input: "Complexity: 3, Importance: 5, Deferability: 2, Impact: 4" → Expected: Calculate MPS = 14
- Input: "Priority: P1" → Expected: Map to MPS 13-15 range
- Input: Incomplete example (missing rationale) → Expected: Quality score < 0.85, filtered
- Input: Duplicate example → Expected: Deduplicated
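The MPS arithmetic in these test cases can be checked with a small helper. Only the P0 (16-20) and P1 (13-15) bands are stated in this document, so the catch-all lower band below is an assumption for illustration:

```python
def mps_total(complexity, importance, deferability, impact):
    """MPS total is the simple sum of the four component scores."""
    return complexity + importance + deferability + impact

def mps_priority(total):
    """Map an MPS total to a priority band (lower bands assumed, not documented here)."""
    if total >= 16:
        return "P0"
    if total >= 13:
        return "P1"
    return "P2_or_lower"  # assumption: bands below P1 are not specified in this skill
```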
Test Set 2: WSP Application Extraction (5 cases)
- Input: "Following WSP 15 for scoring" → Expected: Extract WSP 15 application example
- Input: "WSP 64 violation detected" → Expected: Extract violation example
- Input: "WSP compliance: WSP 3, WSP 50" → Expected: Extract multi-WSP compliance
- Input: Ambiguous WSP reference → Expected: Quality score < 0.85
- Input: Clear WSP application with rationale → Expected: Quality score >= 0.90
Test Set 3: Quality Filtering (5 cases)
- Input: Complete example with all fields → Expected: Quality score = 1.0
- Input: Missing rationale → Expected: Quality score = 0.8 (filtered)
- Input: Ambiguous input → Expected: Quality score = 0.6 (filtered)
- Input: Clear but partial example → Expected: Quality score = 0.85 (kept)
- Input: Excellent example with source → Expected: Quality score = 0.95
Total: 20 test cases across 3 categories
Success Criteria
- ✅ Pattern fidelity ≥ 90% (all 6 steps execute)
- ✅ Extract ≥ 50 high-quality examples per domain
- ✅ Quality threshold 0.85+ maintained
- ✅ Zero duplicate examples in output
- ✅ All examples have verifiable source (line number)
- ✅ Pattern summary provides actionable insights
Next Phase: Gemma Training
After extraction, examples feed into the gemma_domain_trainer skill:
- Load training dataset
- Fine-tune Gemma 270M on domain examples
- Validate accuracy on held-out test set
- Deploy trained model for domain-specific tasks
Wardrobe Concept: Training as a Service
Different "training wardrobes" for different knowledge domains:
- qwen_mps_scorer - Trained on MPS scoring examples
- qwen_wsp_auditor - Trained on WSP compliance examples
- qwen_roadmap_tracker - Trained on roadmap analysis examples
- qwen_readme_validator - Trained on README patterns
Each wardrobe:
- Mines 012.txt for domain-specific patterns
- Trains Gemma on extracted examples
- Deploys as reusable skill
- Evolves as more examples accumulate
Meta-skill: qwen_wardrobe_generator - Automates creation of new training wardrobes for any domain!
Status: ✅ Ready for prototype testing - Mine 012.txt for MPS scoring examples first