## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/Aradotso/trending-skills
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/openmythos-recurrent-transformer" ~/.claude/skills/aradotso-trending-skills-openmythos-recurrent-transformer && rm -rf "$T"
```

Manifest: `skills/openmythos-recurrent-transformer/SKILL.md`
---
name: openmythos-recurrent-transformer
description: Build and experiment with Recurrent-Depth Transformer (RDT) models using OpenMythos, a theoretical reconstruction of the Claude Mythos architecture with looped transformers, MLA/GQA attention, and sparse MoE.
triggers:
  - implement a looped transformer model
  - build a recurrent depth transformer
  - use OpenMythos for inference time scaling
  - configure MLA or GQA attention in OpenMythos
  - set up mixture of experts with recurrent blocks
  - train a model with adaptive loop iterations
  - generate text with variable reasoning depth
  - explore compute-adaptive transformer architectures
---

# OpenMythos Recurrent-Depth Transformer

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

OpenMythos is an open-source theoretical implementation of a Recurrent-Depth Transformer (RDT) inspired by the Claude Mythos architecture. It divides computation into three stages: **Prelude** (standard transformer layers run once), a **Recurrent Block** (looped up to `max_loop_iters` times with stable injection), and a **Coda** (standard transformer layers run once). Attention is switchable between MLA (Multi-head Latent Attention) and GQA (Grouped Query Attention), and feed-forward layers use sparse MoE with routed and shared experts.

## Installation

```bash
git clone https://github.com/The-Swarm-Corporation/OpenMythos.git
cd OpenMythos
pip install -r requirements.txt
```
## Core Concepts

### Architecture Flow

```
Input → Embedding
   ↓
[Prelude] — N standard transformer layers, run once
   ↓
[Recurrent Block] — looped T times per forward pass
   ↑_________↓   h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
   ↓
[Coda] — N standard transformer layers, run once
   ↓
Output Logits
```
- `e` = encoded input from Prelude (injected every loop to prevent drift)
- `A`, `B` = learned stable injection parameters (spectral radius ρ(A) < 1 enforced; see the sketch below)
- More loops at inference = deeper implicit reasoning (no extra parameters)
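To make the update rule concrete, here is a minimal sketch of the stable-injection recurrence in plain PyTorch. It is illustrative only: the hidden size, the diagonal parameterization of `A`, and the `block` stand-in for the recurrent transformer layers are assumptions made for the sketch, not the actual OpenMythos implementation (see `open_mythos/main.py` for that).

```python
import torch
import torch.nn as nn

dim = 256  # hidden size (illustrative)

# Stand-in for the recurrent transformer layers; the real model runs
# attention + MoE blocks over (h, e) here.
block = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

# Diagonal parameterization of A (an assumption for this sketch):
# rho(A) = max_i |a_i|, and sigmoid bounds every entry in (0, 1),
# so rho(A) < 1 holds by construction.
a_raw = nn.Parameter(torch.zeros(dim))
B = nn.Parameter(torch.randn(dim, dim) * 0.02)

def recurrent_step(h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """One loop iteration: h_{t+1} = A·h_t + B·e + Transformer(h_t, e)."""
    A = torch.sigmoid(a_raw)            # entries in (0, 1)
    injected = h * A + e @ B.T          # stable injection of e
    return injected + block(torch.cat([h, e], dim=-1))

e = torch.randn(2, 16, dim)   # Prelude output, re-injected every loop
h = torch.zeros_like(e)       # recurrent state
for _ in range(8):            # more loops = deeper implicit reasoning
    h = recurrent_step(h, e)
print(h.shape)                # torch.Size([2, 16, 256])
```

Because every entry of `A` is squashed into (0, 1), the homogeneous part of the recurrence contracts, which is why the state stays bounded regardless of how many loops run at inference.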
### Attention Types

- MLA (`"mla"`): Multi-head Latent Attention — uses KV LoRA compression, separate RoPE and NoPE head dimensions
- GQA (`"gqa"`): Grouped Query Attention — fewer KV heads than Q heads, simpler config (see the config fragment below)
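As a quick reference, the fragment below isolates the config fields that differ between the two attention types. The numeric values are illustrative, taken from the usage patterns later in this document.

```python
# Merge either dict into the remaining MythosConfig(...) kwargs.

# GQA: only the head ratio matters; n_heads must divide evenly by n_kv_heads.
gqa_attn = dict(attn_type="gqa", n_heads=8, n_kv_heads=2)

# MLA: full KV heads plus the five compression / head-dimension params.
mla_attn = dict(
    attn_type="mla",
    n_heads=8,
    n_kv_heads=8,
    kv_lora_rank=64,      # KV compression rank
    q_lora_rank=128,      # Q compression rank
    qk_rope_head_dim=32,  # RoPE portion of each head
    qk_nope_head_dim=32,  # NoPE portion of each head
    v_head_dim=32,        # value head dimension
)
```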
## Configuration Reference

### `MythosConfig` Parameters

| Parameter | Type | Description |
|---|---|---|
| `vocab_size` | int | Vocabulary size |
| `dim` | int | Model hidden dimension |
| `n_heads` | int | Number of attention query heads |
| `n_kv_heads` | int | Number of KV heads (GQA ratio = n_heads/n_kv_heads) |
| `max_seq_len` | int | Maximum sequence length |
| `max_loop_iters` | int | Maximum recurrent loop iterations |
| `prelude_layers` | int | Number of Prelude transformer layers |
| `coda_layers` | int | Number of Coda transformer layers |
| `n_experts` | int | Total number of MoE routed experts |
| `n_shared_experts` | int | Always-active shared experts |
| `n_experts_per_tok` | int | Top-K experts selected per token |
| `expert_dim` | int | Hidden dim inside each expert FFN |
| `lora_rank` | int | LoRA rank for injection parameters |
| `attn_type` | str | `"mla"` or `"gqa"` |
| `kv_lora_rank` | int | MLA only: KV compression rank |
| `q_lora_rank` | int | MLA only: Q compression rank |
| `qk_rope_head_dim` | int | MLA only: RoPE head dimension |
| `qk_nope_head_dim` | int | MLA only: NoPE head dimension |
| `v_head_dim` | int | MLA only: Value head dimension |
## Usage Patterns

### Pattern 1: GQA Model (Simpler Config)

```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=32000,
    dim=512,
    n_heads=8,
    n_kv_heads=2,  # GQA: 4x fewer KV heads
    max_seq_len=2048,
    max_loop_iters=8,
    prelude_layers=2,
    coda_layers=2,
    n_experts=16,
    n_shared_experts=2,
    n_experts_per_tok=2,
    expert_dim=256,
    lora_rank=16,
    attn_type="gqa",
)
model = OpenMythos(cfg)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass with 4 reasoning loops
ids = torch.randint(0, cfg.vocab_size, (2, 64))
logits = model(ids, n_loops=4)
print(f"Logits shape: {logits.shape}")  # (2, 64, 32000)
```
### Pattern 2: MLA Model (More Expressive)

```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=32000,
    dim=512,
    n_heads=8,
    n_kv_heads=8,  # MLA uses full heads
    max_seq_len=2048,
    max_loop_iters=16,
    prelude_layers=2,
    coda_layers=2,
    n_experts=16,
    n_shared_experts=2,
    n_experts_per_tok=2,
    expert_dim=256,
    lora_rank=16,
    attn_type="mla",
    # MLA-specific compression dims
    kv_lora_rank=64,
    q_lora_rank=128,
    qk_rope_head_dim=32,
    qk_nope_head_dim=32,
    v_head_dim=32,
)
model = OpenMythos(cfg)

ids = torch.randint(0, cfg.vocab_size, (1, 128))

# More loops = deeper implicit reasoning
logits_shallow = model(ids, n_loops=2)
logits_deep = model(ids, n_loops=16)
print(f"Shallow logits: {logits_shallow.shape}")
print(f"Deep logits: {logits_deep.shape}")
```
### Pattern 3: Text Generation with Variable Depth

```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=50257,
    dim=768,
    n_heads=12,
    n_kv_heads=4,
    max_seq_len=1024,
    max_loop_iters=12,
    prelude_layers=2,
    coda_layers=2,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=512,
    lora_rank=32,
    attn_type="gqa",
)
model = OpenMythos(cfg)
model.eval()

prompt_ids = torch.randint(0, cfg.vocab_size, (1, 16))

# Simple task: fewer loops
with torch.no_grad():
    easy_output = model.generate(
        prompt_ids,
        max_new_tokens=32,
        n_loops=2,  # shallow reasoning
    )

# Complex task: more loops (same model, more compute)
with torch.no_grad():
    hard_output = model.generate(
        prompt_ids,
        max_new_tokens=32,
        n_loops=12,  # deep reasoning
    )

print(f"Easy output shape: {easy_output.shape}")
print(f"Hard output shape: {hard_output.shape}")
```
### Pattern 4: Stability Verification
Always verify the spectral radius constraint after initialization and training:
```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    n_kv_heads=2,
    max_seq_len=128,
    max_loop_iters=8,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    lora_rank=8,
    attn_type="gqa",
)
model = OpenMythos(cfg)

# Check injection matrix spectral radius — must be < 1
A = model.recurrent.injection.get_A()
spectral_radius = A.max().item()
print(f"Spectral radius ρ(A): {spectral_radius:.6f}")
assert spectral_radius < 1.0, "UNSTABLE: spectral radius >= 1!"
print("✓ Stability constraint satisfied")
```
### Pattern 5: Scaling Experiment — Loops vs Quality

```python
import torch
import torch.nn.functional as F
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=1000,
    dim=256,
    n_heads=8,
    n_kv_heads=2,
    max_seq_len=64,
    max_loop_iters=16,
    prelude_layers=1,
    coda_layers=1,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=64,
    lora_rank=8,
    attn_type="gqa",
)
model = OpenMythos(cfg)
model.eval()

ids = torch.randint(0, cfg.vocab_size, (1, 32))
targets = torch.randint(0, cfg.vocab_size, (1, 32))

results = {}
with torch.no_grad():
    for n_loops in [1, 2, 4, 8, 16]:
        logits = model(ids, n_loops=n_loops)
        loss = F.cross_entropy(
            logits.view(-1, cfg.vocab_size),
            targets.view(-1),
        )
        results[n_loops] = loss.item()
        print(f"n_loops={n_loops:2d} → loss={loss.item():.4f}")

# Expect diminishing returns (saturating exponential decay)
print("\nDelta losses (should decrease):")
loop_counts = sorted(results.keys())
for i in range(1, len(loop_counts)):
    delta = results[loop_counts[i - 1]] - results[loop_counts[i]]
    print(f"  {loop_counts[i - 1]}→{loop_counts[i]} loops: Δloss={delta:.4f}")
```
### Pattern 6: Training Loop with Stability Monitoring

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=10000,
    dim=512,
    n_heads=8,
    n_kv_heads=2,
    max_seq_len=256,
    max_loop_iters=8,
    prelude_layers=2,
    coda_layers=2,
    n_experts=8,
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=256,
    lora_rank=16,
    attn_type="gqa",
)
model = OpenMythos(cfg)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def train_step(model, optimizer, input_ids, target_ids, n_loops=4):
    model.train()
    optimizer.zero_grad()
    logits = model(input_ids, n_loops=n_loops)
    loss = F.cross_entropy(
        logits.view(-1, cfg.vocab_size),
        target_ids.view(-1),
    )
    loss.backward()
    # Gradient clipping recommended for looped models
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

def check_stability(model):
    A = model.recurrent.injection.get_A()
    rho = A.max().item()
    return rho

# Example training iterations
for step in range(10):
    batch_size, seq_len = 4, 64
    input_ids = torch.randint(0, cfg.vocab_size, (batch_size, seq_len))
    target_ids = torch.randint(0, cfg.vocab_size, (batch_size, seq_len))

    # Curriculum: start with fewer loops, increase over training
    n_loops = min(2 + step // 3, cfg.max_loop_iters)

    loss = train_step(model, optimizer, input_ids, target_ids, n_loops=n_loops)
    rho = check_stability(model)
    print(f"Step {step:3d} | loss={loss:.4f} | ρ(A)={rho:.4f} | loops={n_loops}")

    if rho >= 1.0:
        print("WARNING: Spectral radius constraint violated!")
```
### Pattern 7: MLA vs GQA Comparison

```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

def build_model(attn_type: str) -> OpenMythos:
    base = dict(
        vocab_size=8000,
        dim=256,
        n_heads=8,
        max_seq_len=128,
        max_loop_iters=6,
        prelude_layers=1,
        coda_layers=1,
        n_experts=8,
        n_shared_experts=1,
        n_experts_per_tok=2,
        expert_dim=64,
        lora_rank=8,
        attn_type=attn_type,
    )
    if attn_type == "gqa":
        cfg = MythosConfig(**base, n_kv_heads=2)
    else:  # mla
        cfg = MythosConfig(
            **base,
            n_kv_heads=8,
            kv_lora_rank=32,
            q_lora_rank=64,
            qk_rope_head_dim=16,
            qk_nope_head_dim=16,
            v_head_dim=16,
        )
    return OpenMythos(cfg)

for attn_type in ["gqa", "mla"]:
    model = build_model(attn_type)
    param_count = sum(p.numel() for p in model.parameters())
    ids = torch.randint(0, 8000, (2, 32))
    logits = model(ids, n_loops=4)
    rho = model.recurrent.injection.get_A().max().item()
    print(f"{attn_type.upper()}: params={param_count:,} | "
          f"logits={logits.shape} | ρ(A)={rho:.4f}")
```
## Key API Reference

### `OpenMythos(config)`

Main model class.

#### `forward(input_ids, n_loops=None)`

- `input_ids`: `LongTensor` of shape `(batch, seq_len)`
- `n_loops`: int, number of recurrent iterations (default: `config.max_loop_iters`)
- Returns: `FloatTensor` of shape `(batch, seq_len, vocab_size)`

#### `generate(input_ids, max_new_tokens, n_loops=None)`

- `input_ids`: `LongTensor` of shape `(batch, seq_len)`
- `max_new_tokens`: int
- `n_loops`: int, loops per forward step during generation
- Returns: `LongTensor` of shape `(batch, seq_len + max_new_tokens)`

#### `model.recurrent.injection.get_A()`

Returns the learned injection matrix `A`. Check `A.max().item() < 1.0` for stability.
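Putting the three entry points together, here is a compact call sequence consistent with the signatures above. It reuses the small GQA config from Pattern 4; the shapes in the comments follow from the Returns lines.

```python
import torch
from open_mythos.main import OpenMythos, MythosConfig

cfg = MythosConfig(
    vocab_size=1000, dim=256, n_heads=8, n_kv_heads=2,
    max_seq_len=128, max_loop_iters=8, prelude_layers=1, coda_layers=1,
    n_experts=8, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=64, lora_rank=8, attn_type="gqa",
)
model = OpenMythos(cfg).eval()

ids = torch.randint(0, cfg.vocab_size, (1, 16))
with torch.no_grad():
    logits = model(ids, n_loops=4)                          # (1, 16, 1000)
    out = model.generate(ids, max_new_tokens=8, n_loops=4)  # (1, 24)
rho = model.recurrent.injection.get_A().max().item()
print(logits.shape, out.shape, f"ρ(A)={rho:.4f}")
```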
## Common Patterns & Best Practices

### Loop Curriculum During Training

Start with fewer loops and increase gradually — helps with initial stability:

```python
n_loops = min(max_loops, 2 + global_step // 1000)
```
### Inference-Time Scaling

Use fewer loops for simple/fast tasks, more for complex reasoning — same weights, adaptive compute:

```python
# Classification / simple completion
logits = model(ids, n_loops=2)

# Multi-step reasoning / hard math
logits = model(ids, n_loops=model.cfg.max_loop_iters)
```
### Gradient Clipping

Always clip gradients when training looped models — this prevents the rare instabilities that arise when gradients backpropagate through many loop unrolls:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
### MoE Load Balancing

The router uses dynamic bias adjustment — no special auxiliary loss term is needed, but monitor expert utilization during training to detect collapse (a sketch follows below).
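One way to do that monitoring: collect the router's top-k expert indices for a batch (how you capture them depends on the OpenMythos router internals, e.g. via a forward hook; the capture step is not shown, and the helper below is a generic sketch, not part of the library API), then check that the routing distribution stays roughly uniform.

```python
import torch

def expert_utilization(topk_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Fraction of routed tokens assigned to each expert.

    topk_idx: LongTensor of routed expert indices, shape (..., top_k),
    captured from the model by whatever mechanism its router exposes.
    """
    counts = torch.bincount(topk_idx.reshape(-1), minlength=n_experts)
    return counts.float() / counts.sum()

# Illustrative check with fake routing decisions: 512 tokens, top-2 of 8.
util = expert_utilization(torch.randint(0, 8, (512, 2)), n_experts=8)
print(util)  # healthy: roughly uniform; collapse: mass on a few experts
```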
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| `ρ(A) >= 1` after training | Injection constraint violated | Lower learning rate; the constraint should be enforced by construction — check model version |
| Loss NaN / explosion | Residual stream diverging | Enable grad clipping (`clip_grad_norm_`); reduce `n_loops` early in training |
| Error on MLA config | Missing MLA-specific params | Ensure `kv_lora_rank`, `q_lora_rank`, `qk_rope_head_dim`, `qk_nope_head_dim`, `v_head_dim` are all set |
| OOM with high `n_loops` | Memory scales with loop unrolls during training | Use `torch.no_grad()` for eval; reduce batch size during high-loop training |
| GQA error | `n_heads` must divide evenly by `n_kv_heads` | Ensure `n_heads % n_kv_heads == 0` |
| Slow generation | Generating with max loops | Reduce `n_loops` in `generate()` for faster inference |
## Project Structure

```
OpenMythos/
├── open_mythos/
│   └── main.py           # OpenMythos, MythosConfig, all sub-modules
├── docs/
│   └── open_mythos.md    # Full API reference
├── requirements.txt
└── README.md
```