asi · entropy-sim2real
Entropy-driven sim2real transfer. Uses maximum entropy RL, domain randomization, and information-theoretic bridging to close the reality gap.
Install
source · Clone the upstream repo
git clone https://github.com/plurigrid/asi
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/asi/skills/entropy-sim2real" ~/.claude/skills/plurigrid-asi-entropy-sim2real && rm -rf "$T"
manifest:
plugins/asi/skills/entropy-sim2real/SKILL.md
Entropy-Driven Sim2Real Transfer
Trit: -1 (MINUS - analysis/verification)
Color: #E85B8E (Rose Pink)
URI: skill://entropy-sim2real#E85B8E
Core Insight
Entropy bridges the sim-real gap by:
- Maximizing entropy in simulation → Policy sees diverse conditions
- Minimizing entropy at deployment → Uncertainty collapses to reality
- Information-theoretic alignment → Match distributions, not parameters
```
SIMULATION                                     REALITY
High Entropy ────────────────────────────────▶ Low Entropy

H(params) = max       ══════════▶    H(params) ≈ 0
H(π|s)    = high      ══════════▶    H(π|s)    = focused
p(sim)    = broad     ══════════▶    p(real)   = delta

┌─────────────────┐                  ┌─────────────────┐
│  MANY POSSIBLE  │      BRIDGE      │   ONE ACTUAL    │
│     WORLDS      │──────────────────│      WORLD      │
│   (superpos.)   │                  │   (collapsed)   │
└─────────────────┘                  └─────────────────┘
```
Three Entropy Mechanisms
1. Domain Randomization Entropy
Maximize entropy over simulation parameters:
```python
import jax
import jax.numpy as jnp
from typing import Dict


class EntropyMaximizingRandomizer:
    """Domain randomization that maximizes parameter entropy."""

    def __init__(self, param_ranges: Dict[str, tuple]):
        self.param_ranges = param_ranges

    def entropy(self, distribution: str = "uniform") -> float:
        """Compute the total entropy of the parameter distributions."""
        H = 0.0
        for name, (low, high) in self.param_ranges.items():
            if distribution == "uniform":
                # H(Uniform(a, b)) = log(b - a)
                H += jnp.log(high - low)
            elif distribution == "gaussian":
                # H(Gaussian) = 0.5 * log(2πeσ²)
                sigma = (high - low) / 4  # ~95% of mass within the range
                H += 0.5 * jnp.log(2 * jnp.pi * jnp.e * sigma**2)
        return H

    def sample(self, key: jax.Array) -> Dict[str, float]:
        """Sample parameters to maximize coverage."""
        params = {}
        for i, (name, (low, high)) in enumerate(self.param_ranges.items()):
            k = jax.random.fold_in(key, i)
            # The uniform distribution maximizes entropy on bounded support
            params[name] = jax.random.uniform(k, minval=low, maxval=high)
        return params

    def adaptive_entropy(
        self,
        key: jax.Array,
        real_samples: jnp.ndarray,
        temperature: float = 1.0,
    ) -> Dict[str, float]:
        """Adapt randomization to maximize coverage of the real distribution.

        Uses the maximum entropy principle: pick the distribution with the
        highest entropy subject to matching the observed moments.
        """
        # Estimate the real distribution's moments
        real_mean = jnp.mean(real_samples, axis=0)
        real_var = jnp.var(real_samples, axis=0)

        # The max-entropy distribution matching mean and variance is Gaussian
        params = {}
        for i, (name, _) in enumerate(self.param_ranges.items()):
            k = jax.random.fold_in(key, i)
            # Sample from a Gaussian matching the real moments
            params[name] = jax.random.normal(k) * jnp.sqrt(real_var[i]) + real_mean[i]
        return params
```
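A minimal usage sketch of the randomizer above; the parameter names and ranges are illustrative, not part of the skill:

```python
import jax

# Hypothetical ranges for a legged-robot sim (illustrative values only)
randomizer = EntropyMaximizingRandomizer({
    "friction": (0.3, 1.5),
    "mass_scale": (0.8, 1.2),
})

params = randomizer.sample(jax.random.PRNGKey(0))  # one max-entropy draw
H = randomizer.entropy("uniform")                  # log(1.2) + log(0.4)
```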
2. Maximum Entropy RL
Policy optimization with entropy regularization:
```python
import jax
import jax.numpy as jnp


class MaxEntropyPPO:
    """PPO with an entropy bonus for robust sim2real.

    Objective: max E[Σ γᵗ(rₜ + α·H(π(·|sₜ)))]

    High entropy → diverse actions → robust to perturbations.
    """

    def __init__(
        self,
        entropy_coef: float = 0.01,
        target_entropy: float = -1.0,
        auto_tune: bool = True,
    ):
        self.alpha = entropy_coef
        self.target_entropy = target_entropy
        self.auto_tune = auto_tune
        if auto_tune:
            # Learnable temperature (SAC-style)
            self.log_alpha = jnp.log(entropy_coef)

    def policy_entropy(self, logits: jnp.ndarray) -> float:
        """Compute policy entropy H(π) = -Σ π(a)·log π(a)."""
        probs = jax.nn.softmax(logits)
        log_probs = jax.nn.log_softmax(logits)
        return -jnp.sum(probs * log_probs, axis=-1).mean()

    def gaussian_entropy(self, std: jnp.ndarray) -> float:
        """Entropy of a Gaussian policy: H = 0.5 · Σ log(2πeσᵢ²)."""
        return 0.5 * jnp.log(2 * jnp.pi * jnp.e * std**2).sum(axis=-1).mean()

    def entropy_loss(
        self,
        policy_entropy: float,
        update_alpha: bool = True,
    ) -> tuple:
        """Compute the entropy bonus and, optionally, the temperature loss.

        We want H(π) ≥ H_target. The temperature loss for α (dual gradient
        descent, SAC-style) is -log α · (H(π) - H_target).
        """
        entropy_bonus = self.alpha * policy_entropy
        if self.auto_tune and update_alpha:
            alpha_loss = -self.log_alpha * (policy_entropy - self.target_entropy)
            return entropy_bonus, alpha_loss
        return entropy_bonus, 0.0

    def robust_policy_loss(
        self,
        advantages: jnp.ndarray,
        log_probs: jnp.ndarray,
        old_log_probs: jnp.ndarray,
        policy_entropy: float,
        clip_ratio: float = 0.2,
    ) -> float:
        """PPO loss with entropy regularization: L = L_clip - α·H(π).

        High entropy prevents overconfident policies that fail on real hardware.
        """
        # Standard PPO clipped objective
        ratio = jnp.exp(log_probs - old_log_probs)
        clipped = jnp.clip(ratio, 1 - clip_ratio, 1 + clip_ratio)
        policy_loss = -jnp.minimum(ratio * advantages, clipped * advantages).mean()

        # Entropy bonus (negative sign because we minimize the loss)
        entropy_bonus = -self.alpha * policy_entropy
        return policy_loss + entropy_bonus
```
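A hedged sketch exercising the loss on dummy data; the batch shape and the discrete-action setup are assumptions for illustration:

```python
import jax
import jax.numpy as jnp

agent = MaxEntropyPPO(entropy_coef=0.01, target_entropy=-1.0)

logits = jax.random.normal(jax.random.PRNGKey(0), (32, 6))  # 32 states, 6 actions
H = agent.policy_entropy(logits)

adv = jnp.ones(32)
logp = jax.nn.log_softmax(logits)[:, 0]              # log π(a=0|s), toy choice
loss = agent.robust_policy_loss(adv, logp, logp, H)  # ratio = 1 on the first update
```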
3. Information-Theoretic Bridging
Minimize information gap between sim and real:
```python
import jax
import jax.numpy as jnp


class InformationTheoreticBridge:
    """Bridge sim and real via information-theoretic measures.

    Key insight: we can't match physics exactly, but we can match the
    *information content* of observations.
    """

    def mutual_information(
        self,
        sim_obs: jnp.ndarray,
        real_obs: jnp.ndarray,
    ) -> float:
        """Estimate I(sim; real) - how much sim tells us about real.

        High MI = sim is predictive of real (good!)
        Low MI = sim and real are independent (bad!)
        """
        # Gaussian estimator (a MINE estimator is the nonparametric alternative)
        joint_cov = jnp.cov(sim_obs.T, real_obs.T)
        n = sim_obs.shape[1]
        cov_sim = joint_cov[:n, :n]
        cov_real = joint_cov[n:, n:]

        # MI = 0.5 * log(|Σ_sim|·|Σ_real| / |Σ_joint|)
        mi = 0.5 * (
            jnp.linalg.slogdet(cov_sim)[1]
            + jnp.linalg.slogdet(cov_real)[1]
            - jnp.linalg.slogdet(joint_cov)[1]
        )
        return mi

    def domain_divergence(
        self,
        sim_obs: jnp.ndarray,
        real_obs: jnp.ndarray,
        method: str = "wasserstein",
    ) -> float:
        """Measure divergence between sim and real distributions.

        Lower divergence = better sim2real transfer.
        """
        if method == "kl":
            # KL(real || sim) - how surprised is sim by real?
            raise NotImplementedError("KL requires density estimation")
        elif method == "wasserstein":
            # W_2 distance (optimal transport)
            mu_sim = jnp.mean(sim_obs, axis=0)
            mu_real = jnp.mean(real_obs, axis=0)
            cov_sim = jnp.cov(sim_obs.T)
            cov_real = jnp.cov(real_obs.T)

            # Exact: W_2² = ||μ_sim - μ_real||² + Tr(Σ_sim + Σ_real - 2(Σ_sim^½ Σ_real Σ_sim^½)^½)
            mean_diff = jnp.sum((mu_sim - mu_real) ** 2)
            # Simplified: Frobenius norm of the covariance difference
            cov_diff = jnp.sum((cov_sim - cov_real) ** 2)
            return jnp.sqrt(mean_diff + cov_diff)
        elif method == "mmd":
            # Maximum Mean Discrepancy with an RBF kernel
            def rbf_kernel(x, y, sigma=1.0):
                return jnp.exp(-jnp.sum((x - y) ** 2) / (2 * sigma**2))

            # MMD² = E[k(x,x')] + E[k(y,y')] - 2·E[k(x,y)]
            xx = jnp.mean(jax.vmap(lambda x: jax.vmap(lambda x2: rbf_kernel(x, x2))(sim_obs))(sim_obs))
            yy = jnp.mean(jax.vmap(lambda y: jax.vmap(lambda y2: rbf_kernel(y, y2))(real_obs))(real_obs))
            xy = jnp.mean(jax.vmap(lambda x: jax.vmap(lambda y: rbf_kernel(x, y))(real_obs))(sim_obs))
            return xx + yy - 2 * xy
        raise ValueError(f"unknown method: {method}")

    def entropy_matching_loss(
        self,
        sim_obs: jnp.ndarray,
        real_obs: jnp.ndarray,
    ) -> float:
        """Match entropy profiles between sim and real.

        If H(sim) >> H(real): sim too noisy, reduce randomization.
        If H(sim) << H(real): sim too deterministic, increase randomization.
        """
        def estimate_entropy(obs):
            # Estimate via the covariance determinant (Gaussian assumption)
            cov = jnp.cov(obs.T)
            return 0.5 * jnp.linalg.slogdet(cov)[1]

        H_sim = estimate_entropy(sim_obs)
        H_real = estimate_entropy(real_obs)
        return (H_sim - H_real) ** 2
```
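A hedged sketch on synthetic stand-ins: broad sim observations against a narrower "real" distribution:

```python
import jax
import jax.numpy as jnp

bridge = InformationTheoreticBridge()

sim_obs = jax.random.normal(jax.random.PRNGKey(0), (256, 8))         # broad
real_obs = 0.5 * jax.random.normal(jax.random.PRNGKey(1), (256, 8))  # narrower

mmd = bridge.domain_divergence(sim_obs, real_obs, method="mmd")
gap = bridge.entropy_matching_loss(sim_obs, real_obs)  # > 0: sim noisier than real
```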
The Entropy Bridge Pipeline
```
┌────────────────────────────────────────────────────────────────────┐
│                     ENTROPY-DRIVEN SIM2REAL                        │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  PHASE 1: Maximum Entropy Simulation                               │
│  ───────────────────────────────────                               │
│                                                                    │
│  Domain Params           Policy               Observations         │
│  ┌─────────────┐        ┌─────────────┐      ┌─────────────┐       │
│  │ H(θ) = max  │  ───▶  │ H(π|s) = αT │ ───▶ │ H(o) = high │       │
│  │ friction ∈  │        │ explore all │      │ diverse     │       │
│  │ [0.3, 1.5]  │        │ actions     │      │ experiences │       │
│  │ mass ∈      │        └─────────────┘      └─────────────┘       │
│  │ [0.8, 1.2]  │                                                   │
│  └─────────────┘                                                   │
│                                                                    │
│  PHASE 2: Information Bridge                                       │
│  ───────────────────────────                                       │
│                                                                    │
│  Sim Distribution        Divergence           Real Distribution    │
│  ┌─────────────┐        ┌─────────────┐      ┌─────────────┐       │
│  │ p(o|sim)    │ ──────▶│ W(sim,real) │◀──── │ p(o|real)   │       │
│  │ (broad)     │        │ minimize    │      │ (narrow)    │       │
│  └─────────────┘        └─────────────┘      └─────────────┘       │
│                               │                                    │
│                               │ Adapt randomization                │
│                               │ to match real entropy              │
│                                                                    │
│  PHASE 3: Entropy Collapse at Deployment                           │
│  ────────────────────────────────────────                          │
│                                                                    │
│  Policy trained on       Deployed on          Result               │
│  ┌─────────────┐        ┌─────────────┐      ┌─────────────┐       │
│  │ ALL possible│  ───▶  │ ONE actual  │ ───▶ │ ROBUST to   │       │
│  │ worlds      │        │ world       │      │ any world   │       │
│  │ (superpos.) │        │ (collapsed) │      │ in support  │       │
│  └─────────────┘        └─────────────┘      └─────────────┘       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
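A hedged sketch of the loop the diagram describes, wiring the classes above together; `rollout` is a toy stub, not a real simulator interface:

```python
import jax
import jax.numpy as jnp

def rollout(params, key, n=256, dim=8):
    # Toy stub for a simulator rollout: params only set the observation spread
    spread = jnp.sqrt(params["friction"])
    return spread * jax.random.normal(key, (n, dim))

randomizer = EntropyMaximizingRandomizer({"friction": (0.3, 1.5), "mass": (0.8, 1.2)})
bridge = InformationTheoreticBridge()
real_obs = 0.7 * jax.random.normal(jax.random.PRNGKey(99), (256, 8))  # fake "real" data

for epoch in range(10):
    key = jax.random.fold_in(jax.random.PRNGKey(42), epoch)
    sim_params = randomizer.sample(key)                    # Phase 1: max-entropy draw
    sim_obs = rollout(sim_params, key)
    gap = bridge.entropy_matching_loss(sim_obs, real_obs)  # Phase 2: entropy gap
    # Phase 3 happens at deployment: the trained policy meets one actual world
```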
Integration with K-Scale Stack
```python
from ksim import PPOTask
from ksim.randomizers import (
    StaticFrictionRandomizer,
    MassMultiplicationRandomizer,
    JointDampingRandomizer,
)


class EntropyBridgedKBotTask(PPOTask):
    """K-Bot training with entropy-driven sim2real."""

    # High-entropy domain randomization
    physics_randomizers = [
        StaticFrictionRandomizer(scale=0.5),   # Wide friction range
        MassMultiplicationRandomizer(          # Body mass variation
            body_name="torso",
            scale=0.2,
        ),
        JointDampingRandomizer(scale=0.3),     # Damping variation
        # ... more randomizers for max entropy
    ]

    # Max-entropy RL config
    entropy_coef = 0.02      # High entropy bonus
    target_entropy = -4.0    # Automatic temperature tuning

    def compute_entropy_metrics(self, trajectory):
        """Track entropy throughout training."""
        policy_entropy = self.policy.entropy(trajectory.obs)
        obs_entropy = self.estimate_obs_entropy(trajectory.obs)
        return {
            "policy_entropy": policy_entropy,
            "observation_entropy": obs_entropy,
            "entropy_ratio": policy_entropy / obs_entropy,
        }

    def adapt_randomization(self, real_data):
        """Adapt domain randomization to match the real robot's entropy.

        This is the key insight: we don't try to match exact parameters,
        we match the *entropy profile*.
        """
        sim_obs = self.collect_sim_observations()
        real_obs = real_data.observations

        # Compute the entropy gap
        H_sim = self.estimate_entropy(sim_obs)
        H_real = self.estimate_entropy(real_obs)

        if H_sim > H_real * 1.5:
            # Sim too noisy: reduce randomization
            self.reduce_randomization_scale(0.9)
        elif H_sim < H_real * 0.7:
            # Sim too deterministic: increase randomization
            self.increase_randomization_scale(1.1)

        # Match the distributions via the Wasserstein distance
        W = self.wasserstein_distance(sim_obs, real_obs)
        self.log("wasserstein_distance", W)
```
Why Entropy Works for Sim2Real
1. Coverage Guarantee
If a policy π is optimal for ALL sims in the support of p(sim), and the real world lies in the support of p(sim), then π works in the real world. Key: entropy maximization yields the widest possible support (see the sketch below).
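The precondition can be made operational as a support-membership check; a minimal sketch, assuming identified real-world parameters come from system identification (the helper is hypothetical):

```python
# Hypothetical check: does the identified real world lie in the randomized support?
def in_support(identified_params: dict, param_ranges: dict) -> bool:
    return all(
        low <= identified_params[name] <= high
        for name, (low, high) in param_ranges.items()
    )

in_support({"friction": 0.9}, {"friction": (0.3, 1.5)})  # True: transfer is covered
```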
2. Robustness via Exploration
High H(π|s) → the policy doesn't overfit to a single solution → it maintains multiple viable strategies → it can adapt when reality differs.
3. Information Bottleneck
Sim and real share mutual information I(sim; real). Maximize I and the sim captures what matters about the real world; ignore it and the policy overfits to sim-specific artifacts (see the sketch below).
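A small sketch of the Gaussian MI estimate from the bridge class on correlated synthetic data; the linear sim-real relationship is an assumption for illustration:

```python
import jax
import jax.numpy as jnp

sim = jax.random.normal(jax.random.PRNGKey(0), (512, 4))
# Toy model: real observations are correlated with sim, plus sensor noise
real = 0.8 * sim + 0.2 * jax.random.normal(jax.random.PRNGKey(1), (512, 4))

mi = InformationTheoreticBridge().mutual_information(sim, real)  # high: sim predicts real
```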
GF(3) Triads
entropy-sim2real (-1) ⊗ kos-firmware (+1) ⊗ mujoco-scenes (0) = 0 ✓
entropy-sim2real (-1) ⊗ jaxlife-open-ended (+1) ⊗ wobble-dynamics (0) = 0 ✓
ksim-rl (-1) ⊗ kos-firmware (+1) ⊗ entropy-sim2real (-1) = needs +1
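The triad arithmetic is addition mod 3; a tiny check (the helper name is mine):

```python
# GF(3) triad balance: the trits must sum to 0 (mod 3)
def triad_balanced(*trits: int) -> bool:
    return sum(trits) % 3 == 0

assert triad_balanced(-1, +1, 0)       # entropy-sim2real ⊗ kos-firmware ⊗ mujoco-scenes
assert not triad_balanced(-1, +1, -1)  # ksim-rl ⊗ kos-firmware ⊗ entropy-sim2real
```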
Related Skills
ksim-rl (-1): Base RL training
kos-firmware (+1): Deployment target
ergodicity (0): Ergodic theory foundations
birkhoff-average (-1): Time averages
fokker-planck-analyzer (-1): Distribution dynamics
References
```bibtex
@inproceedings{haarnoja2018sac,
  title     = {Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
  author    = {Haarnoja, Tuomas and others},
  booktitle = {ICML},
  year      = {2018}
}

@inproceedings{tobin2017domain,
  title     = {Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World},
  author    = {Tobin, Josh and others},
  booktitle = {IROS},
  year      = {2017}
}

@article{zhao2020sim,
  title   = {Sim-to-Real Transfer in Deep Reinforcement Learning},
  author  = {Zhao, Wenshuai and others},
  journal = {IEEE TNNLS},
  year    = {2020}
}
```