Skills · robotics-vla

Install

Source · clone the upstream repo:

    git clone https://github.com/openclaw/skills

Claude Code · install into ~/.claude/skills/:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/arden2010/robotics-vla" ~/.claude/skills/clawdbot-skills-robotics-vla && rm -rf "$T"

Manifest: skills/arden2010/robotics-vla/SKILL.md

Robotics VLA Skill

Expert guidance for building generalist robot policies using Vision-Language-Action (VLA) flow models, based on the π0 architecture.

Core Architecture

π0 model = VLM backbone + action expert + flow matching

Component        Detail
VLM backbone     PaliGemma (3B) — provides visual + language understanding
Action expert    Separate transformer weights (~300M) for robot state + actions
Total params     ~3.3B
Action output    Chunks of H=50 actions; 50 Hz or 20 Hz robots
Inference speed  ~73 ms on RTX 4090

See references/architecture.md for full technical details (attention masks, flow matching math, MoE design).
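The timing numbers above imply a comfortable inference budget. A quick sanity check in plain Python, using only the values from the table:

```python
# Back-of-the-envelope check on the table's numbers (H=50 actions per chunk,
# 50 Hz control, ~73 ms inference on an RTX 4090): one predicted chunk covers
# a full second of motion, so inference fits easily inside a chunk.

H = 50                 # actions per predicted chunk
control_hz = 50        # robot control frequency
inference_s = 0.073    # approximate inference latency

chunk_duration_s = H / control_hz                   # 1.0 s of motion per chunk
latency_fraction = inference_s / chunk_duration_s   # share of a chunk spent planning

assert inference_s < chunk_duration_s
print(f"chunk covers {chunk_duration_s:.1f}s; inference uses {latency_fraction:.1%} of it")
```

At 20 Hz the same chunk stretches to 2.5 s, so the margin only grows.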

Training Pipeline

Two-phase approach (mirrors LLM training):

  1. Pre-training → broad physical capabilities + recovery behaviors across many tasks/robots
  2. Fine-tuning → fluent, task-specific execution on target task

Key rule: combining both phases outperforms either alone. Pre-training gives robustness; fine-tuning gives precision.
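The two-phase recipe can be sketched as a single training driver. `train_step` and the batch sources here are hypothetical stand-ins, not the π0 training code; the real mixtures and losses live in references/training.md:

```python
def two_phase_train(policy, train_step, pretrain_batches, finetune_batches):
    """Run the two-phase recipe: broad pre-training, then task fine-tuning.

    `train_step` is a hypothetical stand-in for one gradient update on a batch.
    """
    for batch in pretrain_batches:       # phase 1: robustness from broad data
        policy = train_step(policy, batch)
    for batch in finetune_batches:       # phase 2: precision on the target task
        policy = train_step(policy, batch)
    return policy
```

The point of keeping both loops in one driver is the "key rule" above: skipping either phase loses either robustness or precision.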

See references/training.md for data mixture ratios, loss functions, and fine-tuning dataset sizing.

Action Representation

Use flow matching, not autoregressive discretization.

  • Flow matching models continuous action distributions → essential for high-frequency dexterous control
  • Autoregressive token prediction (e.g. RT-2 style) cannot produce action chunks efficiently
  • Action chunks allow open-loop execution at 50Hz without temporal ensembling
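The flow-matching idea can be sketched in a few lines of NumPy. The straight-line interpolation and constant-velocity target below are the textbook conditional flow-matching recipe, a simplification rather than the exact π0 objective (see references/architecture.md for the real math):

```python
import numpy as np

# Conditional flow matching over an (H, action_dim) chunk: train a network to
# predict the velocity that carries noise to real actions, then integrate it.

H, ACTION_DIM = 50, 17

def flow_matching_pair(actions, noise, tau):
    """Point on the noise->actions path at time tau, plus the velocity target."""
    x_tau = tau * actions + (1.0 - tau) * noise  # linear interpolant
    v_target = actions - noise                   # constant velocity of the line
    return x_tau, v_target

def sample_chunk(velocity_fn, steps=10, rng=None):
    """Euler-integrate a (learned) velocity field from Gaussian noise to a chunk."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.normal(size=(H, ACTION_DIM))
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

Because the whole chunk is denoised jointly, the model captures a continuous joint distribution over 50 timesteps — exactly what autoregressive per-token decoding struggles to do at control rate.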

Multi-Embodiment Support

A single model handles 7+ robot configurations via:

  • Zero-padding smaller action spaces to match the largest (17-dim)
  • Shared VLM backbone; embodiment-specific behavior learned via data
  • Weighted task sampling: n^0.43 to handle imbalanced data across robot types
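Both cross-embodiment tricks above are simple to state in code. A minimal sketch, where MAX_ACTION_DIM=17 is the largest action space mentioned and the example dataset counts are made up:

```python
import numpy as np

MAX_ACTION_DIM = 17  # largest action space across the supported robots

def pad_action_chunk(actions):
    """Zero-pad an (H, d) chunk from a smaller robot into the shared 17-dim space."""
    H, d = actions.shape
    padded = np.zeros((H, MAX_ACTION_DIM))
    padded[:, :d] = actions
    return padded

def task_sampling_weights(counts, alpha=0.43):
    """Weight each robot/task by n^alpha so data-rich platforms don't dominate."""
    w = np.asarray(counts, dtype=float) ** alpha
    return w / w.sum()
```

With alpha=0.43, a dataset 100x larger than another is sampled only about 7x more often, which is the point of the sub-linear exponent.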

See references/embodiments.md for robot platform specs and action space details.

High-Level Policy Integration

For long-horizon tasks, use a two-tier approach:

  • High-level VLM: decomposes task ("bus the table") → subtasks ("pick up napkin")
  • Low-level π0: executes each subtask as a language-conditioned action sequence

Analogous to SayCan. Intermediate language commands significantly boost performance vs. flat task descriptions.
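The two-tier loop reduces to a short control structure. Here `plan_subtasks` stands in for the high-level VLM and `execute_subtask` for the low-level π0 policy; both names are hypothetical, not from the π0 codebase:

```python
# Toy version of the two-tier loop: decompose a task into language subtasks,
# then hand each one to the language-conditioned low-level policy.

def run_long_horizon_task(task, plan_subtasks, execute_subtask, max_subtasks=20):
    """Return True if every planned subtask executed successfully."""
    for subtask in plan_subtasks(task)[:max_subtasks]:
        if not execute_subtask(subtask):  # e.g. "pick up napkin"
            return False                  # surface failure back to the planner
    return True
```

In a real system the planner would replan on failure rather than just stopping; the sketch only shows where the intermediate language commands flow.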

Related & Complementary Research (2025)

π0 has been extended and complemented by several key works. See references/related-work.md for the full landscape, including:

  • π0-FAST / π0.5 / π0.6 — direct successors with faster training, open-world generalization, and RL fine-tuning
  • RTC — async action chunking to eliminate inference pauses (plug-in, no retraining)
  • UniVLA — unsupervised action extraction from raw video (no action labels needed)
  • ManiFlow / Streaming Flow — smoother action generation
  • GR00T N1, Helix, OpenVLA-OFT, DiVLA, RDT-1B — parallel approaches from NVIDIA, Figure AI, and academia

Evaluation Checklist

When evaluating a robot manipulation policy:

  • Out-of-box generalization (no fine-tuning) vs. baselines
  • Language following accuracy with flat / human-guided / high-level commands
  • Fine-tuning efficiency (success rate vs. hours of data)
  • Complex multi-stage tasks (5–20 min, recovery from failure)
  • Compare against OpenVLA, Octo, ACT, and Diffusion Policy as baselines
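The checklist boils down to success rates per (policy, condition) cell so a candidate can be lined up against the baselines. A small tallying helper; the episode record format is a hypothetical example:

```python
from collections import defaultdict

def success_rates(episodes):
    """Compute success rate per (policy, condition) pair.

    episodes: iterable of (policy_name, condition, success_bool) records,
    e.g. ("pi0", "zero-shot", True).
    """
    tally = defaultdict(lambda: [0, 0])       # (successes, attempts)
    for policy, condition, success in episodes:
        cell = tally[(policy, condition)]
        cell[0] += int(success)
        cell[1] += 1
    return {key: wins / n for key, (wins, n) in tally.items()}
```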