AI-research-SKILLs evaluating-cosmos-policy

Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations with EGL rendering, or profiling inference latency on cluster or local GPU machines.

install

source · Clone the upstream repo

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/Orchestra-Research/AI-Research-SKILLs "$T" && mkdir -p ~/.claude/skills && cp -r "$T/18-multimodal/cosmos-policy" ~/.claude/skills/zechenzhangagi-ai-research-skills-evaluating-cosmos-policy && rm -rf "$T"

manifest: 18-multimodal/cosmos-policy/SKILL.md

Cosmos Policy Evaluation

Evaluation workflows for NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments, using the public `cosmos-policy` repository. Covers blank-machine setup, headless GPU evaluation, and inference profiling.

Quick start

Run a minimal LIBERO evaluation using the official public eval module:

uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False
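The same long command recurs in the workflows that follow with only a few flags changed. A minimal sketch of one way to keep the flag set in a single place is shown here. The helper name `build_eval_command` and the abbreviated `smoke_flags` dictionary are illustrative assumptions, not part of the cosmos-policy repository; the module path and flag names are copied from the command above.

```python
import shlex


def build_eval_command(module, flags, group, extra="cu128", python="3.10"):
    """Return the argv list for a `uv run ... python -m <module>` eval call."""
    argv = ["uv", "run", "--extra", extra, "--group", group,
            "--python", python, "python", "-m", module]
    for name, value in flags.items():
        # Booleans become "True"/"False", matching the flag style used above.
        argv += [f"--{name}", str(value)]
    return argv


LIBERO_MODULE = "cosmos_policy.experiments.robot.libero.run_libero_eval"

# Abbreviated on purpose: add the remaining flags from the command above unchanged.
smoke_flags = {
    "config": "cosmos_predict2_2b_480p_libero__inference_only",
    "ckpt_path": "nvidia/Cosmos-Policy-LIBERO-Predict2-2B",
    "task_suite_name": "libero_10",
    "num_trials_per_task": 1,
    "seed": 195,
    "run_id_note": "smoke",
}

cmd = build_eval_command(LIBERO_MODULE, smoke_flags, group="libero")
print(shlex.join(cmd))  # shell-quoted string suitable for logging
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the evaluation; printing it first makes it easy to record the exact command next to the results.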

Core concepts

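What Cosmos Policy is: NVIDIA Cosmos Policy is a robot manipulation policy built on the Cosmos-Predict2 2B video model (see the `cosmos_predict2_2b_*` configs and `Cosmos-Policy-*-Predict2-2B` checkpoints used in this document). It predicts chunks of robot actions conditioned on camera images, proprioception, and a language instruction, and it can also predict future states and values.

Key architecture choices, as reflected in the evaluation flags:

Component	Design
Visual input	Camera images, including a wrist view (`--use_wrist_image`)
Language conditioning	Precomputed T5 text embeddings (`--t5_text_embeddings_path`)
Action prediction	Action chunks (`--chunk_size`) produced by iterative denoising (`--num_denoising_steps_action`)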

Public command surface: The supported evaluation entrypoints are `cosmos_policy.experiments.robot.libero.run_libero_eval` and `cosmos_policy.experiments.robot.robocasa.run_robocasa_eval`. Keep reproduction notes anchored to these public modules and their documented flags.

Compute requirements

Task	GPU	VRAM	Typical wall time
LIBERO smoke eval (1 trial)	1x A40/A100	~16 GB	5-10 min
LIBERO full eval (50 trials)	1x A40/A100	~16 GB	2-4 hours
RoboCasa single-task (2 trials)	1x A40/A100	~18 GB	10-15 min
RoboCasa all-tasks	1x A40/A100	~18 GB	4-8 hours

When to use vs alternatives

Use this skill when:

Evaluating NVIDIA Cosmos Policy on LIBERO or RoboCasa benchmarks
Profiling inference latency and throughput for Cosmos Policy
Setting up headless EGL rendering for robot simulation on GPU clusters

Use alternatives when:

Training Cosmos Policy from scratch or fine-tuning it (use the official Cosmos training docs)
Working with OpenVLA-based policies (use `fine-tuning-openvla-oft`)
Working with Physical Intelligence pi0 models (use `fine-tuning-serving-openpi`)
Running real-robot evaluation rather than simulation

Workflow 1: LIBERO evaluation

Copy this checklist and track progress:

LIBERO Eval Progress:
- [ ] Step 1: Install environment and dependencies
- [ ] Step 2: Configure headless EGL rendering
- [ ] Step 3: Run smoke evaluation
- [ ] Step 4: Validate outputs and parse results
- [ ] Step 5: Run full benchmark if smoke passes

Step 1: Install environment

git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md to build and enter the supported Docker container.
# Then, inside the container:
uv sync --extra cu128 --group libero --python 3.10

Step 2: Configure headless rendering

export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
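When the evaluation is launched from a Python driver rather than a shell, the four exports above can be applied programmatically. The sketch below assumes a single GPU per process and assumes the variables are read when the rendering libraries are imported, so it sets them before any simulator import. `headless_egl_env` is a hypothetical helper name.

```python
import os


def headless_egl_env(gpu_id=0):
    """Build the four rendering variables for one GPU index."""
    gpu = str(gpu_id)
    return {
        "CUDA_VISIBLE_DEVICES": gpu,   # GPU visible to CUDA
        "MUJOCO_EGL_DEVICE_ID": gpu,   # GPU used by MuJoCo's EGL renderer
        "MUJOCO_GL": "egl",            # headless OpenGL backend for MuJoCo
        "PYOPENGL_PLATFORM": "egl",    # headless platform for PyOpenGL
    }


# Apply before importing any simulator or OpenGL package.
os.environ.update(headless_egl_env(0))
```

Deriving both device variables from one argument keeps them from drifting apart when the GPU index changes.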

Step 3: Run smoke evaluation

uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Step 4: Validate and parse results

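import glob
import json
import os

# Find the most recently modified JSON result under the official log directory.
log_dir = "cosmos_policy/experiments/robot/libero/logs/"
log_files = glob.glob(os.path.join(log_dir, "**", "*.json"), recursive=True)
if not log_files:
    raise FileNotFoundError(f"No JSON results found under {log_dir}")

latest = max(log_files, key=os.path.getmtime)  # newest by modification time
with open(latest) as f:
    results = json.load(f)

print(latest)
print(json.dumps(results, indent=2))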

Step 5: Scale up

Run across all four LIBERO task suites with 50 trials:

for suite in libero_spatial libero_object libero_goal libero_10; do
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
      --config cosmos_predict2_2b_480p_libero__inference_only \
      --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 16 \
      --num_open_loop_steps 16 \
      --task_suite_name "$suite" \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note "suite_${suite}" \
      --ar_future_prediction False \
      --ar_value_prediction False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
done
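Before launching the full loop, it helps to know how many episodes it will run. The arithmetic below is a rough planning sketch; it assumes each of the four suites contains 10 tasks, which is worth confirming against the installed LIBERO version.

```python
SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]
TASKS_PER_SUITE = 10   # assumption: 10 tasks in each suite
TRIALS_PER_TASK = 50   # matches --num_trials_per_task in the loop above

episodes_per_suite = TASKS_PER_SUITE * TRIALS_PER_TASK
total_episodes = episodes_per_suite * len(SUITES)
print(f"{episodes_per_suite} episodes per suite, {total_episodes} in total")
```

Under these assumptions, one failed episode moves a suite's success rate by 0.2 percentage points, which is useful context when comparing against published numbers.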

Workflow 2: RoboCasa evaluation

Copy this checklist and track progress:

RoboCasa Eval Progress:
- [ ] Step 1: Install RoboCasa assets and verify macros
- [ ] Step 2: Run single-task smoke evaluation
- [ ] Step 3: Validate outputs
- [ ] Step 4: Expand to multi-task runs

Step 1: Install RoboCasa

git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
python -m robocasa.scripts.setup_macros
python -m robocasa.scripts.download_kitchen_assets

This fork installs the `robocasa` Python package expected by Cosmos Policy while preserving the patched environment changes used in the public RoboCasa eval path. Verify that `macros_private.py` exists and that its paths are correct.
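The existence check for `macros_private.py` can be scripted. This sketch assumes the file sits next to the package's `__init__.py`, which is where the setup script is expected to write it. `find_macros_private` is a hypothetical helper, and it locates the package without importing it.

```python
import importlib.util
from pathlib import Path


def find_macros_private(package="robocasa"):
    """Return the path to <package>/macros_private.py, or None if missing."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package is not installed in this environment
    candidate = Path(spec.origin).parent / "macros_private.py"
    return candidate if candidate.is_file() else None


path = find_macros_private()
print(path if path else "macros_private.py not found; rerun setup_macros")
```

If the file is found, open it and confirm that any dataset or asset paths it defines point at real directories on the current machine.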

Step 2: Single-task smoke evaluation

uv run --extra cu128 --group robocasa --python 3.10 \
  python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
    --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
    --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --num_wrist_images 1 \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 32 \
    --num_open_loop_steps 16 \
    --task_name TurnOffMicrowave \
    --obj_instance_split A \
    --num_trials_per_task 2 \
    --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --use_variance_scale False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Step 3: Validate outputs

Confirm the eval log prints the expected task name, object split, and checkpoint/config values.
Inspect the final `Success rate:` line in the log.
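Reading the final `Success rate:` line can be automated for batch runs. The exact log format is not specified here, so this sketch assumes the label is followed by a number with an optional `%` suffix; adjust the pattern to the real log output. The sample text is made up for illustration.

```python
import re

# Assumed line shape: "Success rate: <number>" with an optional "%" suffix.
SUCCESS_RE = re.compile(r"Success rate:\s*([0-9]*\.?[0-9]+)\s*(%?)")


def final_success_rate(log_text):
    """Return the last reported success rate as a fraction, or None."""
    matches = SUCCESS_RE.findall(log_text)
    if not matches:
        return None
    value, percent = matches[-1]  # the last match is the final report
    rate = float(value)
    # Treat explicit percentages, or values above 1, as percent.
    if percent or rate > 1.0:
        rate /= 100.0
    return rate


# Illustrative log text, not real output.
sample = "Task: TurnOffMicrowave\nSuccess rate: 0.0\nSuccess rate: 0.5\n"
print(final_success_rate(sample))
```

Returning `None` when no line matches lets a wrapper script distinguish a crashed run from a run that finished with zero successes.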

Step 4: Expand scope

Increase `--num_trials_per_task` or add more tasks. Keep `--obj_instance_split` fixed across repeated runs for comparability.

Workflow 3: Blank-machine cluster launch

Cluster Launch Progress:
- [ ] Step 1: Clone the public repo and enter the supported runtime
- [ ] Step 2: Sync the benchmark-specific dependency group
- [ ] Step 3: Export rendering and cache environment variables before eval

Step 1: Clone and enter the supported runtime

git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md, start the Docker container, and enter it before continuing.

Step 2: Sync dependencies

uv sync --extra cu128 --group libero --python 3.10
# or, for RoboCasa:
uv sync --extra cu128 --group robocasa --python 3.10
# then install the Cosmos-compatible RoboCasa fork:
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy

Step 3: Export runtime environment

export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
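export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
# Legacy variable; point it at the hub cache under HF_HOME so both resolve to the same directory.
export TRANSFORMERS_CACHE="${TRANSFORMERS_CACHE:-$HF_HOME/hub}"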

Expected performance benchmarks

Reference values from official evaluation (tied to specific setup and seeds):

Task Suite	Success Rate	Notes
LIBERO-Spatial	98.1%	Official LIBERO spatial result
LIBERO-Object	100.0%	Official LIBERO object result
LIBERO-Goal	98.2%	Official LIBERO goal result
LIBERO-Long	97.6%	Official LIBERO long-horizon result (`libero_10` suite)
LIBERO-Average	98.5%	Official average across LIBERO suites
RoboCasa	67.1%	Official RoboCasa average result

Reproduction note: Published success rates still depend on checkpoint choice, task suite, seeds, and simulator setup. Record the exact command and environment alongside any reported number.
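When comparing a local run against these reference values, a confidence interval shows how much of a gap could be sampling noise. The sketch below uses the Wilson score interval and assumes episodes are independent trials with a common success probability, which is only an approximation across heterogeneous tasks. The counts in the example are invented for illustration and are not reported results.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: 488 successes in 500 episodes.
low, high = wilson_interval(488, 500)
print(f"{488 / 500:.1%} success, ~95% interval [{low:.1%}, {high:.1%}]")
```

A smoke run with one or two trials per task yields a very wide interval, so treat it as a pipeline check rather than a performance estimate.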

Non-negotiable rules

EGL alignment: Always set `CUDA_VISIBLE_DEVICES`, `MUJOCO_EGL_DEVICE_ID`, `MUJOCO_GL=egl`, and `PYOPENGL_PLATFORM=egl` together on headless GPU nodes.

Official runtime first: If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from `SETUP.md` before debugging package internals.
Cache consistency: Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
Run comparability: Keep task name, object split, seed, and trial count fixed across repeated runs.
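One way to enforce the comparability rule is to store a small manifest with every run. The sketch below records the command, seed, trial count, and the rendering and cache variables named in this document. `build_manifest`, `TRACKED_ENV`, and the manifest layout are illustrative assumptions; the eval scripts do not write such a file themselves.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone

# Environment variables named in this document that affect reproducibility.
TRACKED_ENV = ["CUDA_VISIBLE_DEVICES", "MUJOCO_EGL_DEVICE_ID", "MUJOCO_GL",
               "PYOPENGL_PLATFORM", "HF_HOME"]


def build_manifest(argv, seed, num_trials, note=""):
    """Collect the command and environment needed to repeat a run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "argv": list(argv),
        "seed": seed,
        "num_trials_per_task": num_trials,
        "note": note,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Unset variables are stored as null so gaps are visible.
        "env": {name: os.environ.get(name) for name in TRACKED_ENV},
    }


manifest = build_manifest(
    ["python", "-m", "cosmos_policy.experiments.robot.libero.run_libero_eval"],
    seed=195, num_trials=1, note="smoke",
)
print(json.dumps(manifest, indent=2))
```

Saving this output next to the eval logs makes it straightforward to diff two runs and spot a changed seed, split, or device setting.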

Common issues

Issue: binary compatibility or loader failures on host Python

Fix: rerun inside the official container/runtime from `SETUP.md`. Do not assume host-package rebuilds will match the public release environment.

Issue: LIBERO prompts for config path in a non-interactive shell

Fix: pre-create `LIBERO_CONFIG_PATH/config.yaml` (the directory defaults to `~/.libero`):

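import os
import yaml

# Honor LIBERO_CONFIG_PATH if set; otherwise fall back to ~/.libero.
config_dir = os.environ.get("LIBERO_CONFIG_PATH", os.path.expanduser("~/.libero"))
os.makedirs(config_dir, exist_ok=True)

# Replace the placeholder path with the real location on your machine.
with open(os.path.join(config_dir, "config.yaml"), "w") as f:
    yaml.safe_dump({"benchmark_root": "/path/to/libero/datasets"}, f)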

Issue: EGL initialization or shutdown noise

Fix: align EGL environment variables first. Treat teardown-only `EGL_NOT_INITIALIZED` warnings as low-signal unless the job exits non-zero.

Issue: Kitchen object sampling NaNs or asset lookup failures in RoboCasa

Fix: rerun asset setup and confirm the patched robocasa install is intact:

python -m robocasa.scripts.download_kitchen_assets
python -c "import robocasa; print(robocasa.__file__)"

Issue: MuJoCo rendering mismatch

Fix: verify GPU device alignment:

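import os

# Both variables must be set and name the same single GPU index.
cuda_dev = os.environ.get("CUDA_VISIBLE_DEVICES")
egl_dev = os.environ.get("MUJOCO_EGL_DEVICE_ID")
assert cuda_dev is not None and egl_dev is not None, (
    f"Unset variable: CUDA={cuda_dev}, EGL={egl_dev}"
)
assert cuda_dev == egl_dev, f"GPU mismatch: CUDA={cuda_dev}, EGL={egl_dev}"
print(f"Rendering on GPU {cuda_dev}")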

Advanced topics

LIBERO command matrix: See references/libero-commands.md
RoboCasa command matrix: See references/robocasa-commands.md

Resources

Cosmos Policy repository: https://github.com/NVlabs/cosmos-policy
LIBERO benchmark: https://github.com/Lifelong-Robot-Learning/LIBERO
Cosmos-compatible RoboCasa fork: https://github.com/moojink/robocasa-cosmos-policy
Upstream RoboCasa project: https://github.com/robocasa/robocasa
MuJoCo documentation: https://mujoco.readthedocs.io/