LIBERO-Safety benchmark probes physical and semantic risk in vision-language-action models

A new arXiv preprint introduces LIBERO-Safety, a benchmark for physical and semantic safety in vision-language-action models. The paper uses a keypose-driven data pipeline, reports 19,664 collision-free demonstrations, and says tests across 10 embodied models expose a tension: broader training diversity improves trajectory safety but does not guarantee task success.

LIBERO-Safety is a new benchmark aimed at one of the hardest open questions in embodied AI: whether vision-language-action models can act safely, not just effectively.

The preprint appeared on arXiv on June 22, 2026. It presents a parametric safety benchmark that procedurally generates safety-critical scenarios with stochasticity, giving researchers a way to test manipulation policies under conditions that are meant to surface both physical and semantic failures.
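To make "parametric" and "stochastic" concrete, here is a minimal sketch of seeded scenario sampling. It is not the authors' generator: the Scenario fields, the 2D workspace, the radius range and the min_gap parameter are all assumptions chosen for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario: one circular obstacle and one target on a table."""
    obstacle_xy: tuple
    obstacle_radius: float
    target_xy: tuple


def sample_scenario(rng, workspace=1.0, min_gap=0.05):
    """Draw one scenario, resampling until the target is clear of the obstacle."""
    while True:
        radius = rng.uniform(0.03, 0.10)
        obstacle = (rng.uniform(0.0, workspace), rng.uniform(0.0, workspace))
        target = (rng.uniform(0.0, workspace), rng.uniform(0.0, workspace))
        gap = math.hypot(target[0] - obstacle[0], target[1] - obstacle[1])
        if gap >= radius + min_gap:
            return Scenario(obstacle, radius, target)


def generate_scenarios(seed, count):
    """Same seed, same scenarios: randomness stays reproducible across runs."""
    rng = random.Random(seed)
    return [sample_scenario(rng) for _ in range(count)]
```

Seeding a private random.Random instance, rather than the global generator, is one common way to keep randomized evaluation scenes repeatable between runs and between labs.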

The work extends the earlier LIBERO benchmark family, which was built for lifelong robot learning and knowledge transfer, toward a more direct safety agenda. In the authors' framing, the field has made progress on capability, but evaluation has not kept pace with the consequences of unsafe action in real-world settings.

What the benchmark adds

At the center of LIBERO-Safety is a keypose-driven data generation pipeline designed to scale beyond human teleoperation. The authors say that approach helps them generate safety-relevant data more efficiently than manual demonstration alone, while preserving control over the scenarios being tested.
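The article does not describe the pipeline's internals, so the following is only a generic illustration of what "keypose-driven" generation can mean: a handful of key positions is expanded into a dense trajectory. Linear interpolation, the interpolate_keyposes name and the steps_per_segment parameter are assumptions, not the paper's method.

```python
def interpolate_keyposes(keyposes, steps_per_segment=10):
    """Expand a short list of key positions into a dense waypoint trajectory.

    Each keypose is a tuple of coordinates, for example (x, y, z).
    Consecutive keyposes are joined by straight-line interpolation.
    """
    if len(keyposes) < 2:
        raise ValueError("need at least two keyposes")
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be at least 1")

    trajectory = [tuple(keyposes[0])]
    for start, end in zip(keyposes, keyposes[1:]):
        for i in range(1, steps_per_segment + 1):
            t = i / steps_per_segment  # fraction of the way from start to end
            trajectory.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return trajectory
```

For example, three keyposes with steps_per_segment=4 yield nine waypoints: the first keypose plus four points per segment, with each segment ending exactly on its keypose.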

The resulting dataset includes 19,664 strictly collision-free demonstrations, together with extensive domain randomization. That combination is intended to test whether models can generalize without leaning on brittle scene assumptions or narrow training exposure.
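As a rough illustration of what a "strictly collision-free" filter involves, the sketch below rejects any 2D trajectory that touches a circular obstacle. The geometry, the function names and the clearance parameter are assumptions for illustration; the paper's actual collision check is not described in this article.

```python
import math


def point_segment_distance(point, seg_start, seg_end):
    """Shortest distance from a point to the line segment seg_start-seg_end."""
    px, py = point
    ax, ay = seg_start
    bx, by = seg_end
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_collision_free(trajectory, obstacles, clearance=0.0):
    """True if no segment of the path touches any (center, radius) obstacle."""
    for seg_start, seg_end in zip(trajectory, trajectory[1:]):
        for center, radius in obstacles:
            if point_segment_distance(center, seg_start, seg_end) <= radius + clearance:
                return False
    return True


def keep_collision_free(demonstrations, obstacles):
    """Drop every demonstration that touches an obstacle."""
    return [demo for demo in demonstrations if is_collision_free(demo, obstacles)]
```

Checking whole segments rather than individual waypoints matters here: a path can have every waypoint outside an obstacle and still cut straight through it between two samples.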

The authors evaluate 10 embodied systems on the benchmark: eight vision-language-action models and two embodied foundation models. According to the paper, the evaluation is broad enough to show that safety performance cannot be reduced to a single model architecture or training recipe.

What the results suggest

The authors report a generalization-safety tension. In their results, higher-diversity training appears to improve trajectory safety, but that gain does not translate cleanly into fully reliable task execution.

They point to two limiting factors: sub-optimal trajectory synthesis and semantic misalignment. In other words, a model can become safer in the sense of avoiding bad motion while still failing to complete the task well, or misunderstanding the intent behind the instruction.
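The gap between moving safely and finishing the task is easier to see when the two are scored separately. The sketch below does that for a batch of episodes; the Episode fields and metric names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-episode record from an evaluation run."""
    completed_task: bool
    had_collision: bool


def summarize(episodes):
    """Report safety and task success as separate rates, plus their overlap."""
    total = len(episodes)
    if total == 0:
        raise ValueError("no episodes to summarize")
    safe = sum(not e.had_collision for e in episodes)
    succeeded = sum(e.completed_task for e in episodes)
    safe_and_succeeded = sum(
        e.completed_task and not e.had_collision for e in episodes
    )
    return {
        "safe_rate": safe / total,
        "success_rate": succeeded / total,
        "safe_success_rate": safe_and_succeeded / total,
    }
```

With four episodes, of which three avoid collisions, two complete the task and only one does both, the rates come out at 0.75, 0.5 and 0.25. A policy can look safe or look capable on one number while the combined rate stays low.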

That distinction matters for labs building manipulation policies and embodied foundation models. A system that scores well on task completion alone may still be unsuitable for deployment if it cannot consistently manage physical risk or align actions with intent.

The paper also underscores why benchmark design matters in this area. Safety evaluation for embodied agents is harder than ordinary classification testing because the outputs are actions in the physical world, where a mistake can have immediate consequences.

Why it matters now

LIBERO-Safety is a research-stage preprint, not a peer-reviewed release, so its longer-term impact will depend on whether the authors publish code, data or a project page, and on whether other groups can reproduce the reported pattern.

Even so, the benchmark adds a concrete testbed to a part of AI where safety claims are often difficult to verify. It gives the field a way to ask whether stronger generalization actually makes manipulation policies safer, or only more capable in narrow settings.

The broader context is that LIBERO has already been used as a benchmark family for robot learning research, and this new work pushes that lineage toward safety rather than only transfer and task performance. The authors' results suggest those goals are related, but not interchangeable.

The next milestones to watch are a code release, a data release, a project page and any move toward a peer-reviewed venue. Independent replication will be the clearest test of whether the tension between safety and task success holds beyond the original paper.

For now, LIBERO-Safety is a useful signal for the embodied AI field. It argues that better training diversity can help policies behave more safely, but it does not automatically solve the harder problem of reliable, semantically aligned action.
