SimVLA: A Simple VLA Baseline for Robotic Manipulation

Yuankai Luo, Woping Chen, Tong Liang, Baiqiao Wang, Zhenguo Li

A streamlined Vision-Language-Action (VLA) baseline for robotic manipulation, designed for transparency and reproducibility.

Overview

A concise summary of the motivation, contributions, and key takeaways.

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has rapidly evolved with additional spatial priors and diverse architectural innovations. However, these advancements are often accompanied by varying training recipes and implementation details, which can make it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control—using a standard vision-language backbone and a lightweight action head—and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pretraining. SimVLA also achieves real-robot performance on par with π0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations.

Main contributions

Modular Design

We propose SimVLA, a modular VLA baseline that decouples perception from control, enabling a flexible and future-proof design that can easily adapt to new vision-language backbones.

Standardized Recipe

We identify and standardize the “silent” drivers of VLA performance—specifically data shuffling, normalization, and optimization dynamics—providing a rigorous training recipe that enables fair cross-model comparisons.

SOTA Performance

We show that this minimal design achieves state-of-the-art performance, surpassing larger and more complex models on simulation benchmarks while enabling efficient real-robot transfer with zero-shot scene generalization.

Model architecture

A minimal “VLM encoder + lightweight action head” design, with perception and control strictly decoupled.

Design principle

SimVLA keeps the architecture intentionally minimal to serve as a clean baseline. The vision-language backbone is executed once per control step, and the action head handles the denoising iterations efficiently.

Implementation notes

  • Perception and control are decoupled: a standard VLM encoder produces fused vision-language tokens.
  • A lightweight action transformer performs flow-matching denoising to generate continuous action chunks.
  • The modular design makes it straightforward to swap backbones while keeping the action head comparable.
SimVLA overview. SimVLA is a minimal baseline: a VLM encoder produces fused vision-language tokens once per control step, and a lightweight action transformer performs flow-matching denoising to generate a continuous action chunk.
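To make the flow-matching inference loop concrete, here is a minimal sketch of how an action chunk could be sampled by Euler-integrating a learned velocity field from Gaussian noise toward data. The function name `sample_action_chunk`, the step count, and the toy velocity field are all illustrative assumptions, not SimVLA's actual implementation; in the real model, `velocity_fn` would be the action transformer conditioned on the VLM tokens.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Euler-integrate a learned velocity field from noise (t=0) to data (t=1),
    the basic sampling loop of a flow-matching action head (illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)              # Euler step along the flow
    return x

# Toy "trained" velocity field: a straight-line flow toward a fixed target
# chunk, standing in for the conditioned action transformer.
target = np.full((8, 7), 0.5)
velocity = lambda x, t: (target - x) / max(1.0 - t, 1e-3)
chunk = sample_action_chunk(velocity, horizon=8, action_dim=7)
```

Because the backbone runs only once per control step, only this lightweight loop repeats across the denoising iterations.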

Training and inference recipe

Standardizing critical training dynamics for fair comparison.

A central takeaway of this work is that strong VLA performance can often be achieved through careful standardization of training and inference details, even with minimal architectural design. In practice, we find that several seemingly minor choices can dominate performance differences if left under-specified. Accordingly, we explicitly control and report the following factors across all experiments.

Action Representation and Normalization

We train the flow model in a normalized continuous action space, using per-dimension statistics computed from the training set. Proprioceptive states are normalized when applicable to improve optimizer conditioning. We predict action chunks of horizon H and execute them in a receding-horizon manner; we emphasize that the choice of H is a major performance knob and must be tuned per benchmark.
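The two ingredients above — per-dimension z-score normalization and receding-horizon chunk execution — can be sketched as follows. The class and function names, the 1e-8 guard, and the choice of executing `k` of `H` actions before re-planning are illustrative assumptions; the paper only specifies that statistics come from the training set and that `H` must be tuned per benchmark.

```python
import numpy as np

class ActionNormalizer:
    """Per-dimension z-score statistics computed once from the training set."""
    def __init__(self, actions):                  # actions: (N, action_dim)
        self.mean = actions.mean(axis=0)
        self.std = actions.std(axis=0) + 1e-8     # guard against constant dims
    def normalize(self, a):
        return (a - self.mean) / self.std
    def denormalize(self, a):
        return a * self.std + self.mean

def receding_horizon(policy, env_step, obs, H=8, k=4, total_steps=16):
    """Predict a chunk of H actions, execute the first k, then re-plan."""
    executed = []
    while len(executed) < total_steps:
        chunk = policy(obs, H)        # (H, action_dim), in normalized space
        for a in chunk[:k]:
            obs = env_step(a)         # step the robot / simulator
            executed.append(a)
    return executed[:total_steps]

norm = ActionNormalizer(np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]))
roundtrip = norm.denormalize(norm.normalize(np.array([1.0, 3.0])))
```

Denormalization with the same statistics recovers the raw action scale before commands are sent to the robot.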

Data Handling

Beyond action chunking, we carefully control data shuffling during training. Since demonstration trajectories exhibit strong temporal correlations, improper shuffling can lead to brittle optimization and poor long-horizon generalization. We find that consistent shuffling is critical for stable training and fair benchmarking.
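One way to realize the consistent shuffling described above — a sketch under our own assumptions, not the paper's exact dataloader — is to flatten all (trajectory, timestep) pairs and shuffle them globally with a fixed seed, so every mini-batch mixes timesteps from many demonstrations instead of drawing temporally adjacent transitions from one trajectory:

```python
import random

def shuffled_transition_indices(traj_lengths, seed=0):
    """Globally shuffle (trajectory, timestep) pairs with a fixed seed so
    batches decorrelate timesteps across demonstrations (illustrative)."""
    indices = [(i, t) for i, n in enumerate(traj_lengths) for t in range(n)]
    random.Random(seed).shuffle(indices)
    return indices

# Three demos of lengths 5, 3, and 4 yield 12 globally shuffled transitions.
idx = shuffled_transition_indices([5, 3, 4])
```

Fixing the seed also makes runs reproducible, which matters for the fair cross-model comparisons the recipe targets.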

Optimization Dynamics

We systematically sweep learning rates, warm-up schedules, and learning rate schedulers while keeping batch size and total training steps fixed across comparisons. Notably, we observe that learning rate selection alone can overshadow architectural differences if not properly tuned, underscoring the importance of reporting optimization details for reproducibility.
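A typical schedule of this kind — linear warm-up followed by cosine decay, with a smaller learning rate for the VLM backbone (the multiplier ablated later) — might look like the sketch below. The specific values (1e-4 base rate, 1,000 warm-up steps, 0.1 multiplier) are illustrative assumptions, not the paper's reported hyperparameters.

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=1_000, total_steps=100_000,
               vlm_multiplier=0.1):
    """Linear warm-up then cosine decay; the VLM backbone trains with a
    smaller learning rate to preserve its pretrained features (illustrative)."""
    if step < warmup_steps:
        lr = base_lr * step / warmup_steps
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        lr = 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
    return {"action_head": lr, "vlm_backbone": lr * vlm_multiplier}
```

Keeping batch size and total steps fixed while sweeping only these schedule parameters isolates the effect of optimization dynamics from architecture.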

Architecture Configuration

While SimVLA employs a minimal action head by default, we ablate action transformer scale, VLM backbone choice, and information injection mechanisms (token concatenation, cross-attention, and conditional normalization). We view these variations as implementation choices rather than architectural novelties, and we report them to contextualize performance differences.
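Two of the injection mechanisms compared above can be sketched in a few lines; these are simplified stand-ins (plain NumPy, no learned projections) meant only to show the structural difference between concatenating tokens into one sequence and modulating normalized activations AdaLN-style:

```python
import numpy as np

def concat_condition(vlm_tokens, action_tokens):
    """Token concatenation: the action transformer attends over a single
    joint sequence of vision-language and action tokens."""
    return np.concatenate([vlm_tokens, action_tokens], axis=0)

def adaln_condition(action_tokens, scale, shift, eps=1e-6):
    """Conditional (adaptive) layer norm: normalize each token, then apply a
    scale and shift that would be regressed from the conditioning signal."""
    mu = action_tokens.mean(axis=-1, keepdims=True)
    sigma = action_tokens.std(axis=-1, keepdims=True)
    return (action_tokens - mu) / (sigma + eps) * (1.0 + scale) + shift
```

Cross-attention, the third variant, would instead keep the two token sets separate and let action queries attend to the VLM tokens as keys and values.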

Evaluation

Experimental results organized to match the paper (simulation → robot benchmarks → real robot).

Simulation results

LIBERO benchmark performance and LIBERO-PRO robustness.

Comparison on the LIBERO benchmark

Success rate (%) on the official test episodes for each suite (Spatial/Object/Goal/Long) and the overall average.


Robustness on the LIBERO-PRO benchmark

Robustness under perturbations across five dimensions: Original (Ori), Object (Obj), Position (Pos), Semantic (Sem), and Task.


Robot benchmark results

Results on WidowX and Google Robot tasks.

Comparison on WidowX robot tasks

Success rates (%) across representative WidowX tasks.


Comparison on Google Robot tasks

Success rates (%) on three tasks and the overall average.


Real-robot results

Zero-shot evaluation on Galaxea R1 Lite, plus qualitative deployment snapshots.

Qualitative examples

Out-of-the-box real-robot task examples. We deploy SimVLA without additional fine-tuning on held-out scenes and evaluate it on multi-stage tasks that require both dexterous manipulation and semantic understanding.

Quantitative results

Real-robot zero-shot results on Galaxea R1 Lite.

Ablation studies

Analysis of key architectural decisions on the LIBERO benchmark.

Ablations on LIBERO


Table 6. Ablations on LIBERO. Each row corresponds to one ablation setting with the remaining knobs fixed to the default configuration.

Key Findings

Table 6 highlights a few dominant knobs that largely determine performance.

Data shuffling and normalization are critical.

Disabling either shuffling or action normalization causes a near-collapse in performance, suggesting that stable optimization and consistent action scaling are prerequisites for a strong baseline.

Optimization dynamics dominate.

The learning rate must be tuned: too large a value (5×10⁻⁴) degrades performance sharply, while too small a value (5×10⁻⁵) also underperforms. Likewise, removing the small VLM learning-rate multiplier hurts substantially, indicating that preserving the pretrained backbone is important.

Some architecture choices matter, but are secondary.

Scaling down the action head (large→small) only slightly reduces performance, whereas alternative conditioning mechanisms (AdaLN / cross-attention) are noticeably worse than simple token concatenation under our setup.

Performance & efficiency

We compare SimVLA against state-of-the-art baselines (OpenVLA, π0) under a strictly matched evaluation setup. SimVLA outperforms larger models while requiring significantly fewer resources.

Representative example comparing LIBERO average success and peak training VRAM under a matched setup.

Citation

BibTeX entry for this work.

BibTeX

@misc{luo2026simvlasimplevlabaseline,
      title={SimVLA: A Simple VLA Baseline for Robotic Manipulation}, 
      author={Yuankai Luo and Woping Chen and Tong Liang and Baiqiao Wang and Zhenguo Li},
      year={2026},
      eprint={2602.18224},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.18224}, 
}