Robot Control Stack

Abstract

Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on the benefits of simulated data for robotic foundation models.

RCS in 3 Minutes

Traditional robotics is built around hardware, with many interacting parts and specialized AI modules. With machine learning taking the lead, this relationship flips around: robots are components of a machine learning pipeline.

Many libraries embrace this and adopt a Python- and ML-first approach, but they often lack robust robotics features and hardware support. Robust policies require careful debugging in both simulation and hardware, which relies on classical robotics tools.

RCS bridges this gap by combining an ML-first design with the essential robotics tools. It gives you the means to debug interfaces, validate tasks, and test directly on hardware—while remaining a lightweight pip-installable package with minimal dependencies.

Architecture

C++/Python API: We provide device APIs in C++ with automatically generated Python bindings, so the same functionality is available in both languages. A new device can be integrated into RCS in either C++ or Python, which keeps hardware compatibility broad.

Composable scenes: Higher-level abstractions are built on top of our own device APIs. They use Gymnasium wrappers so that scenes can be assembled modularly through composition.

Layered architecture: Because we build upon a minimal low-level device API, you can quickly get up and running with new hardware: implement our interface and benefit from all the wrappers and apps higher up in the stack.
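The layering idea can be illustrated with a minimal sketch. Every name below (`Robot`, `JointState`, `SimulatedRobot`, `RecordingRobot`, `move_joints_by`) is hypothetical and chosen for illustration only; the actual RCS device API may look quite different. The point is just that code higher in the stack is written against one small interface, so a simulated and a physical backend become interchangeable.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class JointState:
    """Joint positions in radians (hypothetical container, not an RCS type)."""
    positions: list[float]


class Robot(ABC):
    """Hypothetical low-level device interface; an assumption, not the RCS API."""

    @abstractmethod
    def get_joint_state(self) -> JointState: ...

    @abstractmethod
    def set_joint_target(self, target: list[float]) -> None: ...


class SimulatedRobot(Robot):
    """Stand-in for a simulator backend: the target is reached instantly."""

    def __init__(self, dof: int) -> None:
        self._positions = [0.0] * dof

    def get_joint_state(self) -> JointState:
        return JointState(list(self._positions))

    def set_joint_target(self, target: list[float]) -> None:
        if len(target) != len(self._positions):
            raise ValueError("target length must match the number of joints")
        self._positions = list(target)


class RecordingRobot(SimulatedRobot):
    """Stand-in for a hardware driver: also logs each command it would send."""

    def __init__(self, dof: int) -> None:
        super().__init__(dof)
        self.sent_commands: list[list[float]] = []

    def set_joint_target(self, target: list[float]) -> None:
        super().set_joint_target(target)
        self.sent_commands.append(list(target))


def move_joints_by(robot: Robot, delta: list[float]) -> JointState:
    """Higher-layer helper written only against the Robot interface."""
    current = robot.get_joint_state().positions
    robot.set_joint_target([q + d for q, d in zip(current, delta)])
    return robot.get_joint_state()
```

Calling `move_joints_by(SimulatedRobot(dof=3), [0.1, 0.0, -0.2])` and `move_joints_by(RecordingRobot(dof=3), [0.1, 0.0, -0.2])` runs the same higher-level code on both backends, which is the property the layered design is meant to provide.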

Fig. 1: Applications (teleoperation, RL, VLA) interface with the environment (sim or real) through a unified Gymnasium API. Sensors, actuators, and observers wrap the environment, mutating action/observation spaces.
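The wrapper composition described in Fig. 1 might look roughly like the following. This is a dependency-free sketch that only mimics the shape of the Gymnasium `reset`/`step` interface rather than importing Gymnasium, and all class names (`BaseArmEnv`, `GripperWrapper`, `CameraWrapper`, `make_env`) are hypothetical, not RCS classes. Each wrapper adds its own keys to the observation, and the actuator wrapper also consumes its own key from the action.

```python
from __future__ import annotations

from typing import Any


class BaseArmEnv:
    """Toy environment: a 1-D arm position; the action is a position delta."""

    def __init__(self) -> None:
        self.position = 0.0

    def reset(self) -> tuple[dict[str, Any], dict[str, Any]]:
        self.position = 0.0
        return {"joint": self.position}, {}

    def step(self, action: dict[str, Any]):
        self.position += action["delta"]
        # Same 5-tuple layout as Gymnasium: obs, reward, terminated, truncated, info
        return {"joint": self.position}, 0.0, False, False, {}


class Wrapper:
    """Minimal wrapper base that forwards calls to the wrapped environment."""

    def __init__(self, env) -> None:
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class GripperWrapper(Wrapper):
    """Actuator wrapper: consumes a 'gripper' action key and reports its width."""

    def __init__(self, env) -> None:
        super().__init__(env)
        self.width = 1.0  # 1.0 means fully open in this toy model

    def reset(self):
        self.width = 1.0
        obs, info = self.env.reset()
        return {**obs, "gripper": self.width}, info

    def step(self, action):
        action = dict(action)  # copy so the caller's dict is not modified
        self.width = action.pop("gripper", self.width)
        obs, reward, terminated, truncated, info = self.env.step(action)
        return {**obs, "gripper": self.width}, reward, terminated, truncated, info


class CameraWrapper(Wrapper):
    """Sensor wrapper: adds a placeholder image under the given key."""

    def __init__(self, env, name: str) -> None:
        super().__init__(env)
        self.name = name

    def _frame(self) -> list[list[int]]:
        return [[0, 0], [0, 0]]  # placeholder 2x2 image, not a real camera frame

    def reset(self):
        obs, info = self.env.reset()
        return {**obs, self.name: self._frame()}, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return {**obs, self.name: self._frame()}, reward, terminated, truncated, info


def make_env():
    """Compose a scene: arm, then gripper, then wrist camera."""
    return CameraWrapper(GripperWrapper(BaseArmEnv()), name="wrist")
```

Swapping `BaseArmEnv` for a hardware-backed environment would leave `make_env` and the wrappers unchanged, which is how a unified interface can support both simulation and real robots.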

Robot Setups with Digital Twins

We evaluate the usability of RCS's hardware-oriented features by integrating multiple setups with different robots, grippers, cameras, and touch sensors. In total, four robots, four end-effectors, two cameras, and a tactile sensor are implemented, both in simulation and on physical hardware.

FR3 + Franka Hand; wrist & side cameras.

xArm7 + Tilburg Hand; side camera.

UR5e + Robotiq 2F-85; wrist & overhead cameras.

SO101 + built-in gripper; wrist & side cameras.

Applications

All implemented robots can be teleoperated with multiple devices and used to record data. We also verify that RCS integrates cleanly into ML pipelines in both imitation learning and reinforcement learning settings. We deploy multiple VLAs and solve a simple simulated pick-up task with PPO, using proprioceptive and RGB states as observations.

Teleoperation & Data Collection

HTC Vive

Meta Quest 3

SpaceMouse

Leader-Follower

Scripted Data Collection

Reinforcement Learning

VLA Inference

Real

Simulation

Results

We demonstrate how RCS supports VLA research by investigating VLA generalization across multiple embodiments and assessing the benefit of simulated data for robotic foundation models.

Fig. 2: We fine-tune Pi Zero on four datasets from different setups. Each dataset contains fewer than 150 episodes. The fine-tuned models are deployed on the corresponding setups. The robots that are more prominent in the base model's data mix achieve better results.

Success rate plot over training checkpoints.

Fig. 3: We investigate the impact of simulated data on VLA performance. Our setup is replicated in simulation and used to generate 500 trajectories with a scripted policy, which then complement our manually collected dataset of 143 trajectories. The plots show the success rate of the policy, both in the simulated scene and on the hardware, as training progresses. Success rates in simulation correlate with success rates on the physical robot, which is what one expects from a good evaluation metric. Adding simulated data to the training mix improves performance in both settings.
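To give a feel for the data mix in Fig. 3, the sketch below combines the two trajectory counts from the caption (500 simulated, 143 real) into one pool. The page does not say how the two sources are weighted during training, so plain concatenation with uniform sampling is an assumption here, and the function names are hypothetical.

```python
import random

NUM_SIM = 500   # scripted simulated trajectories (from the caption)
NUM_REAL = 143  # manually collected real trajectories (from the caption)


def sim_fraction(num_sim: int, num_real: int) -> float:
    """Share of simulated trajectories in the combined dataset."""
    return num_sim / (num_sim + num_real)


def build_mix(num_sim: int, num_real: int) -> list:
    """Concatenate trajectory IDs tagged by source.

    Assumption: no re-weighting, so uniform sampling over this list
    reproduces the natural sim/real proportion.
    """
    sim = [("sim", i) for i in range(num_sim)]
    real = [("real", i) for i in range(num_real)]
    return sim + real


def sample_batch(mix: list, batch_size: int, rng: random.Random) -> list:
    """Draw trajectory IDs uniformly with replacement from the combined pool."""
    return [rng.choice(mix) for _ in range(batch_size)]


if __name__ == "__main__":
    mix = build_mix(NUM_SIM, NUM_REAL)
    print(f"total trajectories: {len(mix)}")
    print(f"simulated share: {sim_fraction(NUM_SIM, NUM_REAL):.1%}")
```

Under this assumption the pool holds 643 trajectories, of which roughly 78% are simulated, so a uniformly sampled batch is dominated by simulated data unless the real trajectories are up-weighted.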

BibTeX

@misc{juelg2025robotcontrolstack,
  title={{Robot Control Stack}: {A} Lean Ecosystem for Robot Learning at Scale}, 
  author={Tobias J{\"u}lg and Pierre Krack and Seongjin Bien and Yannik Blei and Khaled Gamal and Ken Nakahara and Johannes Hechtl and Roberto Calandra and Wolfram Burgard and Florian Walter},
  year={2025},
  howpublished = {\url{https://arxiv.org/abs/2509.14932}}
}
