
A Lean Ecosystem for Robot Learning at Scale

Tobias Jülg*1, Pierre Krack*1, Seongjin Bien*1, Yannik Blei1, Khaled Gamal1, Ken Nakahara2, Johannes Hechtl1,3, Roberto Calandra2, Wolfram Burgard1 and Florian Walter1,4

1University of Technology Nuremberg,
2TU Dresden, 3Siemens AG, 4Technical University of Munich
*Equal Contribution

Abstract

Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow, which is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning to and from real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on the benefits of simulated data for robotic foundation models.

RCS in 3 Minutes


Traditional robotics is built around hardware, with many interacting parts and specialized AI modules. With machine learning taking the lead, this relationship flips around: robots are components of a machine learning pipeline.


Many libraries embrace this and adopt a Python- and ML-first approach, but they often lack robust robotics features and hardware support. Robust policies require careful debugging in both simulation and hardware, which relies on classical robotics tools.


RCS bridges this gap by combining an ML-first design with the essential robotics tools. It gives you the means to debug interfaces, validate tasks, and test directly on hardware, while remaining a lightweight pip-installable package with minimal dependencies.

Architecture

C++/Python API   We provide device APIs in C++ with automatically generated Python bindings, so both languages expose the same functionality. A new device can be integrated into RCS in either C++ or Python, which keeps hardware compatibility broad.


Composable scenes   Higher-level abstractions are built on top of our own device APIs. They leverage Gymnasium wrappers to enable modular scene creation through composition.
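
The composition idea can be illustrated with a minimal sketch. It does not use the RCS or Gymnasium APIs: `ArmEnv`, `Wrapper`, `GripperWrapper`, and `CameraWrapper` are hypothetical stand-ins written in plain Python that only mimic the Gymnasium-style `reset`/`step` signatures, so the example runs without dependencies. Each wrapper adds one device to the scene by extending the action and observation dictionaries.

```python
from typing import Any, Dict, Tuple

Obs = Dict[str, Any]
Action = Dict[str, float]


class ArmEnv:
    """Hypothetical base environment: a 1-DoF arm that reports its joint position."""

    def __init__(self) -> None:
        self.joint = 0.0

    def reset(self) -> Tuple[Obs, Dict[str, Any]]:
        self.joint = 0.0
        return {"joint": self.joint}, {}

    def step(self, action: Action):
        self.joint += action["joint_delta"]
        # Gymnasium-style return: obs, reward, terminated, truncated, info
        return {"joint": self.joint}, 0.0, False, False, {}


class Wrapper:
    """Forwards reset/step to the wrapped env; subclasses override the two hooks."""

    def __init__(self, env: Any) -> None:
        self.env = env

    def action(self, action: Action) -> Action:
        return action

    def observation(self, obs: Obs) -> Obs:
        return obs

    def reset(self):
        obs, info = self.env.reset()
        return self.observation(obs), info

    def step(self, action: Action):
        obs, reward, terminated, truncated, info = self.env.step(self.action(action))
        return self.observation(obs), reward, terminated, truncated, info


class GripperWrapper(Wrapper):
    """Adds a gripper: consumes the 'gripper' action key, reports 'gripper_width'."""

    def __init__(self, env: Any) -> None:
        super().__init__(env)
        self.width = 0.08

    def action(self, action: Action) -> Action:
        action = dict(action)  # do not mutate the caller's dictionary
        self.width = action.pop("gripper", self.width)
        return action

    def observation(self, obs: Obs) -> Obs:
        return {**obs, "gripper_width": self.width}


class CameraWrapper(Wrapper):
    """Adds a camera: attaches a placeholder frame under the 'rgb' key."""

    def observation(self, obs: Obs) -> Obs:
        frame = [[0, 0, 0] for _ in range(4)]  # four black placeholder pixels
        return {**obs, "rgb": frame}


# A scene is assembled by stacking wrappers around the base environment.
env = CameraWrapper(GripperWrapper(ArmEnv()))
obs, _ = env.reset()
obs, *_ = env.step({"joint_delta": 0.1, "gripper": 0.02})
print(sorted(obs))  # ['gripper_width', 'joint', 'rgb']
```

Removing a device from the scene then amounts to dropping its wrapper from the stack; the base environment and the other wrappers stay unchanged.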


Layered architecture   Because we build upon a minimal low-level device API, you can quickly get up and running with new hardware: implement our interface and benefit from all the wrappers and apps higher up in the stack.
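
As a rough illustration of this layering, the sketch below defines a small device interface and one higher-level routine that works with any implementation of it. The names `Robot`, `SimRobot`, `LoggingRobot`, and `move_linear` are hypothetical and chosen for this example only; they are not the actual RCS interface.

```python
from abc import ABC, abstractmethod
from typing import List


class Robot(ABC):
    """Hypothetical minimal device interface (not the actual RCS API)."""

    @abstractmethod
    def get_joint_positions(self) -> List[float]:
        ...

    @abstractmethod
    def set_joint_positions(self, q: List[float]) -> None:
        ...


class SimRobot(Robot):
    """Simulated backend: commanded positions take effect immediately."""

    def __init__(self, dof: int) -> None:
        self._q = [0.0] * dof

    def get_joint_positions(self) -> List[float]:
        return list(self._q)

    def set_joint_positions(self, q: List[float]) -> None:
        self._q = list(q)


class LoggingRobot(SimRobot):
    """Stand-in for a hardware driver: records each command it would send."""

    def __init__(self, dof: int) -> None:
        super().__init__(dof)
        self.sent: List[List[float]] = []

    def set_joint_positions(self, q: List[float]) -> None:
        self.sent.append(list(q))  # a real driver would talk to the device here
        super().set_joint_positions(q)


def move_linear(robot: Robot, target: List[float], steps: int = 10) -> None:
    """Higher-layer routine: interpolate linearly in joint space to the target."""
    start = robot.get_joint_positions()
    for i in range(1, steps + 1):
        alpha = i / steps
        robot.set_joint_positions(
            [s + alpha * (t - s) for s, t in zip(start, target)]
        )


# The same higher-level code runs unchanged on both backends.
sim, hw = SimRobot(dof=2), LoggingRobot(dof=2)
for robot in (sim, hw):
    move_linear(robot, [0.5, -0.25], steps=10)
print(sim.get_joint_positions(), len(hw.sent))
```

Only the two methods of the interface have to be written for a new device; everything expressed in terms of `Robot`, such as `move_linear` here, is reused as is.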


RCS Architecture.
Fig. 1: Applications (teleoperation, RL, VLA) interface with the environment (sim or real) through a unified Gymnasium API. Sensors, actuators, and observers wrap the environment, mutating action/observation spaces.

Robot Setups with Digital Twins

We evaluate the usability of RCS's hardware-oriented features by integrating multiple setups with different robots, grippers, cameras and touch sensors. In total, four robots, four end-effectors, two cameras and a tactile sensor are implemented, both in simulation and on physical hardware.


Applications


All implemented robots can be teleoperated with multiple devices and can be used to record data. We also verify that RCS integrates cleanly into ML pipelines, in both imitation learning and reinforcement learning settings. We deploy multiple VLAs and solve a simple simulated pick-up task with PPO, using proprioceptive and RGB states as observations.




Teleoperation & Data Collection

HTC Vive

Meta Quest 3

SpaceMouse

Leader-Follower

Scripted Data Collection



Reinforcement Learning



VLA Inference

Real

Simulation

Results

We demonstrate how RCS supports VLA research by investigating VLA generalization across multiple embodiments and assessing the benefit of simulated data for robotic foundation models.


Fig. 2: We fine-tune Pi Zero on four datasets from different setups. Each dataset contains fewer than 150 episodes. The fine-tuned models are deployed on the corresponding setups. The robots that are more prominent in the base model's data mix achieve better results.
Success rate plot over training checkpoints.
Fig. 3: We investigate the impact of simulated data on VLA performance. Our setup is replicated in simulation and used to generate 500 trajectories with a scripted policy, which are then used to complement our manually collected dataset of 143 trajectories. The plots show the success rate of the policy, both in the simulated scene and on the hardware, as training progresses. Success rates in simulation correlate with success rates on the physical robot, which is what one expects of a good evaluation metric. Adding simulated data to the training mix improves performance in both settings.
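
To make the proportions of this data mix concrete, the following sketch combines the two trajectory counts from the caption and computes the share of simulated data. The page does not state how the two sources are weighted during training, so plain concatenation with uniform sampling is an assumption here, and the `("sim", i)` / `("real", i)` index entries are hypothetical placeholders for actual trajectories.

```python
import random

N_SIM, N_REAL = 500, 143  # trajectory counts taken from the Fig. 3 caption

# Assumption: the training mix is a plain concatenation of both sources.
dataset = [("sim", i) for i in range(N_SIM)] + [("real", i) for i in range(N_REAL)]

sim_share = N_SIM / len(dataset)
print(f"{len(dataset)} trajectories, {sim_share:.1%} simulated")
# prints: 643 trajectories, 77.8% simulated

# Assumption: batches are drawn uniformly, so roughly 78% of each batch is simulated.
rng = random.Random(0)
batch = rng.sample(dataset, k=32)
n_sim_in_batch = sum(1 for source, _ in batch if source == "sim")
```

Under these assumptions, simulated trajectories outnumber real ones by about 3.5 to 1; a different weighting scheme would change this ratio.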

BibTeX

@misc{juelg2025robotcontrolstack,
  title={{Robot Control Stack}: {A} Lean Ecosystem for Robot Learning at Scale}, 
  author={Tobias J{\"u}lg and Pierre Krack and Seongjin Bien and Yannik Blei and Khaled Gamal and Ken Nakahara and Johannes Hechtl and Roberto Calandra and Wolfram Burgard and Florian Walter},
  year={2025},
  howpublished = {\url{https://arxiv.org/abs/2509.14932}}
}