Language to Rewards for Robotic Skill Synthesis
CoRL 2023 (Oral)
Video
Abstract
Large language models (LLMs) have demonstrated exciting progress in acquiring diverse new capabilities through in-context learning, ranging from logical reasoning to code-writing. Robotics researchers have also explored using LLMs to advance the capabilities of robotic control. However, since low-level robot actions are hardware-dependent and underrepresented in LLM training corpora, existing efforts in applying LLMs to robotics have largely treated LLMs as semantic planners or relied on human-engineered control primitives to interface with the robot. On the other hand, reward functions have been shown to be flexible representations that can be optimized to produce control policies for diverse tasks, while their semantic richness makes them suitable to be specified by LLMs. In this work, we introduce a new paradigm that harnesses this realization by utilizing LLMs to define reward parameters that can be optimized to accomplish a variety of robotic tasks. Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections and low-level robot actions. Meanwhile, combining this with a real-time optimizer, MuJoCo MPC, empowers an interactive behavior creation experience where users can immediately observe the results and provide feedback to the system. To systematically evaluate the performance of our proposed method, we designed a total of 17 tasks for a simulated quadruped robot and a dexterous manipulator robot. We demonstrate that our proposed method reliably tackles 90% of the designed tasks, while a baseline using primitive skills as the interface with Code-as-Policies achieves 50% of the tasks. We further validated our method on a real robot arm, where complex manipulation skills such as non-prehensile pushing emerge through our interactive system.
Approach Overview
The recent rapid progress in Large Language Models (LLMs) has inspired notable developments in leveraging LLMs to drive robot behaviors: from step-by-step planning and goal-oriented dialogue to robot-code-writing agents. While these methods impart new modes of compositional generalization, they focus on using language to compose new behaviors from an existing library of control primitives that are either manually engineered or learned a priori. On the other hand, leveraging LLMs to directly modulate low-level robot behavior remains an open problem, because low-level robot actions are hardware-dependent and underrepresented in LLM training corpora.
In this work, we aim to develop an interactive system that leverages the power of LLMs to acquire low-level robotic skills. Our proposed system consists of two key components: i) a Reward Translator, built upon pre-trained Large Language Models (LLMs) [10], that interacts with users, understands their intents, and modulates all reward parameters ψ and weights w; and ii) a Motion Controller, based on MuJoCo MPC, that takes the generated reward and interactively optimizes an action sequence that maximizes it.
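The reward produced by the Reward Translator can be viewed as a weighted combination of parameterized terms, with the LLM choosing the parameters ψ and weights w. The Python sketch below is a minimal illustration of that structure under our own assumptions; it is not MJPC's actual cost API, and names such as `RewardTerm` and `torso_z` are hypothetical.

```python
# A minimal sketch (not MJPC's actual cost API) of the reward structure the
# Reward Translator fills in: residual terms r_i(state; psi_i), each combined
# with a weight w_i. Term names and the "torso_z" key are assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict
import numpy as np

@dataclass
class RewardTerm:
    residual: Callable[[dict, np.ndarray], float]  # r_i(state, psi_i)
    params: np.ndarray                              # psi_i, chosen by the LLM
    weight: float                                   # w_i, chosen by the LLM

@dataclass
class RewardSpec:
    terms: Dict[str, RewardTerm] = field(default_factory=dict)

    def total_cost(self, state: dict) -> float:
        # MJPC minimizes a weighted sum of per-term norms; an absolute value
        # stands in for those norms here.
        return sum(t.weight * abs(t.residual(state, t.params))
                   for t in self.terms.values())

# Hypothetical term: keep the torso at a target height of 0.3 m.
spec = RewardSpec()
spec.terms["torso_height"] = RewardTerm(
    residual=lambda s, p: s["torso_z"] - p[0],
    params=np.array([0.3]),
    weight=1.0,
)
print(spec.total_cost({"torso_z": 0.26}))  # -> approximately 0.04
```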
Reward Translator
We build the Reward Translator on LLMs to map user interactions to reward functions corresponding to the desired robot motion. As reward tuning is highly domain-specific and requires expert knowledge, it is unsurprising that LLMs trained on generic language datasets cannot directly generate rewards for specific hardware. Instead, we explore the in-context learning ability of LLMs to achieve this goal. More concretely, we decompose the problem of language to reward into two stages: motion description and reward coding. During the motion description stage, we design a prompt that instructs an LLM to interpret and expand the user input into a natural language description of the desired robot motion, following a pre-defined template. During the reward coding stage, we use another prompt to instruct the LLM to translate the motion description into reward-specifying code.
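A minimal sketch of this two-stage pipeline is shown below, assuming a generic `llm` callable that maps a prompt string to a completion; the prompt text and the template line are illustrative placeholders rather than the released Motion Descriptor and Reward Coder prompts.

```python
# Two-stage Reward Translator sketch. `llm` is an assumed callable
# (prompt -> completion); the prompts below are illustrative, not the
# released Motion Descriptor / Reward Coder prompts.
MOTION_DESCRIPTOR_PROMPT = """Describe the desired robot motion by filling in
the template below with concrete values.
[start of description]
The torso of the robot should pitch by [value] degrees.
...
[end of description]
User command: {command}"""

REWARD_CODER_PROMPT = """Translate the motion description into
reward-specifying Python code, using only the documented reward helper
functions.
Motion description:
{description}"""

def language_to_reward(command: str, llm) -> str:
    # Stage 1: expand the instruction into a structured motion description.
    description = llm(MOTION_DESCRIPTOR_PROMPT.format(command=command))
    # Stage 2: translate the description into reward code for the controller.
    reward_code = llm(REWARD_CODER_PROMPT.format(description=description))
    return reward_code
```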
Motion Controller
The Motion Controller needs to map the reward function generated by the Reward Translator to low-level robot actions that maximize the accumulated reward. There are a few possible ways to achieve this, including using reinforcement learning (RL), offline trajectory optimization, or, as in this work, receding-horizon trajectory optimization, i.e., model predictive control (MPC). Specifically, we use an open-source implementation based on the MuJoCo simulator, MJPC. At each control step, MJPC plans a sequence of optimized actions and sends it to the robot. The robot applies the action corresponding to its current timestamp, advances to the next step, and sends the updated robot state to the MJPC planner to initiate the next planning cycle. The frequent re-planning in MPC makes it robust to uncertainties in the system and, importantly, enables interactive motion synthesis and correction.
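The receding-horizon loop described above can be sketched as follows; `planner` and `robot` are hypothetical interfaces, and the real MJPC planner runs asynchronously in C++ rather than in a synchronous Python loop.

```python
# Receding-horizon (MPC) control loop, sketched with assumed `planner` and
# `robot` interfaces; MJPC itself runs asynchronously and is written in C++.
def mpc_loop(planner, robot, reward_spec, horizon_s=0.5, n_steps=1000):
    planner.set_cost(reward_spec)  # reward/cost produced by the Reward Translator
    state = robot.get_state()
    for _ in range(n_steps):
        # Plan a short-horizon action sequence from the current state.
        actions = planner.plan(state, horizon_s)
        # Apply only the first action, then re-plan from the updated state.
        robot.apply(actions[0])
        state = robot.get_state()
```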
Results
We evaluate our approach on two simulated robotic systems: a quadruped robot and a dexterous robot manipulator. We design a diverse set of tasks for each robot to demonstrate the capability of our proposed system. Some examples of the resulting robot motions are shown below.
We further compare our method against two baselines: 1) a Reward Coder only baseline, where an LLM directly maps user instructions to reward code without going through the Motion Descriptor, and 2) a Code-as-Policies (CaP) baseline, where the LLM generates a plan for the robot motion using a set of pre-defined robot primitive skills instead of reward functions. For the CaP baseline, we design the primitive skills based on common commands available to the robot. As shown below, our proposed approach achieves a notably higher success rate for 11/17 task categories and comparable performance for the remaining tasks, demonstrating the effectiveness and reliability of the proposed method.
We further showcase two examples where we teach the robot to perform complex tasks through multiple rounds of interaction; an illustrative reward-code sketch follows the examples.
Example 1 (quadruped):
Instruction 1: Make the robot stand upright on two back feet like a human.
Instruction 2: Good, you actually don't need to keep the front paws at certain height, just
leave it to the controller.
Instruction 3: Good, now make the robot do a moonwalk.
Instruction 4: Moon walk means the robot should walk backward while the feet swings as if they
are moving forward. Correct your answer.
Example 2 (dexterous manipulator):
Instruction 1: Open the drawer.
Instruction 2: Good, now put the apple inside the drawer while keep it open.
Instruction 3: Good, now release the apple and move hand away.
Instruction 4: Now close the drawer.
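For illustration, the snippet below sketches the kind of reward-specifying code the Reward Coder might emit for the first instruction of the second example. The `set_*` helpers are hypothetical stand-ins for the reward API exposed to the LLM, not the exact functions used in our prompts; small stubs are included so the snippet runs standalone.

```python
# Hypothetical reward-API stubs (in the real system, the helpers are provided
# to the LLM-generated code by our framework; these just record the calls).
calls = []
def reset_reward(): calls.clear()
def set_l2_distance_reward(a, b, weight=1.0): calls.append(("l2", a, b, weight))
def set_joint_target_reward(joint, target, weight=1.0): calls.append(("joint", joint, target, weight))
def execute_plan(): print(calls)

# Illustrative Reward Coder output for "Open the drawer".
reset_reward()
# Bring the palm close to the drawer handle.
set_l2_distance_reward("palm", "drawer_handle", weight=5.0)
# Drive the drawer's prismatic joint toward its open position.
set_joint_target_reward("drawer_joint", target=0.3, weight=10.0)
execute_plan()
```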
Validation on real hardware
We implement a version of our method on a mobile manipulator. We detect objects in image space using an open-vocabulary detector, F-VLM. We extract the associated points from the point cloud behind the detection mask and perform outlier rejection on points that might belong to the background. From a bird's-eye view, we fit a minimum-volume rectangle, and we take the extreme points to determine the extent along the z-axis. We demonstrate sim-to-real transfer on two tasks: object pushing and object grasping. Our system is able to generate relevant reward code, and MuJoCo MPC is able to synthesize the pushing and grasping motions.
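The sketch below outlines this perception step under our own assumptions: `mask_points` is an (N, 3) array of 3D points behind the detection mask, already expressed in a frame with z up; the outlier-rejection rule and function names are illustrative, and the F-VLM detection itself is not shown.

```python
# Object-extent estimation sketch: reject background outliers, fit a
# minimum-area rectangle in the bird's-eye (x, y) plane, and take the
# extreme z values for the vertical extent. Assumes `mask_points` is an
# (N, 3) float array in a frame with z up.
import numpy as np
import cv2

def estimate_object_box(mask_points: np.ndarray) -> dict:
    # Simple z-score outlier rejection (the exact rule we use may differ).
    mu, sigma = mask_points.mean(0), mask_points.std(0) + 1e-6
    pts = mask_points[(np.abs((mask_points - mu) / sigma) < 2.5).all(axis=1)]

    # Bird's-eye view: minimum-area rectangle over the (x, y) footprint.
    xy = pts[:, :2].astype(np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(xy)

    # Vertical extent from the extreme z values.
    z_min, z_max = pts[:, 2].min(), pts[:, 2].max()
    return dict(center=(cx, cy, (z_min + z_max) / 2),
                size=(w, h, z_max - z_min),
                yaw_deg=angle)
```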
Prompts
Quadruped: Motion Descriptor | Reward Coder
Dexterous Manipulator: Motion Descriptor | Reward Coder
Real Manipulator: Motion Descriptor | Reward Coder
Citation
Acknowledgements
The authors would like to acknowledge Ken Caluwaerts, Kristian Hartikainen, Steven Bohez,
Carolina Parada, Marc Toussaint, and the greater teams at Google DeepMind for their feedback and
contributions.
The website template was borrowed from Jon Barron.