Language to Rewards for Robotic Skill Synthesis


  • Wenhao Yu
  • Nimrod Gileadi
  • Chuyuan Fu
  • Sean Kirmani
  • Kuang-Huei Lee

  • Montse Gonzalez Arenas
  • Hao-Tien Lewis Chiang
  • Tom Erez
  • Leonard Hasenclever

  • Jan Humplik
  • Brian Ichter
  • Ted Xiao
  • Peng Xu
  • Andy Zeng
  • Tingnan Zhang

  • Nicolas Heess
  • Dorsa Sadigh
  • Jie Tan
  • Yuval Tassa
  • Fei Xia




Video

Abstract

Large language models (LLMs) have demonstrated exciting progress in acquiring diverse new capabilities through in-context learning, ranging from logical reasoning to code-writing. Robotics researchers have also explored using LLMs to advance the capabilities of robotic control. However, since low-level robot actions are hardware-dependent and underrepresented in LLM training corpora, existing efforts in applying LLMs to robotics have largely treated LLMs as semantic planners or relied on human-engineered control primitives to interface with the robot. On the other hand, reward functions have been shown to be flexible representations that can be optimized for control policies to achieve diverse tasks, while their semantic richness makes them suitable to be specified by LLMs. In this work, we introduce a new paradigm that harnesses this realization by utilizing LLMs to define reward parameters that can be optimized to accomplish a variety of robotic tasks. Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections and low-level robot actions. Meanwhile, combining this with a real-time optimizer, MuJoCo MPC, empowers an interactive behavior creation experience where users can immediately observe the results and provide feedback to the system. To systematically evaluate the performance of our proposed method, we design a total of 17 tasks for a simulated quadruped robot and a dexterous manipulator robot. We demonstrate that our proposed method reliably tackles 90% of the designed tasks, while a baseline using primitive skills as the interface (Code-as-Policies) achieves 50% of the tasks. We further validate our method on a real robot arm, where complex manipulation skills such as non-prehensile pushing emerge through our interactive system.


Approach Overview

The recent rapid progress in Large Language Models (LLMs) has inspired notable developments in leveraging LLMs to drive robot behaviors: from step-by-step planning and goal-oriented dialogue to robot-code-writing agents. While these methods impart new modes of compositional generalization, they focus on using language to compose new behaviors from an existing library of control primitives that are either manually engineered or learned a priori. In contrast, leveraging LLMs to directly modulate low-level robot behavior remains an open problem, because low-level robot actions are hardware-dependent and underrepresented in LLM training corpora.
In this work, we aim to develop an interactive system that leverages the power of LLMs to acquire low-level robotic skills. Our proposed system consists of two key components: i) a Reward Translator, built upon pre-trained Large Language Models (LLMs) [10], that interacts with the user, understands user intents, and modulates the reward parameters ψ and weights w, and ii) a Motion Controller, based on MuJoCo MPC, that takes the generated reward and interactively optimizes the action sequence.

Reward Translator

We build the Reward Translator based on LLMs to map user interactions to reward functions corresponding to the desired robot motion. As reward tuning is highly domain-specific and requires expert knowledge, it is unsurprising that LLMs trained on generic language datasets cannot directly generate reward functions for specific robot hardware. Instead, we explore the in-context learning ability of LLMs to achieve this goal.
More concretely, we decompose the language-to-reward problem into two stages: motion description and reward coding. During the motion description stage, we design a prompt that instructs the LLM to interpret and expand the user input into a natural language description of the desired robot motion, following a pre-defined template. During the reward coding stage, we use another prompt to instruct the LLM to translate the motion description into reward-specifying code.
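
To make the two-stage structure concrete, below is a minimal sketch of how the stages could be chained. The `complete` callable stands in for any instruction-following LLM, and the prompt strings are illustrative placeholders rather than the actual prompts used in our system (the real prompts are linked in the Prompts section).

# Minimal sketch of the two-stage Reward Translator. `complete(prompt)` is a
# placeholder for a call to an instruction-following LLM; the prompt strings
# below are illustrative, not the actual prompts used in this work.

MOTION_DESCRIPTOR_PROMPT = (
    "Describe the desired robot motion for the instruction below, "
    "following the pre-defined motion template.\n"
    "Instruction: {instruction}\n"
    "Motion description:"
)

REWARD_CODER_PROMPT = (
    "Translate the motion description below into reward-specifying code "
    "using the available reward-setting APIs.\n"
    "Motion description: {description}\n"
    "Reward code:"
)

def translate_instruction_to_reward(instruction, complete):
    # Stage 1: expand the user input into a structured motion description.
    description = complete(MOTION_DESCRIPTOR_PROMPT.format(instruction=instruction))
    # Stage 2: translate the motion description into reward-specifying code.
    return complete(REWARD_CODER_PROMPT.format(description=description))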

Motion Controller

The Motion Controller needs to map the reward function generated by the Reward Translator to low-level robot actions that maximize the accumulated reward. There are a few possible ways to achieve this, including reinforcement learning (RL), offline trajectory optimization, or, as in this work, receding-horizon trajectory optimization, i.e., model predictive control (MPC). Specifically, we use an open-source implementation based on the MuJoCo simulator, MJPC. At each control step, MJPC plans a sequence of optimized actions and sends them to the robot. The robot applies the action corresponding to its current timestep, advances to the next step, and sends the updated robot state to the MJPC planner to initiate the next planning cycle. The frequent re-planning in MPC makes it robust to uncertainties in the system and, importantly, enables interactive motion synthesis and correction.
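
The control flow can be summarized by the schematic loop below. The `planner` and `robot` objects are hypothetical stand-ins (the real MJPC interface differs); the sketch only illustrates the receding-horizon structure described above.

# Schematic receding-horizon (MPC) control loop. `planner` and `robot` are
# hypothetical interfaces, not the actual MJPC API; only the control flow
# mirrors the description above.

def mpc_control_loop(planner, robot, reward_params, num_steps):
    state = robot.get_state()
    for _ in range(num_steps):
        # Plan a sequence of actions that maximizes the accumulated reward
        # defined by the Reward Translator, starting from the current state.
        actions = planner.plan(state, reward_params)
        # Apply only the first action, i.e. the one for the current timestep.
        robot.apply_action(actions[0])
        # Read the updated state and re-plan; frequent re-planning provides
        # robustness to disturbances and enables interactive corrections.
        state = robot.get_state()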

Results

We evaluate our approach on two simulated robotic systems: a quadruped robot, and a dexterous robot manipulator. We design a diverse set of tasks for each robot to demonstrate the capability of our proposed system. Some examples of the resulting robot motions can be found below.

User instruction: Flip the bowl.
import numpy as np
reset_reward()  # This is a new task so reset reward; otherwise we don't need it
set_l2_distance_reward('palm', 'bowl')
set_obj_orientation_reward('bowl', np.deg2rad(180))
execute_plan(2)

User instruction: Turn on the faucet.
import numpy as np
reset_reward()  # This is a new task so reset reward; otherwise we don't need it
set_l2_distance_reward('palm', 'faucet_handle')
set_joint_fraction_reward('faucet', 1)  # Open the faucet
execute_plan(4)

User instruction: Make the box upright.
import numpy as np
reset_reward()  # This is a new task so reset reward; otherwise we don't need it
set_l2_distance_reward('palm', 'box')
set_obj_orientation_reward('box', np.deg2rad(90))
execute_plan()

Prompt text in gray.
L2R response shown within code block.

User instruction: It's late in the afternoon, make the robot head towards the sunset.
reset_reward()
set_torso_targets(0.3, np.deg2rad(0), np.deg2rad(0), None, (0.5, 0.0), np.pi, None)

set_feet_stepping_parameters('front_left', 2.0, 0.5, 0.0, 0.1, 0.1, True)
set_feet_stepping_parameters('back_left', 2.0, 0.5, 0.5, 0.1, 0.1, True)
set_feet_stepping_parameters('front_right', 2.0, 0.5, 0.5, 0.1, 0.1, True)
set_feet_stepping_parameters('back_right', 2.0, 0.5, 0.0, 0.1, 0.1, True)

execute_plan()

User instruction: Stand up on two back feet like a human.
import numpy as np

reset_reward()
set_torso_targets(0.65, np.deg2rad(90), np.deg2rad(0), (0.0, 0.0), None, 0, None)

set_feet_pos_parameters('front_left', 0.65, None, None)
set_feet_pos_parameters('back_left', 0.0, None, None)
set_feet_pos_parameters('front_right', 0.65, None, None)
set_feet_pos_parameters('back_right', 0.0, None, None)

execute_plan()

User instruction: Sit like a dog.
import numpy as np

reset_reward()
set_torso_targets(0.15, np.deg2rad(45), np.deg2rad(0), (0.0, 0.0), None, 0, None)

set_feet_pos_parameters('front_left', 0.0, None, None)
set_feet_pos_parameters('back_left', 0.0, None, None)
set_feet_pos_parameters('front_right', 0.0, None, None)
set_feet_pos_parameters('back_right', 0.0, None, None)

execute_plan()
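
For readers trying to parse the generated code above, here is a rough, hypothetical sketch of what the reward-setting helpers could look like: each call registers a weighted residual term, and execute_plan() hands the accumulated terms to the Motion Controller. The function names match the examples, but the bodies are illustrative assumptions, not our actual implementation.

# Hypothetical sketch of the reward-setting interface used in the examples
# above. Each set_*_reward() call registers a weighted residual term, and
# execute_plan() hands the accumulated terms to the Motion Controller.
# The bodies below are illustrative assumptions, not the real implementation.

_reward_terms = []

def reset_reward():
    # Start a new task with an empty set of reward terms.
    _reward_terms.clear()

def set_l2_distance_reward(obj_a, obj_b, weight=1.0):
    # Penalize the L2 distance between two named bodies (e.g. palm and bowl).
    _reward_terms.append(('l2_distance', (obj_a, obj_b), weight))

def set_obj_orientation_reward(obj, target_rad, weight=1.0):
    # Penalize deviation of the object's orientation from the target angle.
    _reward_terms.append(('orientation', (obj, target_rad), weight))

def execute_plan(duration=2.0):
    # Hand the accumulated reward terms to the Motion Controller, which
    # optimizes robot actions for `duration` seconds (placeholder print here).
    print(f'Optimizing {len(_reward_terms)} reward terms for {duration}s')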

We further compare our method against two baselines: 1) a Reward Coder-only baseline, where an LLM directly maps user instructions to reward code without going through the Motion Descriptor, and 2) a Code-as-Policies (CaP) baseline, where the LLM generates a plan for the robot motion using a set of pre-defined robot primitive skills instead of reward functions. For the CaP baseline, we design the primitive skills based on common commands available to the robot.
As shown below, our proposed approach achieves a notably higher success rate for 11 of the 17 task categories and comparable performance on the rest, demonstrating the effectiveness and reliability of the proposed method.

We further showcase two examples where we teach the robot to perform complex tasks through multiple rounds of interactions.

Instruction 1: Make the robot stand upright on two back feet like a human.
Instruction 2: Good, you actually don't need to keep the front paws at a certain height; just leave them to the controller.
Instruction 3: Good, now make the robot do a moonwalk.
Instruction 4: Moonwalk means the robot should walk backward while the feet swing as if they are moving forward. Correct your answer.

Instruction 1: Open the drawer.
Instruction 2: Good, now put the apple inside the drawer while keeping it open.
Instruction 3: Good, now release the apple and move hand away.
Instruction 4: Now close the drawer.


Validation on real hardware.

We implement a version of our method on a mobile manipulator. We detect objects in image space using an open-vocabulary detector, F-VLM. We extract the associated points of the point cloud behind the mask and perform outlier rejection on points that might belong to the background. From a bird's-eye view, we fit a minimum-volume rectangle and take the extremes of the remaining points to determine the extent along the z-axis. We demonstrate sim-to-real transfer on two tasks: object pushing and object grasping. Our system generates relevant reward code, and MuJoCo MPC synthesizes the corresponding pushing and grasping motions.
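
The object-extent estimation can be sketched roughly as follows. This assumes the detector provides a mask aligned with the depth image and that the masked points have already been back-projected to 3D; the percentile-based outlier rejection and the use of OpenCV's minAreaRect are illustrative choices, not necessarily our exact pipeline.

import numpy as np
import cv2

def estimate_object_box(points_xyz):
    # `points_xyz` is an (N, 3) array of 3D points inside the detection mask.
    # Reject points that likely belong to the background by trimming the
    # tails of the height distribution (illustrative 2nd/98th percentiles).
    z = points_xyz[:, 2]
    lo, hi = np.percentile(z, [2, 98])
    inliers = points_xyz[(z >= lo) & (z <= hi)]

    # Fit a minimum-area rectangle to the points seen from a bird's-eye view.
    xy = inliers[:, :2].astype(np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(xy)

    # Take the extremes of the remaining points for the extent along z.
    z_min, z_max = float(inliers[:, 2].min()), float(inliers[:, 2].max())
    return (cx, cy), (w, h), angle, (z_min, z_max)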

Prompts

Quadruped: Motion Descriptor | Reward Coder

Dexterous Manipulator: Motion Descriptor | Reward Coder

Real Manipulator: Motion Descriptor | Reward Coder

Citation

Acknowledgements

The authors would like to acknowledge Ken Caluwaerts, Kristian Hartikainen, Steven Bohez, Carolina Parada, Marc Toussaint, and the greater teams at Google DeepMind for their feedback and contributions.

The website template was borrowed from Jon Barron.