ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

ICRA 2023

Extended version in Autonomous Robots 2023

University of Southern California, NVIDIA

Abstract

Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks.

Video

ProgPrompt

We introduce a prompting method that goes beyond conditioning LLMs on natural language, instead using programming-language structures and leveraging the fact that LLMs are trained on several open-source codebases. ProgPrompt provides an LLM with a pythonic program header that imports the available actions and their expected arguments, shows a list of environment objects, and then gives multiple example task plans formatted as pythonic functions. The function name is the task specification, and the function body is an example task plan. A plan consists of comments, actions, and assertions. Comments group multiple high-level actions together, similar to chain-of-thought reasoning. Actions are expressed as calls to the imported functions. Assertions check action preconditions and trigger recovery actions when they fail. Finally, we append an incomplete function definition for the LLM to complete. The generated plan is interpreted by executing actions in the environment and checking assertion preconditions using the LLM.
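As a rough illustration of this structure, here is a minimal sketch of such a prompt, written as a text string because the pythonic assertion syntax is interpreted by ProgPrompt's plan executor rather than by the Python interpreter. The action names, object list, example task, and assertion syntax below are illustrative assumptions; the actual prompt is shown in the Full Prompt section.

    # Sketch of a ProgPrompt-style prompt assembled as a text string. The action
    # names, objects, example plan, and assertion syntax are illustrative
    # assumptions, not the paper's exact prompt (see the Full Prompt section).
    PROGPROMPT_SKETCH = '''
    from actions import grab, putin, open, close, switchon, find

    objects = ['apple', 'garbagecan', 'salmon', 'fridge', 'microwave', 'plate']

    def throw_away_apple():
        # 1: grab the apple
        assert('close' to 'apple') else: find('apple')
        grab('apple')
        # 2: put it in the garbage can
        assert('garbagecan' is 'open') else: open('garbagecan')
        putin('apple', 'garbagecan')

    def microwave_salmon():
    '''
    # The LLM is asked to complete the body of microwave_salmon(). The completed
    # pythonic plan is then parsed: actions are executed in the environment and
    # each assertion's precondition is checked, with the action after `else:`
    # run as a recovery step if the check fails.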

Results

Real Robot Demo
Task: sort fruits on the plate and bottle in the box

VirtualHome Demo
Task: microwave salmon

VirtualHome Results

Tasks: bring coffeepot and cupcake to the coffee table, brush teeth, eat chips on the sofa, make toast, put salmon in the fridge, throw away apple, turn off light, wash the plate, watch tv

Full Prompt

Generated Task Programs

FAQs

How does this approach compare with end-to-end robot learning models, and what are the current limitations?

ProgPrompt is a hierarchical solution to task planning: it leverages the LLM's reasoning over an abstract task description and maps the resulting task plan to grounded environment labels. In end-to-end approaches, on the other hand, the model generally learns reasoning, planning, and grounding implicitly, mapping the abstract task description directly to the action space.
Pros:

  1. LLMs can do long-horizon planning from an abstract task description.
  2. Decoupling the LLM planner from the environment makes generalization to new tasks and environments feasible.
  3. ProgPrompt enables LLMs to intelligently combine the robot's capabilities, the objects available in the environment, and their own reasoning ability to generate an executable and valid task plan.
  4. The precondition checking helps recover from some failure modes that arise when actions are generated in the wrong order or are missing from the base plan.

Cons:
  1. Requires discretization of the action space and formalization of the environment and its objects.
  2. Plan generation is open-loop; interaction with the environment is limited to commonsense precondition checking.
  3. Plan generation doesn't consider low-level continuous aspects of the environment state; it reasons only over the semantic state, both for planning and for precondition checking.
  4. The amount of information exchanged between the language model and other modules, such as the robot's perceptual or proprioceptive state encoders, is limited, since API-based access to these recent LLMs only allows textual queries. However, this limitation is itself instructive: it indicates the need for a multimodal encoder that can work with inputs such as vision, touch, force, and temperature.

How does it compare with the concurrent work: Code-as-Policies (CaP)?

  1. We believe the general approach is quite similar to ours. CaP defines Hints and Examples, which roughly correspond to the Imports/Object lists and Task Plan examples in ProgPrompt.
  2. CaP expresses actions as API calls whose parameters are quantities such as robot arm pose and velocity. We express actions as API calls whose parameters are objects.
  3. CaP also uses APIs to obtain environment information, such as object poses or segmentations, for the purpose of plan generation. ProgPrompt instead extracts environment information via precondition checking on the current environment state, to ensure plan executability. ProgPrompt also generates the prompt conditioned on information from perception models.

During "PROMPT for State Feedback", it seems that the prompt already includes all information about the environment state. Is it necessary to prompt the LLM again for the assertion (compared to a simple rule-based algorithm)?

  1. For brevity, the environment state included in the prompt is not the full state. Checking preconditions against the full state separately therefore helps, as shown in Table 1 of the paper.
  2. The environment state could change during execution.
  3. Using the LLM rather than a rule-based algorithm is a design choice made to keep the approach general, instead of relying on hand-coded rules (a minimal sketch of such an LLM query follows this list). The assertion checking could also be replaced with a module conditioned on visual state when a semantic state is not available, as in the real-world scenario. However, we leave these aspects to future research.
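As a rough sketch of that design choice, an assertion can be posed to the LLM as a short text query over the current semantic state. The helper below is hypothetical: query_llm, the state-fact format, and the prompt wording are assumptions, not the paper's verbatim state-feedback prompt.

    # Hypothetical sketch of LLM-based precondition (assertion) checking.
    # `query_llm` stands in for whichever completion API is used; the state
    # representation and prompt wording are assumptions.
    def check_precondition(query_llm, state_facts, precondition):
        prompt = (
            "Current state:\n"
            + "\n".join(state_facts)
            + f"\nQuestion: is it true that {precondition}? Answer True or False."
            + "\nAnswer:"
        )
        return query_llm(prompt).strip().lower().startswith("true")

    # Example use while executing "assert('fridge' is 'open') else: open('fridge')":
    #   if not check_precondition(query_llm,
    #                             ["salmon is inside fridge", "fridge is closed"],
    #                             "fridge is open"):
    #       execute_recovery("open('fridge')")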

Is it possible that the generated code will lead the robot to be stuck in an infinite loop?

LLM code generation could fall into loops by predicting the same actions repeatedly as a generation artifact. LLMs used to suffer from such degeneration, but with the latest LLMs (e.g., GPT-3) we have not encountered it at all.

Why are real-robot experiments simpler than virtual experiments?

The real-robot experiments serve as a demonstration of the approach on physical hardware, while we study the method in depth in a virtual simulator for the sake of simplicity and efficiency.

What’s the difference between various GPT3 model versions used in this project?

We use 'GPT3' to refer to the latest version of the GPT-3 model available from OpenAI at the time the paper was written, 'text-davinci-002'. We use 'davinci' to refer to the original released version of GPT-3, 'text-davinci'. More information on GPT-3 model variations and naming can be found here.

Why not use a planning language like PDDL (or another planning language) to construct ProgPrompt? Are there advantages to using a pythonic structure?

  1. GPT-3 has been trained on data from the internet. There is a lot of Python code on the internet, while PDDL is a language of much narrower interest. Thus, we expect the LLM to understand Python syntax better.
  2. Python is a general-purpose language, so it has more features than PDDL. Furthermore, we want to avoid specifying the full planning domain, instead relying on the knowledge learned by the LLM to make common-sense inferences. A recent work (Translating Natural Language to Planning Goals with Large-Language Models) uses LLMs to generate PDDL goals; however, it requires a full domain specification for a given environment.
  3. Python is an accessible language that a larger community is familiar with.

How to handle multiple instances of the same object type in the scene?

ProgPrompt doesn't tackle this issue; however, Translating Natural Language to Planning Goals with Large-Language Models shows that multiple instances of the same object type can be handled by using labels with object IDs, such as 'book_1' and 'book_2'.
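As a small illustration of that scheme (borrowed from the follow-up work above, not something ProgPrompt itself implements), the object list and the generated plan can refer to specific instances:

    # Hypothetical object list with instance IDs for duplicate object types.
    objects = ['book_1', 'book_2', 'shelf_1', 'table_1']

    # A generated plan can then pick out one instance unambiguously, e.g.:
    #   grab('book_2')
    #   putback('book_2', 'shelf_1')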

Why doesn't the paper compare the performance of the proposed method to Inner Monologue, SayCan, or Socratic Models?

At the time of writing, the datasets and models from the above papers were not public. However, we do compare with a proxy approach in the VirtualHome environment that is similar in underlying idea: the 'LangPlan' baseline uses GPT-3 to generate textual plan steps, which are then executed by a trained GPT-2-based policy.

So is the next step in this direction of research to create highly structured inputs and outputs that can be compiled, since eventually we want something that compiles on robotic machines?

The disconnect and information bottleneck between the LLM planning module and the skill execution module make it unclear how much information, and of what kind, should be passed through the LLM during planning. That said, we think this would be an interesting direction to pursue, testing the limits of LLMs' understanding and generation of highly structured inputs and outputs.

How does it compare to a classical planner?

  1. Classical planners require a concrete goal-condition specification. An LLM planner reasons out a feasible goal state from a high-level task description such as "microwave salmon". From a user's perspective, it is desirable to simply give an instruction to act on, rather than having to specify a concrete semantic goal state of the environment.
  2. Without the common-sense priors that an LLM planner leverages, the search space of a classical planner would also be huge. Moreover, we bypass the need to specify the domain knowledge required for the search to roll out.
  3. Finally, the domain specification and the search space grow non-linearly with the complexity of the environment.

Is it possible to decouple high-level language planning from low-level perceptual planning?

It may be feasible to an extent; however, we believe a clean decoupling might not be "all we need". Imagine an agent stuck at an action whose failure needs to be resolved at the semantic level of reasoning and would be very hard for the visual module to figure out. For instance, while placing a dish on an oven tray, the robot may first need to pull the dish rack out of the oven to succeed at the task.

What kinds of failures can happen with a ProgPrompt-like two-stage decoupled pipeline?

A few broad failure categories could be:

  1. Generation of a semantically wrong action.
  2. The robot might fail to execute the action at the perception/action/skill level.
  3. The robot needs to recover from a failure by taking a different high-level action, i.e., a precondition needs to be satisfied first. The challenge is to identify that precondition from the current state of the environment and the agent (see the execution-loop sketch after this list).
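To make these categories concrete, here is a hypothetical execution-loop sketch showing where each one surfaces; plan, step, and env are assumed interfaces for illustration, not ProgPrompt's actual API.

    # Hypothetical execution loop sketching where each failure category surfaces.
    # `plan`, `step`, and `env` are assumed interfaces, not ProgPrompt's API.
    def execute_plan(plan, env):
        for step in plan:
            # Category 3: each assert names a precondition; if it does not hold in
            # the current semantic state, run the associated recovery action first.
            for precondition, recovery in step.assertions:
                if not env.holds(precondition):
                    env.execute(recovery)
            # Category 2: the low-level skill may still fail at the
            # perception/action/skill level.
            if not env.execute(step.action):
                return False
            # Category 1 (a semantically wrong action) typically only shows up later
            # as an unmet task goal, even though every individual step "succeeded".
        return True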

BibTeX

@INPROCEEDINGS{10161317,
  author={Singh, Ishika and Blukis, Valts and Mousavian, Arsalan and Goyal, Ankit and Xu, Danfei and Tremblay, Jonathan and Fox, Dieter and Thomason, Jesse and Garg, Animesh},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)}, 
  title={ProgPrompt: Generating Situated Robot Task Plans using Large Language Models}, 
  year={2023},
  volume={},
  number={},
  pages={11523-11530},
  doi={10.1109/ICRA48891.2023.10161317}}