Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even to generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates on VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks.
We introduce a prompting method that goes beyond conditioning LLMs on natural language, instead using programming-language structures and leveraging the fact that LLMs are trained on several open-source codebases. ProgPrompt provides the LLM with a Pythonic program header that imports the available actions and their arguments, a list of environment objects, and multiple example task plans formatted as Pythonic functions. Each function name is a task specification, and the function implementation is an example task plan. A plan consists of comments, actions, and assertions. Comments group multiple high-level actions together, similar to chain-of-thought reasoning. Actions are expressed as imported function calls. Assertions check action preconditions and trigger recovery actions. Finally, we append an incomplete function definition for the LLM to complete. The generated plan is interpreted by executing its actions in the environment and asserting preconditions using the LLM.
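To make the structure concrete, here is a minimal sketch of what a ProgPrompt-style prompt could look like, assembled as a plain Python string. The action names, object list, and example task are illustrative stand-ins rather than the paper's exact prompt text, and the `assert ... else:` line uses the prompt's Pythonic pseudo-syntax, which is interpreted by the planner rather than executed as Python.

```python
# A minimal sketch of a ProgPrompt-style prompt (illustrative, not the
# paper's exact prompt). The header imports available actions, lists the
# objects in the scene, and shows a fully worked example plan; the final
# incomplete function is what the LLM is asked to complete.
prompt = """
from actions import walk, grab, open, close, putin, switchon

objects = ['salmon', 'fridge', 'plate', 'sink', 'faucet', 'light']

def put_salmon_in_the_fridge():
    # 1: grab the salmon
    walk('salmon')
    grab('salmon')
    # 2: open the fridge and place the salmon inside
    walk('fridge')
    assert('fridge' is 'open') else: open('fridge')
    putin('salmon', 'fridge')
    close('fridge')

def throw_away_apple():
"""
# The LLM's completion of throw_away_apple() is parsed, and each action
# call is executed in the environment; assertion lines are checked by
# querying the LLM (or simulator) about the current state.
```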
Example household tasks demonstrated with ProgPrompt:

| bring coffeepot and cupcake to the coffee table | brush teeth | eat chips on the sofa |
|---|---|---|
| make toast | put salmon in the fridge | throw away apple |
| turn off light | wash the plate | watch tv |
How does this approach compare with end-to-end robot learning models, and what are the current limitations?
ProgPrompt is a hierarchical solution to task planning: the abstract task description leverages the LLM's reasoning, and the resulting task plan is mapped to grounded environment labels. In end-to-end approaches, by contrast, the model generally learns reasoning, planning, and grounding implicitly while mapping the abstract task description directly to the action space.
Pros of the hierarchical approach include interpretability of the intermediate plan, no need for task-specific robot training data, and transfer of the same prompt structure across environments, robot capabilities, and tasks. A current limitation is the information bottleneck between the LLM planning module and the skill execution module, discussed further below.
How does it compare with the concurrent work Code-as-Policies (CaP)?
Both works prompt LLMs with Pythonic program structure. CaP focuses on generating policy code that directly composes perception and control primitives, while ProgPrompt focuses on generating situated high-level task plans whose steps are grounded in the available actions and objects of the current environment; the two ideas are largely complementary.
During "PROMPT for State Feedback", it seems that the prompt already includes all information about the environment state. Is it necessary to prompt the LLM again for the assertion (compared to a simple rule-based algorithm)?
Is it possible that the generated code will lead the robot to be stuck in an infinite loop?
LLM code generation could produce loops by predicting the same actions repeatedly as a generation artifact. Earlier LLMs suffered from such degeneration, but with the latest models (e.g., GPT-3) we have not encountered it at all.
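As a precaution, a simple guard can be added when interpreting generated plans. The sketch below is a hypothetical safeguard, not part of the paper's pipeline: it bounds the plan length and stops on immediate repetition.

```python
# Hypothetical safety guard when executing an LLM-generated plan (not part
# of the paper's pipeline): bound the plan length and bail out if the model
# degenerates into repeating the same action.
MAX_STEPS = 50

def execute_plan(steps, execute_action):
    prev = None
    for i, step in enumerate(steps):
        if i >= MAX_STEPS:
            break  # plan is suspiciously long; likely a generation artifact
        if step == prev:
            break  # immediate repetition; likely degeneration
        execute_action(step)
        prev = step
```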
Why are real-robot experiments simpler than virtual experiments?
The real-robot experiments were intended as a demonstration of the approach on physical hardware, while the method was studied in depth in a virtual simulator for the sake of simplicity and efficiency.
What's the difference between the various GPT-3 model versions used in this project?
In the paper, 'GPT3' refers to the latest version of GPT-3 available from OpenAI at the time of writing, 'text-davinci-002'; 'davinci' refers to the originally released version of GPT-3. More information on GPT-3 model variants and naming can be found in OpenAI's model documentation.
Why not use a planning language like PDDL (or another planning language) to construct ProgPrompt? Are there advantages to using a Pythonic structure?
LLMs are trained on large amounts of open-source code, the majority of which is in popular general-purpose languages like Python rather than in specialized planning languages like PDDL, so a Pythonic prompt structure aligns better with the model's training distribution. Python also naturally expresses the elements ProgPrompt needs: imports for available actions, variables for objects, comments for chain-of-thought structure, and assertions for preconditions.
How to handle multiple instances of the same object type in the scene?
ProgPrompt doesn't tackle this issue; however, 'Translating Natural Language to Planning Goals with Large-Language Models' shows that multiple instances of the same object type can be handled by using labels with object IDs, such as 'book_1, book_2'.
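For illustration, such instance labels could appear both in the prompt's object list and in generated action calls. In this hypothetical sketch, `grab` and `puton` are stand-in action stubs, not ProgPrompt's actual action set.

```python
# Hypothetical sketch of instance disambiguation via object IDs.
# grab/puton are illustrative stubs, not an API from the paper.
def grab(obj):
    print(f"grab({obj})")

def puton(obj, target):
    print(f"puton({obj}, {target})")

objects = ['book_1', 'book_2', 'shelf', 'table']

def put_books_on_the_shelf():
    # the ID suffix makes it unambiguous which instance each action targets
    for book in ['book_1', 'book_2']:
        grab(book)
        puton(book, 'shelf')

put_books_on_the_shelf()
```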
Why doesn't the paper compare the performance of the proposed method to Inner Monologue, SayCan, or Socratic Models?
At the time of writing, the datasets and models from the above papers were not public. However, we do compare with a proxy approach, similar in underlying idea to the above approaches, in the VirtualHome environment: 'LangPlan' in our baselines uses GPT-3 to produce textual plan steps, which are then executed by a trained GPT-2-based policy.
So the next step in this direction of research is to create highly structured inputs and outputs that can be compiled, since eventually we want something that compiles and runs on robot hardware?
The disconnect and information bottleneck between the LLM planning module and the skill execution module make it hard to pin down "how much" and "what" information should be passed through the LLM during planning. That said, we think this would be an interesting direction to pursue, testing the limits of LLMs' ability to understand and generate highly structured input.
How does it compare to a classical planner?
A classical planner requires a complete symbolic specification of the domain (action preconditions and effects) and of the goal, which is exactly the manual engineering effort ProgPrompt aims to reduce. In exchange for the LLM's commonsense knowledge and flexibility with natural-language task descriptions, ProgPrompt gives up the formal optimality and completeness guarantees that classical planners provide.
Is it possible to decouple high-level language planning from low-level perceptual planning?
It may be feasible to an extent; however, we believe that a clean decoupling might not be "all we need". Imagine an agent stuck at an action that needs to be resolved at the semantic level of reasoning, which is probably very hard for the visual module to figure out on its own. For example, while placing a dish on an oven tray, the robot may need to pull the dish rack out of the oven to succeed at the task.
What kinds of failures can happen with a ProgPrompt-like two-stage decoupled pipeline?
A few broad failure categories could be: (1) planning failures, where the LLM produces a plan that is semantically wrong or incomplete for the task; (2) grounding failures, where the generated plan references actions or objects that are unavailable in the current environment; and (3) execution failures, where a low-level skill fails even though the corresponding plan step is correct.
@INPROCEEDINGS{10161317,
  author={Singh, Ishika and Blukis, Valts and Mousavian, Arsalan and Goyal, Ankit and Xu, Danfei and Tremblay, Jonathan and Fox, Dieter and Thomason, Jesse and Garg, Animesh},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
  title={ProgPrompt: Generating Situated Robot Task Plans using Large Language Models},
  year={2023},
  pages={11523-11530},
  doi={10.1109/ICRA48891.2023.10161317}}