Inner Monologue: Embodied Reasoning through Planning with Language Models
Abstract
Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just which skills to perform, but also how and when to perform them: answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion in three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment.
Approach
We formulate an "inner monologue" by continually substituting information from the various sources of feedback into the LLM planning prompts as the robot interacts with the environment. While LLMs have demonstrated exceptional planning capabilities for embodied control tasks, prior works have found it crucial to ground LLM predictions with external components, such as affordance functions, in order to produce useful plans that are executable by robots. However, LLMs used in this context have thus far remained one-directional: they provide a list of skills without making corrections or leveraging opportunities to re-plan. In contrast, Inner Monologue studies settings where grounded environment feedback is provided directly to the LLM in a closed-loop fashion. This promotes improved LLM reasoning in complex long-horizon settings, even before any external affordance-based grounding methods are applied.
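As a concrete illustration, below is a minimal sketch of this closed-loop structure, assuming hypothetical `query_llm`, `execute_skill`, and `collect_feedback` helpers that are not part of any released implementation:

```python
# A minimal sketch of the closed-loop planning structure described above.
# `query_llm`, `execute_skill`, and `collect_feedback` are hypothetical
# stand-ins for the LLM planner, the low-level skills, and the grounded
# feedback sources (success detection, scene description, human answers).

def inner_monologue(instruction, query_llm, execute_skill, collect_feedback,
                    max_steps=20):
    """Interleave LLM planning with textual environment feedback."""
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        # Ask the LLM for the next skill, conditioned on the full history.
        action = query_llm(prompt + "Robot action:")
        if "done" in action.lower():
            break
        execute_skill(action)
        # Substitute grounded feedback back into the prompt as natural
        # language, so the next planning step can react to it.
        feedback = collect_feedback()
        prompt += f"Robot action: {action}\n{feedback}\n"
    return prompt
```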
Results
In order to study how different sources of environment feedback can support a rich inner monologue that enables complex robotic control, we analyze diverse long-horizon manipulation and navigation tasks in simulation and in the real world. As Inner Monologue is not dependent on a specific LLM or type of grounding feedback, we study different Inner Monologue implementations in three environments, with different LLM planning methods and different sources of feedback from the environment. For more details about the experiments, implementations, and the prompts used by the LLM in each domain, please refer to the paper and its appendix.
Simulated Tabletop Rearrangement
Given an unseen task instruction, we show that the LLM can not only generate sensible action plans, as observed in previous works, but also incorporate injected textual feedback from success detection and passive scene description. The video below shows one instantiation that uses passive scene description as feedback (Scene). Specifically, the LLM first infers the desired sub-tasks from the high-level instruction. The scene description then keeps track of the achieved sub-tasks after each step. Additionally, the LLM generates chain-of-thought text about what remains to be achieved after each step. We demonstrate that this can elicit complex replanning behaviors in tasks with a combinatorial state space (e.g., "put all blocks in bowls with matching colors", "stack all the blocks").
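To make the Scene feedback concrete, here is an illustrative (not verbatim) episode in the format sketched above; the exact prompts used in this domain appear in the paper's appendix:

```python
# An illustrative episode with passive scene description (Scene) feedback
# and chain-of-thought "Robot thought" lines; the wording is an
# approximation, not the paper's exact prompt.
EXAMPLE_SCENE_EPISODE = """\
Human: put all blocks in bowls with matching colors
Scene: There is a red block, a blue block, a red bowl, and a blue bowl.
Robot thought: The red block goes in the red bowl and the blue block goes in the blue bowl.
Robot action: Pick up the red block and place it in the red bowl.
Scene: The red block is in the red bowl.
Robot thought: The blue block still needs to go in the blue bowl.
Robot action: Pick up the blue block and place it in the blue bowl.
Scene: The red block is in the red bowl. The blue block is in the blue bowl.
Robot thought: All blocks are in bowls with matching colors. I am done.
Robot action: Done.
"""
```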
Real-World Tabletop Rearrangement
We demonstrate another implementation of Inner Monologue in a real-world tabletop environment, where perceptual models may be subject to occlusions. We leverage passive scene description (implemented as object recognition) and success detection feedback (Object + Success). Given the list of visible objects and past interactions, we prompt the LLM to reason about occluded objects and achieved sub-goals. We show that this enables Inner Monologue to complete tasks like "stack all the blocks" and "put bottles and fruits in different plates", even under considerable perturbations to the primitive policy.
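A minimal sketch of how these two feedback sources could be rendered into one textual update is shown below; `detect_objects` and `detect_success` are assumed stand-ins for the perception models, not the actual interfaces used in the paper:

```python
# Sketch: combine object recognition (Object) and success detection
# (Success) into a single natural-language feedback string for the prompt.

def object_and_success_feedback(image, last_action, detect_objects,
                                detect_success):
    visible = detect_objects(image)  # e.g. ["red block", "green bowl"]
    succeeded = detect_success(image, last_action)
    # Only visible objects are listed; the LLM is prompted to reason about
    # occluded objects from this list and the interaction history.
    scene_line = f"Scene: visible objects are {', '.join(visible)}."
    success_line = f"Success: {'true' if succeeded else 'false'}."
    return scene_line + "\n" + success_line
```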
Real-World Mobile Manipulation
The method also extends to complex, realistic household tasks given a wider range of skills beyond pick-and-place. In the video below, we leverage success detection feedback (Success). Although failures naturally occur in such settings, we use adversarial human interventions to force policy failures in order to demonstrate the replanning capability of Inner Monologue. We show that the language model can effectively replan if the previous step has failed. This allows the robot to recover from failures and complete complex tasks like "put a coke in the top drawer".
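The transcript below is an illustrative (not verbatim) example of this recovery behavior: a failed step is reported as "Success: false", and the LLM re-issues the skill before proceeding:

```python
# An illustrative failure-recovery episode with success detection (Success)
# feedback; the wording approximates, rather than reproduces, the prompts
# used in the mobile manipulation domain.
EXAMPLE_RECOVERY_EPISODE = """\
Human: put a coke in the top drawer
Robot action: find a coke
Success: true
Robot action: pick up the coke
Success: false
Robot action: pick up the coke
Success: true
Robot action: open the top drawer
Success: true
Robot action: put the coke in the top drawer
Success: true
Robot action: done
"""
```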
Emergent Capabilities
Due to the versatility of LLMs, we also observe interesting emergent behaviors when we incorporate embodied feedback into the LLM's prompt. With an appropriate prompt structure, the LLM can be made to optionally ask questions after each step, and a human may provide free-form answers (i.e., Human feedback). Beyond directly answering the LLM's questions, we find that the human operator can also interrupt the current task, re-command the robot to perform a different task, or even ask the robot to return to a previous task. Importantly, none of these behaviors are shown in the few-shot prompt for the LLM. Yet, because of its ability to condition on free-form text, the LLM can naturally incorporate these questions and answers to generate correct and grounded future actions. See the video below for an example.
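One way such a question-asking channel could be wired up is sketched below; the "Robot ask:" prefix and the routing logic are assumptions for illustration, not the paper's exact implementation:

```python
# Sketch: let the LLM optionally emit "ask" steps, and fold free-form
# human replies (answers, interruptions, new commands) back into the
# prompt verbatim.

def step_with_human_feedback(prompt, query_llm, execute_skill):
    output = query_llm(prompt)
    if output.startswith("Robot ask:"):
        # The LLM chose to ask a question; any free-form human reply,
        # including a brand-new command, becomes part of the history.
        answer = input(output + " ")
        return prompt + f"{output}\nHuman: {answer}\n"
    execute_skill(output)
    return prompt + f"{output}\n"
```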