arxiv:2410.17856

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

Published on Oct 23
· Submitted by phython96 on Oct 28
#1 Paper of the day

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. A key issue is the difficulty of smoothly connecting individual entities in low-level observations with the abstract concepts required for planning. A common approach to this problem is to use hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language and imagined observations. However, language often fails to effectively convey spatial information, while generating future images with sufficient accuracy remains challenging. To address these limitations, we propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs' vision-language reasoning abilities, enabling them to solve complex creative tasks, especially those heavily reliant on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making. Code and demos will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
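
The abstract says the policy consumes "concatenated visual observations and segmentation masks". As a rough illustration of what that could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `build_policy_input`, the array shapes, the normalization to [0, 1], and the choice to append the mask as a fourth channel are all assumptions made for this example, and the masks are hand-made here rather than produced by SAM-2.

```python
import numpy as np


def build_policy_input(frames, masks):
    """Append each frame's object mask to its RGB channels.

    Hypothetical helper, for illustration only.
    frames: (T, H, W, 3) uint8 observations over a window of T steps
    masks:  (T, H, W) boolean or 0/1 segmentation masks, one per frame
    returns: (T, H, W, 4) float32 array with values in [0, 1]
    """
    frames = np.asarray(frames, dtype=np.float32) / 255.0
    masks = np.asarray(masks, dtype=np.float32)
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("frames must have shape (T, H, W, 3)")
    if masks.shape != frames.shape[:3]:
        raise ValueError("masks must have shape (T, H, W) matching frames")
    # Channel-wise concatenation: RGB + 1 mask channel per time step.
    return np.concatenate([frames, masks[..., None]], axis=-1)


# Toy data: 4 time steps of 8x8 frames, with the same region marked in each.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
masks = np.zeros((4, 8, 8), dtype=bool)
masks[:, 2:5, 3:6] = True  # stand-in for a tracked object's mask

policy_input = build_policy_input(frames, masks)
print(policy_input.shape)
```

In this sketch the mask channel carries the spatial cue that a language instruction would struggle to express, while the time axis keeps past and present observations together. How ROCKET-1 actually encodes and fuses these inputs is described in the paper itself.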

Community

Paper author Paper submitter

🚀 Breaking New Ground in Minecraft AI! Vision-language models (VLMs) have struggled to tackle embodied decision-making in open worlds—until now. Introducing ROCKET-1 with visual-temporal context prompting! Our approach bridges low-level observations with high-level reasoning, using real-time object tracking and segmentation to achieve tasks that were previously unattainable. This breakthrough unlocks VLMs’ spatial reasoning, enabling agents to master complex, creative tasks in Minecraft. 🌍✨


🚀🚀ROCKET-1 can interact with the UI in Minecraft!!


🚀🚀ROCKET-1 can build some structures.


🚀🚀ROCKET-1 can increase its elevation underground by digging a step.


🚀🚀ROCKET-1 can hunt desired mobs!!

