3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Abstract
A 3D scene graph represents a compact scene model that stores information about the objects and the semantic relationships between them, making it promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be able to respond to various natural-language queries about the scene. Large Language Models (LLMs) are well suited to user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting them to the 3D world. However, existing methods do not explicitly use information about the semantic relationships between objects, limiting themselves to the objects' coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. This representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
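At a high level, the learnable scene-graph representation described in the abstract can be thought of as a projection of 3D object and relation features into the LLM token embedding space, flattened into a single sequence that is spliced into the prompt. The sketch below is an illustration of that idea only, not the authors' implementation; all module, argument, and dimension names (e.g. `SceneGraphToLLMTokens`, `obj_proj`, `rel_proj`) are hypothetical.

```python
import torch
import torch.nn as nn


class SceneGraphToLLMTokens(nn.Module):
    """Hypothetical sketch: map 3D object/relation features into the LLM
    token embedding space and flatten the scene graph into one sequence."""

    def __init__(self, obj_feat_dim: int, rel_feat_dim: int, llm_hidden_dim: int):
        super().__init__()
        # Learnable projections from 3D features to LLM token embeddings.
        self.obj_proj = nn.Linear(obj_feat_dim, llm_hidden_dim)
        self.rel_proj = nn.Linear(rel_feat_dim, llm_hidden_dim)

    def forward(self, obj_feats, edges, rel_feats):
        """
        obj_feats: (N, obj_feat_dim)  per-object features from a 3D encoder
        edges:     list of (i, j) pairs of selected object neighbors
        rel_feats: (E, rel_feat_dim)  semantic relation features for those pairs
        Returns a flat (T, llm_hidden_dim) sequence: each object token is
        followed by its (relation token, neighbor token) pairs.
        """
        obj_tokens = self.obj_proj(obj_feats)
        rel_tokens = self.rel_proj(rel_feats)

        seq = []
        for i in range(obj_feats.size(0)):
            seq.append(obj_tokens[i])
            for e, (src, dst) in enumerate(edges):
                if src == i:
                    seq.append(rel_tokens[e])
                    seq.append(obj_tokens[dst])
        return torch.stack(seq, dim=0)
```

In such a setup, the resulting sequence would be concatenated with the embedded text query before being passed to the LLM.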
Community
• We introduce 3DGraphLLM, the first method to create a learnable 3D scene graph representation for LLMs, enabling the mapping of semantic relationships between objects in the scene to the LLM's token embedding space.
• We propose an algorithm that produces a flat sequence of graph embedding tokens using k-nearest-neighbor selection with a minimum distance filter between objects, which speeds up inference by reducing the number of tokens required to describe the scene (see the sketch after this list).
• 3DGraphLLM shows state-of-the-art results for the 3D referred object grounding task on the Multi3DRefer (+5.8% F1@0.5) and ScanRefer (+4.4% Acc@0.5) benchmarks, as well as for 3D scene captioning on the Scan2Cap dataset (+5.8% CIDEr@0.5).
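The neighbor-selection step from the second bullet can be illustrated with a short, self-contained sketch. This is an assumption-laden illustration rather than the paper's actual code: the function name `select_neighbors`, the default `k=2`, and the `min_dist` threshold are hypothetical, chosen only to show how k-nearest-neighbor selection with a minimum distance filter yields a sparse edge set.

```python
import numpy as np


def select_neighbors(centroids: np.ndarray, k: int = 2, min_dist: float = 0.05):
    """For each object, pick its k nearest neighbors by centroid distance,
    skipping near-duplicate objects closer than `min_dist` (the minimum
    distance filter). Returns a list of (i, j) edge index pairs."""
    n = centroids.shape[0]
    edges = []
    for i in range(n):
        dists = np.linalg.norm(centroids - centroids[i], axis=1)
        dists[i] = np.inf                 # exclude the object itself
        dists[dists < min_dist] = np.inf  # drop near-duplicate detections
        for j in np.argsort(dists)[:k]:
            if np.isfinite(dists[j]):
                edges.append((i, int(j)))
    return edges
```

Restricting each object to k neighbors keeps the flattened graph sequence roughly linear in the number of objects (about N·(1 + 2k) tokens in the sketch above) instead of quadratic in the number of object pairs, which is what reduces the token count and speeds up inference.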
The model is available on the HuggingFace Hub: https://huggingface.co./wingrune/3DGraphLLM.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems (2024)
- Online Knowledge Integration for 3D Semantic Mapping: A Survey (2024)
- LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models (2024)
- VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation (2024)
- LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations (2024)
- 3D Scene Graph Guided Vision-Language Pre-training (2024)
- Open-Vocabulary Octree-Graph for 3D Scene Understanding (2024)