Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding 🛋

CVPR 2024

1 The Future Network of Intelligence Institute, The Chinese University of Hong Kong (Shenzhen)
2 School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)
3 IHPC, A*STAR, Singapore
4 The University of Hong Kong

Comparative overview of two 3DVG approaches.

(a) Supervised 3DVG takes 3D scans and text queries as input and is guided by object-text pair annotations. (b) Zero-shot 3DVG identifies the location of target objects using a programmatic representation generated by LLMs, i.e., target category, anchor category, and relation grounding. This highlights its strength in decoding spatial relations and object identifiers within a given space; e.g., the location of the keyboard (outlined in green) can be retrieved based on the distance between the keyboard and the door (outlined in blue).


3D Visual Grounding (3DVG) aims to localize 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often require extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.


Overview of two zero-shot approaches for 3DVG.

(a) shows the working mechanism of the vanilla dialog-with-LLM approach. First, we describe the 3DVG task and provide text descriptions of the room. Then, LLMs identify the objects relevant to the query sentence and perform human-like reasoning.
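
As a rough illustration of the dialog setup in (a), the sketch below assembles a task description, a text description of the room, and the query into a single prompt. The function names, the prompt wording, and the per-object fields (label, center, size) are assumptions made for this example and are not taken from the paper; no LLM is actually called.

```python
def describe_room(objects):
    """Turn hypothetical (label, center, size) tuples into one text line per object."""
    lines = []
    for obj_id, (label, center, size) in enumerate(objects):
        cx, cy, cz = center
        sx, sy, sz = size
        lines.append(
            f"Object {obj_id}: {label}, center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f})"
        )
    return "\n".join(lines)


def build_dialog_prompt(objects, query):
    """Combine task description, room description, and query (assumed wording)."""
    task = (
        "You are given a list of objects in a 3D room. "
        "Find the object that the query refers to and answer with its object ID."
    )
    return f"{task}\n\nRoom:\n{describe_room(objects)}\n\nQuery: {query}\nAnswer:"


# Toy room with made-up coordinates, in meters.
room = [
    ("keyboard", (1.0, 1.0, 0.8), (0.4, 0.15, 0.03)),
    ("door", (0.0, 1.0, 1.0), (0.9, 0.1, 2.0)),
]
prompt = build_dialog_prompt(room, "the keyboard closest to the door")
print(prompt)
```

The resulting string would be sent to an LLM of choice, whose answer is then read back as the predicted object.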

(b) presents the 3D visual programming approach. We first input in-context examples into LLMs. Then, LLMs generate 3D visual programs from the grounding descriptions and perform human-like reasoning. Next, these programs are transformed into executable Python code via the LOC module to predict the location of the object. For example, the upper example uses the view-independent module, i.e., CLOSEST, to determine the proximity in 3D space, while the lower example applies the view-dependent module, i.e., RIGHT, to establish the relative positioning.
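
To make the two module types concrete, here is a minimal sketch of how a CLOSEST and a RIGHT module might operate on object centers. It is not the authors' implementation: the `Box` type, the `loc` stand-in (a plain label match instead of the actual LOC module), the z-up coordinate convention, and the rule that "right" is measured from a viewer looking at the anchor are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    center: tuple  # (x, y, z) with z pointing up (assumed convention)


def loc(boxes, category):
    """Hypothetical stand-in for LOC: keep boxes whose label matches exactly."""
    return [b for b in boxes if b.label == category]


def closest(targets, anchors):
    """View-independent: the target with the smallest distance to any anchor."""
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, a.center) for a in anchors),
    )


def right(targets, anchor, viewpoint):
    """View-dependent: targets to the right of the anchor for a viewer at
    `viewpoint` who looks toward the anchor (ground-plane test only)."""
    fx = anchor.center[0] - viewpoint[0]
    fy = anchor.center[1] - viewpoint[1]
    norm = math.hypot(fx, fy)
    if norm == 0:
        raise ValueError("viewpoint must differ from the anchor in the ground plane")
    # Rotating the forward direction by -90 degrees gives the viewer's right.
    rx, ry = fy / norm, -fx / norm
    return [
        t for t in targets
        if (t.center[0] - anchor.center[0]) * rx
        + (t.center[1] - anchor.center[1]) * ry > 0
    ]


# Toy scene with made-up coordinates.
scene = [
    Box("keyboard", (1.0, 1.0, 0.8)),
    Box("keyboard", (4.0, 3.0, 0.8)),
    Box("door", (0.0, 1.0, 1.0)),
    Box("desk", (0.0, 2.0, 0.4)),
    Box("chair", (1.0, 2.0, 0.5)),
    Box("chair", (-1.0, 2.0, 0.5)),
]

# "the keyboard closest to the door"
near = closest(loc(scene, "keyboard"), loc(scene, "door"))

# "the chair to the right of the desk", viewer standing at the origin
desk = loc(scene, "desk")[0]
on_right = right(loc(scene, "chair"), desk, viewpoint=(0.0, 0.0, 0.0))
```

CLOSEST needs only the geometry of the scene, whereas RIGHT cannot be evaluated without fixing a viewpoint, which is why the two belong to different module types.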


3DVG results on ScanRefer validation set.

Accuracy is reported on the "unique" subset, the "multiple" subset, and the whole validation set. Following ScanRefer, we label a sample as "unique" if the scene contains only a single object of the target class. Otherwise, we label it as "multiple".
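
The subset split can be sketched as follows. This is an illustrative example with a made-up data layout: the field names, scene IDs, and the precomputed `correct` flag are assumptions, and the criterion that decides whether a prediction counts as correct is outside the scope of the sketch.

```python
from collections import Counter


def subset_of(target_label, scene_labels):
    """'unique' if the scene holds exactly one object of the target class."""
    return "unique" if Counter(scene_labels)[target_label] == 1 else "multiple"


def accuracy_by_subset(samples, labels_per_scene):
    """Accuracy on the 'unique' subset, the 'multiple' subset, and overall."""
    hits, totals = Counter(), Counter()
    for s in samples:
        subset = subset_of(s["target"], labels_per_scene[s["scene"]])
        for key in (subset, "overall"):
            totals[key] += 1
            hits[key] += int(s["correct"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy data: object labels per scene and per-sample correctness flags.
labels_per_scene = {
    "s0": ["door", "keyboard", "keyboard"],
    "s1": ["door", "chair"],
}
samples = [
    {"scene": "s0", "target": "door", "correct": True},
    {"scene": "s0", "target": "keyboard", "correct": False},
    {"scene": "s1", "target": "chair", "correct": True},
    {"scene": "s0", "target": "keyboard", "correct": True},
]
result = accuracy_by_subset(samples, labels_per_scene)
# result == {"unique": 1.0, "overall": 0.75, "multiple": 0.5}
```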