3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.
(a) shows the working mechanism of the vanilla dialog-with-LLM approach. First, we describe the 3DVG task and provide a textual description of the room. Then, the LLM identifies the objects relevant to the query sentence and performs human-like reasoning.
(b) presents the 3D visual programming approach. We first input in-context examples into the LLM. Then, the LLM generates 3D visual programs from the grounding descriptions and performs human-like reasoning. Next, these programs are transformed into executable Python code via the language-object correlation (LOC) module to predict the location of the target object. For example, the upper example uses the view-independent module, i.e., CLOSEST, to determine proximity in 3D space, while the lower example applies the view-dependent module, i.e., RIGHT, to establish relative positioning.
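To make the execution step concrete, the following is a minimal sketch (not the released implementation) of how a generated program could run once translated into Python. The module names LOC, CLOSEST, and RIGHT follow the caption; their concrete signatures, the toy scene, and the assumed viewing direction are illustrative assumptions.

```python
import numpy as np

def LOC(scene, category):
    """Hypothetical language-object correlation module: return the box
    centers of all detections whose label matches `category`."""
    return [np.asarray(b["center"]) for b in scene if b["label"] == category]

def CLOSEST(targets, anchors):
    """View-independent module: pick the target closest to any anchor."""
    return min(targets, key=lambda t: min(np.linalg.norm(t - a) for a in anchors))

def RIGHT(targets, anchors, view_dir=np.array([0.0, 1.0, 0.0])):
    """View-dependent module: pick the target farthest to the right of the
    anchors, given an assumed viewing direction in the x-y plane."""
    right = np.array([view_dir[1], -view_dir[0], 0.0])  # rotate view_dir 90 deg clockwise
    anchor_center = np.mean(anchors, axis=0)
    return max(targets, key=lambda t: np.dot(t - anchor_center, right))

# Toy scene: detected box centers with class labels.
scene = [
    {"label": "chair", "center": [1.0, 2.0, 0.5]},
    {"label": "chair", "center": [4.0, 0.5, 0.5]},
    {"label": "table", "center": [3.5, 0.8, 0.4]},
]

# "the chair closest to the table"
print(CLOSEST(LOC(scene, "chair"), LOC(scene, "table")))
# "the chair to the right of the table" (under the assumed view direction)
print(RIGHT(LOC(scene, "chair"), LOC(scene, "table")))
```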
Accuracies on the "unique" subset, the "multiple" subset, and the whole validation set are all provided. Following ScanRefer, we label a scene as "unique" if it contains only a single object of the target class; otherwise, we label it as "multiple".
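A minimal sketch of this labeling rule, with a hypothetical helper name and toy inputs, might look as follows:

```python
from collections import Counter

def label_subset(scene_labels, target_class):
    """ScanRefer convention: a sample is 'unique' if the scene contains
    exactly one object of the target class, otherwise 'multiple'."""
    counts = Counter(scene_labels)
    return "unique" if counts[target_class] == 1 else "multiple"

# e.g., a scene with one sofa but three chairs
print(label_subset(["sofa", "chair", "chair", "chair"], "sofa"))   # unique
print(label_subset(["sofa", "chair", "chair", "chair"], "chair"))  # multiple
```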