Retri3D: 3D Neural Graphics Representation Retrieval

ICLR 2025 Spotlight

Yushi Guan, Daniel Kwan, Jean Sebastien Dandurand, Xi Yan, Ruofan Liang, Yuxuan Zhang, Nilesh Jain, Nilesh Ahuja, Selvakumar Panneer, Nandita Vijaykumar
This work presents Retri3D, a framework that enables accurate and efficient retrieval of 3D scenes represented as NGRs from large data stores using text queries. We introduce a novel Neural Field Artifact Analysis technique, combined with a Smart Camera Movement Module, to select clean views and navigate pre-trained 3DNGRs.

Abstract

Learnable 3D Neural Graphics Representations (3DNGRs) have emerged as promising 3D representations for reconstructing 3D scenes from 2D images. Numerous works, including Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and their variants, have significantly enhanced the quality of these representations. Their ease of construction from 2D images, suitability for online viewing and sharing, and applications in downstream tasks such as game and art design make them a vital 3D representation, with the potential for large numbers of such 3D models to be created. This necessitates large data stores, local or online, to save 3D visual data in these formats. However, no existing framework enables accurate retrieval of stored 3DNGRs. In this work, we propose Retri3D, a framework that enables accurate and efficient retrieval of 3D scenes represented as NGRs from large data stores using text queries. We introduce a novel Neural Field Artifact Analysis technique, combined with a Smart Camera Movement Module, to select clean views and navigate pre-trained 3DNGRs. These techniques enable accurate retrieval by selecting the best viewing directions in the 3D scene for high-quality visual feature embeddings. We demonstrate that Retri3D is compatible with any NGR representation. On the LERF and ScanNet++ datasets, we show significant improvements in retrieval accuracy compared to existing techniques, while being orders of magnitude faster and more storage-efficient.

Overview

Overview of Retri3D: (a) Retri3D enables the storage and retrieval of any neural graphics representation (NGR). (b) We perform noise analysis and viewpoint selection to select high-quality, artifact-free views. (c) Visual embeddings for retrieval are generated using a pre-trained Vision-Language Model (VLM). (d) Both NGRs and visual embeddings are stored in the database. (e) Given a user query, we use the same VLM to generate a corresponding text embedding. (f) The relevant scene is retrieved based on the highest cosine similarity between the text and visual embeddings, returning the NGR and, optionally, a rendered image to the user.
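
To make steps (d)-(f) concrete, below is a minimal sketch of the retrieval step, assuming per-view visual embeddings have already been produced by a CLIP-style VLM's image encoder; the function names, shapes, and data layout are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    def build_index(scene_embeddings):
        # scene_embeddings: dict mapping scene_id -> (K, d) array of VLM
        # image embeddings for that scene's K selected clean views.
        rows, owners = [], []
        for scene_id, emb in scene_embeddings.items():
            emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize
            rows.append(emb)
            owners.extend([scene_id] * emb.shape[0])
        return np.vstack(rows), owners  # (N, d) matrix, row -> scene lookup

    def retrieve(text_embedding, index, owners):
        # On unit-norm vectors, cosine similarity reduces to a dot product.
        q = text_embedding / np.linalg.norm(text_embedding)
        sims = index @ q
        best = int(np.argmax(sims))
        return owners[best], float(sims[best])

Because scoring is a single matrix-vector product over pre-computed embeddings, lookup cost is tiny, which is consistent with the sub-millisecond retrieval times reported in the speed comparison below.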

Results

In this section, we present qualitative retrieval results. We showcase successful retrievals from various LERF scenes using object-label queries, as well as retrievals using LLaVA-generated sentence captions as queries.

Successful Retrievals from LERF Dataset with Object Label Queries

big white crinkly flower
checkerboard pattern
espresso machine
breakfast sandwich (Match 1)
breakfast sandwich (Match 2)
breakfast sandwich (Match 3)

Successful Retrievals with LLaVA-Generated Sentence Captions

Bouquet: query image and top retrieved match
(Success) Query: A kitchen with a white piano, a table with a vase of flowers, and a chair.
Donuts: query image and top retrieved match
(Success) Query: A table with a sandwich and a cup of coffee on it, along with a box of donuts.
Left: the image that generated the LLaVA caption. Right: the top-matching image from the retrieved scene. The blue mask over the entire image indicates that the whole-image embedding was matched.

Videos of Rendered Trajectories

These trajectories are produced by our noise analysis and viewpoint selection, which select high-quality, artifact-free views.

Comparison

Retrieval Accuracy (%) for Splatfacto Model on LERF Dataset

                   Object Label                                         LLaVA Caption
# Img.   Training   Smart Viewpoint   Random Viewpoint     Training   Smart Viewpoint   Random Viewpoint
1        57.89      30.08 (-27.81)    17.29 (-40.60)       45.12      32.97 (-12.15)    15.28 (-29.84)
5        68.42      64.66 (-3.76)     18.80 (-49.62)       55.07      57.17 (+2.10)     23.58 (-31.49)
10       78.95      67.67 (-11.28)    24.81 (-54.14)       67.07      61.23 (-5.84)     28.92 (-38.15)
20       83.46      73.68 (-9.78)     34.59 (-48.87)       72.41      63.25 (-9.16)     33.01 (-39.40)
50       84.21      77.69 (-6.52)     49.62 (-34.59)       71.80      65.82 (-5.98)     35.58 (-36.22)
100      84.95      80.02 (-4.93)     57.89 (-27.06)       73.67      70.48 (-3.19)     43.04 (-30.63)

Scene Coverage Statistics

         Training        Smart                               Random
# Img.   %Grid   w/θ     %Grid   w/θ    %Train   w/θ        %Grid   w/θ    %Train   w/θ
0        9.9     9.9     9.9     9.9    0.0      0.0        9.9     9.9    0.0      0.0
1        13.0    11.4    15.3    12.7   11.3     8.3        16.2    14.3   12.7     9.3
5        14.3    12.2    32.8    16.5   42.7     23.6       35.6    19.6   29.5     14.4
10       17.2    13.9    44.2    24.9   65.0     46.0       53.7    32.4   49.4     24.5
20       20.6    15.8    56.4    29.8   83.3     58.1       61.7    41.5   64.7     39.2
50       41.6    17.5    71.1    46.9   87.5     69.3       78.2    54.7   73.7     65.0
100      46.7    30.7    79.3    53.3   87.9     79.4       83.6    61.0   85.6     78.6

Speed and Storage Comparison with 50 Rendered Images for a LERF Scene

                                                   Ours                     Baselines
Stage                  Action or Storage           Splatfacto   Nerfacto    LangSplat   LERF
NGR Training           Train Time (min)            6.82         8.24        90.5        40.1
                       Model Size (MB)             478.49       176.02      958.67      1282.4
NGR Analysis           Generate Visual Emb. (s)    17.25        19.23       152.5       53.6
Database & Retrieval   Embedding Size              20.58 MB     20.58 MB    18.78 GB    225.4 GB
                       Retrieval Time (s)          5e-5         5e-5        1e-3        17

Retrieval Accuracy (%) for Splatfacto Model with Object Labels on ScanNet++

# Img.   Training   Smart            Random
10       41.62      39.63 (-1.99)    18.39 (-23.23)
20       50.06      47.34 (-2.72)    23.58 (-26.48)
50       58.23      54.39 (-3.84)    27.93 (-30.30)
100      65.03      64.18 (-0.85)    29.04 (-35.99)

Noise Analysis

Noise Analysis Training and Inference Process

Noise analysis training and inference process. (a) We generate noisy images, simply by rendering random viewpoints in pre-trained NGR scenes; some content may be noise-free, but it constitutes only a small portion of each image. (b) Using a pre-trained VLM's vision encoder, we generate pixel-wise activation features. (c) We fit a multivariate Gaussian distribution to represent the noise features. (d) During inference, we compute the Mahalanobis distance of an RGB rendering's activations to the trained noise Gaussian to produce a noise map.
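
A minimal sketch of steps (c) and (d), assuming the pixel-wise activation features have already been extracted from the VLM's vision encoder; the shapes, names, and regularization constant are illustrative assumptions rather than the paper's exact implementation.

    import numpy as np

    def fit_noise_gaussian(noise_features, eps=1e-6):
        # noise_features: (N, d) features sampled from noisy renderings.
        mu = noise_features.mean(axis=0)
        cov = np.cov(noise_features, rowvar=False) + eps * np.eye(noise_features.shape[1])
        return mu, np.linalg.inv(cov)

    def mahalanobis_noise_map(features, mu, cov_inv):
        # features: (H, W, d) activations for one RGB rendering.
        H, W, d = features.shape
        diff = features.reshape(-1, d) - mu
        m2 = np.einsum('nd,dk,nk->n', diff, cov_inv, diff)  # squared distances
        return np.sqrt(np.maximum(m2, 0.0)).reshape(H, W)

A large Mahalanobis distance means the activations lie far from the noise distribution, i.e. the pixel is likely clean; normalizing this map yields the clean score map used below.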

Application of Proposed Noise Analysis Method

Each step shows three panels: the RGB rendering, its clean score map, and the next viewpoint selection (Steps 1 and 2).
Application of our noise analysis method. (a) Starting from a random viewpoint rendering; (b) Noise analysis creates a clean score map highlighting clean (bright) and noisy (dark) regions; (c) Based on the noise analysis, the best next viewpoint is selected (visualized as a green box); (d) The rendering from the new viewpoint shows reduced noise. This process can be iterated multiple times.
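
As a rough illustration of step (c), one simple policy is to slide a window over the clean score map and aim the camera at the region with the highest mean clean score. This sketch is an illustrative simplification of the Smart Camera Movement Module, with a hypothetical window size; it is not the paper's exact selection procedure.

    import numpy as np

    def select_next_view_window(clean_score, win=64):
        # clean_score: (H, W) map where bright values mark clean regions.
        # An integral image gives O(1) window sums after O(HW) preprocessing.
        ii = np.pad(clean_score, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
        sums = ii[win:, win:] - ii[:-win, win:] - ii[win:, :-win] + ii[:-win, :-win]
        r, c = np.unravel_index(np.argmax(sums), sums.shape)
        return int(r), int(c)  # top-left corner of the cleanest window

The camera would then be moved so that this window (the green box above) becomes the next view's center, and the analysis is repeated.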

Citation


  @inproceedings{guan2025retri3d,
    title={Retri3D: 3D Neural Graphics Representation Retrieval},
    author={Yushi Guan and Daniel Kwan and Jean Sebastien Dandurand and Xi Yan and Ruofan Liang and Yuxuan Zhang and Nilesh Jain and Nilesh Ahuja and Selvakumar Panneer and Nandita Vijaykumar},
    booktitle={International Conference on Learning Representations},
    year={2025}
  }