Retri3D: 3D Neural Graphics Representation Retrieval

ICLR 2025 Spotlight

Yushi Guan, Daniel Kwan, Jean Sebastien Dandurand, Xi Yan, Ruofan Liang, Yuxuan Zhang, Nilesh Jain, Nilesh Ahuja, Selvakumar Panneer, Nandita Vijaykumar
This work presents Retri3D, a framework that enables accurate and efficient retrieval of 3D scenes represented as NGRs from large data stores using text queries. We introduce a novel Neural Field Artifact Analysis technique, combined with a Smart Camera Movement Module, to select clean views and navigate pre-trained 3DNGRs.

Abstract

Learnable 3D Neural Graphics Representations (3DNGRs) have emerged as promising 3D representations for reconstructing 3D scenes from 2D images. Numerous works, including Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and their variants, have significantly enhanced the quality of these representations. Their ease of construction from 2D images, suitability for online viewing and sharing, and applications in downstream tasks such as game and art design make them a vital 3D representation, with the potential for large numbers of such 3D models to be created. This necessitates large data stores, local or online, to save 3D visual data in these formats. However, no existing framework enables accurate retrieval of stored 3DNGRs. In this work, we propose Retri3D, a framework that enables accurate and efficient retrieval of 3D scenes represented as NGRs from large data stores using text queries. We introduce a novel Neural Field Artifact Analysis technique, combined with a Smart Camera Movement Module, to select clean views and navigate pre-trained 3DNGRs. These techniques enable accurate retrieval by selecting the best viewing directions in the 3D scene for high-quality visual feature embeddings. We demonstrate that Retri3D is compatible with any NGR representation. On the LERF and ScanNet++ datasets, we show significant improvements in retrieval accuracy compared to existing techniques, while being orders of magnitude faster and more storage-efficient.

Overview

Overview of Retri3D: (a) Retri3D enables the storage and retrieval of any neural graphics representation (NGR). (b) We perform noise analysis and viewpoint selection to select high-quality, artifact-free views. (c) Visual embeddings for retrieval are generated using a pre-trained Vision-Language Model (VLM). (d) Both NGRs and visual embeddings are stored in the database. (e) Given a user query, we use the same VLM to generate a corresponding text embedding. (f) The relevant scene is retrieved based on the highest cosine similarity between the text and visual embeddings, returning the NGR and, optionally, a rendered image to the user.
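
To make steps (d)-(f) concrete, below is a minimal sketch of the retrieval step, assuming per-view visual embeddings have already been produced by a CLIP-style VLM's image encoder; the function names, shapes, and data layout are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    def build_index(scene_embeddings):
        # scene_embeddings: dict mapping scene_id -> (K, d) array of VLM
        # image embeddings for that scene's K selected clean views.
        rows, owners = [], []
        for scene_id, emb in scene_embeddings.items():
            emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize
            rows.append(emb)
            owners.extend([scene_id] * emb.shape[0])
        return np.vstack(rows), owners  # (N, d) matrix, row -> scene lookup

    def retrieve(text_embedding, index, owners):
        # On unit-norm vectors, cosine similarity reduces to a dot product.
        q = text_embedding / np.linalg.norm(text_embedding)
        sims = index @ q
        best = int(np.argmax(sims))
        return owners[best], float(sims[best])

Because scoring is a single matrix-vector product over pre-computed embeddings, lookup cost is tiny, which is consistent with the sub-millisecond retrieval times reported in the speed comparison below.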

Results

In this section, we present qualitative retrieval results. We showcase successful retrievals from various LERF scenes using object-label queries, as well as retrievals using LLaVA-generated sentence captions as queries.

Successful Retrievals from LERF Dataset with Object Label Queries

big white crinkly flower
checkerboard pattern
espresso machine
breakfast sandwich (Match 1)
breakfast sandwich (Match 2)
breakfast sandwich (Match 3)

Successful Retrievals with LLaVA-Generated Sentence Captions

Bouquet: query image and top retrieved match
(Success) Query: A kitchen with a white piano, a table with a vase of flowers, and a chair.
Donuts: query image and top retrieved match
(Success) Query: A table with a sandwich and a cup of coffee on it, along with a box of donuts.
Left: the image that generated the LLaVA caption. Right: the top-matching image from the retrieved scene. The blue mask over the entire image indicates that the whole-image embedding was matched.

Videos of Rendered Trajectories

These trajectories are produced by our noise analysis and viewpoint selection, which select high-quality, artifact-free views.

Comparison

Retrieval Accuracy (%) for Splatfacto Model on LERF Dataset

                   Object Label                                         LLaVA Caption
# Img.   Training   Smart Viewpoint   Random Viewpoint     Training   Smart Viewpoint   Random Viewpoint
1        57.89      30.08 (-27.81)    17.29 (-40.60)       45.12      32.97 (-12.15)    15.28 (-29.84)
5        68.42      64.66 (-3.76)     18.80 (-49.62)       55.07      57.17 (+2.10)     23.58 (-31.49)
10       78.95      67.67 (-11.28)    24.81 (-54.14)       67.07      61.23 (-5.84)     28.92 (-38.15)
20       83.46      73.68 (-9.78)     34.59 (-48.87)       72.41      63.25 (-9.16)     33.01 (-39.40)
50       84.21      77.69 (-6.52)     49.62 (-34.59)       71.80      65.82 (-5.98)     35.58 (-36.22)
100      84.95      80.02 (-4.93)     57.89 (-27.06)       73.67      70.48 (-3.19)     43.04 (-30.63)

Scene Coverage Statistics

         Training        Smart                               Random
# Img.   %Grid   w/θ     %Grid   w/θ    %Train   w/θ        %Grid   w/θ    %Train   w/θ
0        9.9     9.9     9.9     9.9    0.0      0.0        9.9     9.9    0.0      0.0
1        13.0    11.4    15.3    12.7   11.3     8.3        16.2    14.3   12.7     9.3
5        14.3    12.2    32.8    16.5   42.7     23.6       35.6    19.6   29.5     14.4
10       17.2    13.9    44.2    24.9   65.0     46.0       53.7    32.4   49.4     24.5
20       20.6    15.8    56.4    29.8   83.3     58.1       61.7    41.5   64.7     39.2
50       41.6    17.5    71.1    46.9   87.5     69.3       78.2    54.7   73.7     65.0
100      46.7    30.7    79.3    53.3   87.9     79.4       83.6    61.0   85.6     78.6

Speed and Storage Comparison with 50 Rendered Images for a LERF Scene

                                                   Ours                     Baselines
Stage                  Action or Storage           Splatfacto   Nerfacto    LangSplat   LERF
NGR Training           Train Time (min)            6.82         8.24        90.5        40.1
                       Model Size (MB)             478.49       176.02      958.67      1282.4
NGR Analysis           Generate Visual Emb. (s)    17.25        19.23       152.5       53.6
Database & Retrieval   Embedding Size              20.58 MB     20.58 MB    18.78 GB    225.4 GB
                       Retrieval Time (s)          5e-5         5e-5        1e-3        17

Retrieval Accuracy (%) for Splatfacto Model with Object Labels on ScanNet++

# Img.   Training   Smart            Random
10       41.62      39.63 (-1.99)    18.39 (-23.23)
20       50.06      47.34 (-2.72)    23.58 (-26.48)
50       58.23      54.39 (-3.84)    27.93 (-30.30)
100      65.03      64.18 (-0.85)    29.04 (-35.99)

Noise Analysis

Noise Analysis Training and Inference Process

Noise analysis training and inference process. (a) We generate noisy images, simply by rendering random viewpoints in pre-trained NGR scenes; some content may be noise-free, but it constitutes only a small portion of each image. (b) Using a pre-trained VLM's vision encoder, we generate pixel-wise activation features. (c) We fit a multivariate Gaussian distribution to represent the noise features. (d) During inference, we compute the Mahalanobis distance of an RGB rendering's activations to the trained noise Gaussian to produce a noise map.
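
A minimal sketch of steps (c) and (d), assuming the pixel-wise activation features have already been extracted from the VLM's vision encoder; the shapes, names, and regularization constant are illustrative assumptions rather than the paper's exact implementation.

    import numpy as np

    def fit_noise_gaussian(noise_features, eps=1e-6):
        # noise_features: (N, d) features sampled from noisy renderings.
        mu = noise_features.mean(axis=0)
        cov = np.cov(noise_features, rowvar=False) + eps * np.eye(noise_features.shape[1])
        return mu, np.linalg.inv(cov)

    def mahalanobis_noise_map(features, mu, cov_inv):
        # features: (H, W, d) activations for one RGB rendering.
        H, W, d = features.shape
        diff = features.reshape(-1, d) - mu
        m2 = np.einsum('nd,dk,nk->n', diff, cov_inv, diff)  # squared distances
        return np.sqrt(np.maximum(m2, 0.0)).reshape(H, W)

A large Mahalanobis distance means the activations lie far from the noise distribution, i.e. the pixel is likely clean; normalizing this map yields the clean score map used below.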

Application of Proposed Noise Analysis Method

Each step shows three panels: the RGB rendering, its clean score map, and the next viewpoint selection (Steps 1 and 2).
Application of our noise analysis method. (a) Starting from a random viewpoint rendering; (b) Noise analysis creates a clean score map highlighting clean (bright) and noisy (dark) regions; (c) Based on the noise analysis, the best next viewpoint is selected (visualized as a green box); (d) The rendering from the new viewpoint shows reduced noise. This process can be iterated multiple times.
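
As a rough illustration of step (c), one simple policy is to slide a window over the clean score map and aim the camera at the region with the highest mean clean score. This sketch is an illustrative simplification of the Smart Camera Movement Module, with a hypothetical window size; it is not the paper's exact selection procedure.

    import numpy as np

    def select_next_view_window(clean_score, win=64):
        # clean_score: (H, W) map where bright values mark clean regions.
        # An integral image gives O(1) window sums after O(HW) preprocessing.
        ii = np.pad(clean_score, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
        sums = ii[win:, win:] - ii[:-win, win:] - ii[win:, :-win] + ii[:-win, :-win]
        r, c = np.unravel_index(np.argmax(sums), sums.shape)
        return int(r), int(c)  # top-left corner of the cleanest window

The camera would then be moved so that this window (the green box above) becomes the next view's center, and the analysis is repeated.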

Citation


  @inproceedings{guan2025retri3d,
    title={Retri3D: 3D Neural Graphics Representation Retrieval},
    author={Yushi Guan and Daniel Kwan and Jean Sebastien Dandurand and Xi Yan and Ruofan Liang and Yuxuan Zhang and Nilesh Jain and Nilesh Ahuja and Selvakumar Panneer and Nandita Vijaykumar},
    booktitle={International Conference on Learning Representations},
    year={2025}
  }