Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Abstract

A scene is its objects, not its primitives

A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images — compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing — removing, translating, or inserting objects by operating on their groups — as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.

Method

Instance-structured token groups

A frozen 3D foundation model extracts multi-view features and pointmaps; a token-group decoder produces anchor tokens for local detail and group tokens that bind anchors into object instances.

Figure 2. Overview of the 3D token-group framework: multi-view features and pointmaps are fused into context tokens, the image-anchor decoder decodes anchor tokens, and the anchor-grouping decoder produces group tokens that define instance-level assignments.

Multi-view feature encoding

A frozen 3D foundation model (VGGT) extracts per-view features and pointmaps from unposed RGB images; patchified RGB and pointmap cues enrich the context tokens.

Token-group decoding

An image-anchor transformer grounds anchor tokens in the multi-view context, then an anchor-grouping transformer updates group tokens that compete for anchor ownership via a softmax assignment — analogous to slot competition.
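The competition for anchor ownership can be illustrated with a small sketch. It assumes the softmax is taken over groups for each anchor (as in slot-style competition) and that a hard owner is read off with an argmax; the function names and the toy logits are hypothetical, not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def assign_anchors(logits):
    """logits[a][g] is a hypothetical affinity of anchor a for group g.

    Normalizing over g makes the groups compete for each anchor:
    raising one group's share lowers every other group's share.
    """
    probs = [softmax(row) for row in logits]
    owners = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, owners

# Toy example: 3 anchors, 3 candidate groups.
logits = [[2.0, 0.1, -1.0],
          [0.0, 3.0, 0.5],
          [1.9, 0.2, 0.0]]
probs, owners = assign_anchors(logits)
print(owners)
```

Here anchors 0 and 2 end up owned by group 0 and anchor 1 by group 1, while group 2 owns nothing, which is how unused groups can remain empty.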

Gaussian reconstruction & grouping

Each anchor decodes into 3D Gaussians; each Gaussian inherits its parent's group assignment, yielding instance-level groups that can be independently rendered and manipulated.
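As a rough illustration of the inheritance step, the sketch below propagates each anchor's group id to the Gaussians decoded from it. The data layout (a list of Gaussians per anchor, string placeholders standing in for Gaussian parameters) is an assumption made for the example.

```python
from collections import defaultdict

def group_gaussians(anchor_owner, gaussians_per_anchor):
    """Collect Gaussians into instance groups via their parent anchor.

    anchor_owner[a] is the group id that owns anchor a.
    gaussians_per_anchor[a] lists the Gaussians decoded from anchor a.
    """
    groups = defaultdict(list)
    for anchor_id, gaussians in enumerate(gaussians_per_anchor):
        group_id = anchor_owner[anchor_id]
        groups[group_id].extend(gaussians)  # children inherit the parent's group
    return dict(groups)

# Toy example: 3 anchors, 2 Gaussians each (placeholders, not real parameters).
anchor_owner = [0, 1, 0]
gaussians_per_anchor = [["a0_g0", "a0_g1"],
                        ["a1_g0", "a1_g1"],
                        ["a2_g0", "a2_g1"]]
groups = group_gaussians(anchor_owner, gaussians_per_anchor)
print(groups)
```

Rendering a single instance then amounts to passing only `groups[group_id]` to the rasterizer.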

Joint 2D supervision

The model is trained end-to-end with an RGB rendering loss plus a 2D instance-mask grouping loss (Hungarian-matched BCE + Dice). No 3D annotations are used; instance structure emerges entirely from 2D supervision.
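A minimal sketch of a matched BCE + Dice mask loss follows. To stay dependency-free it finds the optimal one-to-one matching by brute force over permutations, which gives the same optimum as the Hungarian algorithm on small inputs but is not the Hungarian algorithm itself; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be used. Equal weighting of the two terms, the flat-list mask layout, and all function names are assumptions.

```python
import math
from itertools import permutations

EPS = 1e-6

def bce(pred, target):
    """Mean binary cross-entropy between probabilities and {0, 1} labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def dice(pred, target):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + EPS) / (sum(pred) + sum(target) + EPS)

def matched_loss(pred_masks, gt_masks):
    """Match each ground-truth mask to a distinct predicted mask at minimum cost.

    Assumes len(pred_masks) >= len(gt_masks) >= 1. Returns the mean matched
    cost and, for each ground-truth index, the matched prediction index.
    """
    cost = [[bce(p, g) + dice(p, g) for g in gt_masks] for p in pred_masks]
    best_cost, best_match = float("inf"), None
    for perm in permutations(range(len(pred_masks)), len(gt_masks)):
        c = sum(cost[p][g] for g, p in enumerate(perm))
        if c < best_cost:
            best_cost, best_match = c, list(perm)
    return best_cost / len(gt_masks), best_match

# Toy example: 3 predicted group masks, 2 ground-truth instances, 4 pixels.
pred_masks = [[0.1, 0.2, 0.9, 0.8],
              [0.9, 0.8, 0.1, 0.2],
              [0.5, 0.5, 0.5, 0.5]]
gt_masks = [[1, 1, 0, 0],
            [0, 0, 1, 1]]
loss, match = matched_loss(pred_masks, gt_masks)
print(match, loss)
```

Because the matching is recomputed per sample, the loss is invariant to the order in which group tokens happen to emit instances.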

Decomposed feature distillation

2D foundation features are lifted into a shared group-level embedding plus low-dimensional anchor-level residuals — compact semantics that support entity-level open-vocabulary retrieval.
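One way to picture the decomposition is sketched below, under the assumption that an anchor's full feature is recovered as its group embedding plus a linearly up-projected residual. The up-projection matrix, the dimensions, and the storage count are illustrative choices, not figures from the paper.

```python
def anchor_feature(group_emb, residual, up_proj):
    """Reconstruct a D-dim anchor feature from a shared group embedding.

    group_emb: D floats shared by every anchor in the group.
    residual:  r floats specific to this anchor, with r much smaller than D.
    up_proj:   D x r matrix (hypothetical) lifting the residual to D dims.
    """
    lifted = [sum(w * x for w, x in zip(row, residual)) for row in up_proj]
    return [g + l for g, l in zip(group_emb, lifted)]

def storage(num_groups, num_anchors, dim, res_dim):
    """Scalars stored by the decomposed scheme vs. one dense feature per anchor."""
    decomposed = num_groups * dim + num_anchors * res_dim + dim * res_dim
    dense = num_anchors * dim
    return decomposed, dense

# Toy example with D = 3 and r = 2.
feature = anchor_feature([1.0, 0.0, 2.0], [0.5, -0.5],
                         [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(feature)

# Illustrative sizes only: 10 groups, 1000 anchors, D = 64, r = 4.
print(storage(10, 1000, 64, 4))
```

With these illustrative sizes the decomposed form stores 4,896 scalars against 64,000 for dense per-anchor features, which conveys why sharing the high-dimensional part across a group keeps the representation compact.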

Applications

One representation, four interfaces

Because instances are first-class units, downstream tasks reduce to operating on token groups — no post-hoc grouping or per-scene optimization.

Novel-View Synthesis

Anchor tokens decode into 3D Gaussians, rendering high-quality novel views in a single forward pass.

3D Instance Segmentation

Group tokens yield class-agnostic instance masks that surpass per-scene optimization baselines on AP.

Token Manipulation

Select a token group and apply an elementary op — render, remove, move, or insert an instance — with no masks or optimization.

Open-Vocab Retrieval

Lifted group features let a text query select complete, coherent instances — not a scattered subset of primitives.
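Retrieval over group features can be sketched as a cosine-similarity ranking with one comparison per instance. The hand-written vectors below are placeholders: in a real system the query would come from a text encoder aligned with the lifted features, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_emb, group_feats, top_k=1):
    """Rank instance groups by similarity to the query embedding.

    The loop runs once per group, so the cost grows with the number of
    instances rather than with the number of Gaussians.
    """
    ranked = sorted(group_feats,
                    key=lambda gid: cosine(query_emb, group_feats[gid]),
                    reverse=True)
    return ranked[:top_k]

# Placeholder features for three instance groups.
group_feats = {0: [0.9, 0.1, 0.0],
               1: [0.1, 0.9, 0.1],
               2: [0.0, 0.2, 0.9]}
query = [1.0, 0.2, 0.0]  # stands in for an encoded text prompt
print(retrieve(query, group_feats, top_k=2))
```

Selecting group 0 returns every Gaussian in that group at once, which is what makes the retrieved instance complete rather than a scattered subset of primitives.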

Figure 6. Instance-level token manipulation. Each token group represents a single instance, so edits — including object insertion — stay strictly localized to the targeted instance, leaving neighbors and background untouched.
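The elementary operations can be sketched on a scene stored as a mapping from group id to a list of Gaussians. Only the Gaussian means are modeled here; real Gaussians also carry covariance, opacity, and color, and a rotation would need to update covariances too. The dictionary layout and function names are assumptions for illustration.

```python
import copy

def remove_instance(scene, group_id):
    """Drop one group; all other groups are left untouched."""
    return {gid: g for gid, g in scene.items() if gid != group_id}

def translate_instance(scene, group_id, offset):
    """Shift the means of one group's Gaussians by a 3D offset."""
    out = copy.deepcopy(scene)
    for gaussian in out[group_id]:
        gaussian["mean"] = tuple(m + o for m, o in zip(gaussian["mean"], offset))
    return out

def insert_instance(scene, gaussians):
    """Add a new group under the next free integer id."""
    out = dict(scene)
    new_id = max(out, default=-1) + 1
    out[new_id] = copy.deepcopy(gaussians)
    return out, new_id

# Toy scene with two instance groups.
scene = {0: [{"mean": (0.0, 0.0, 0.0)}],
         1: [{"mean": (1.0, 0.0, 0.0)}, {"mean": (1.0, 1.0, 0.0)}]}

moved = translate_instance(scene, 1, (0.0, 0.0, 0.5))
removed = remove_instance(scene, 0)
inserted, new_id = insert_instance(scene, [{"mean": (2.0, 2.0, 2.0)}])
print(moved[1], list(removed), new_id)
```

Each operation touches exactly one key of the mapping, mirroring the claim that edits stay localized to the targeted instance.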

Interactive

Toggle the tokens, compose the scene

The actual 3D Gaussians our feed-forward model predicts for a ScanNet test scene, rendered as real Gaussian splats. Drag to orbit, scroll to zoom — and switch a token off to make exactly that instance disappear.

Scene: ScanNet scene0694_01.

Instance token groups

11 token groups; each instance can be toggled independently.

32k Gaussians from a single forward pass, grouped into instance token groups — the same interface used for removal, transformation, and insertion. (No WebGL? Falls back to a 2.5D parallax preview.)

Results

Compact units, strong numbers

All methods are evaluated on ScanNet across reconstruction, feature lifting, and class-agnostic instance segmentation.

Table 1 · Reconstruction & feature lifting (2 context views)

Method	Src mIoU↑	Tgt mIoU↑	PSNR↑	SSIM↑	LPIPS↓	#Sem. units↓	Feat. size↓
LSM	0.527	0.512	24.24	0.821	0.222	131,072	67.1 M
Uni3R	0.540	0.558	25.53	0.873	0.138	131,072	8.4 M
C3G	0.542	0.513	23.89	0.770	0.285	2,048	1.0 M
Ours	0.661	0.657	25.28	0.771	0.238	<100	59.4 K

Best feature-lifting mIoU on both source and target views, while storing semantics in <100 instance tokens instead of 131K per-pixel Gaussians — orders of magnitude more compact.
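The compactness claim can be checked with simple arithmetic on the Table 1 values. The snippet treats "<100" as an upper bound of 100 units, so the unit ratios are lower bounds; the feature-size ratios hold whatever unit the table uses, as long as it is the same across rows.

```python
# Values copied from Table 1.
sem_units = {"LSM": 131_072, "Uni3R": 131_072, "C3G": 2_048}
feat_size = {"LSM": 67.1e6, "Uni3R": 8.4e6, "C3G": 1.0e6}

OURS_UNITS_UPPER = 100   # table reports "<100", so this is an upper bound
OURS_FEAT_SIZE = 59.4e3

unit_ratio = {m: n / OURS_UNITS_UPPER for m, n in sem_units.items()}
size_ratio = {m: s / OURS_FEAT_SIZE for m, s in feat_size.items()}

for method in sem_units:
    print(f"{method}: >{unit_ratio[method]:.0f}x fewer units, "
          f"{size_ratio[method]:.0f}x smaller features")
```

Against the per-pixel baselines (LSM, Uni3R) the reduction is more than 1,300x in semantic units and roughly 1,130x and 141x in feature size; against C3G it is about 20x in units and 17x in feature size.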

Figure 3. Qualitative reconstruction with 2 context views — our token groups recover geometry and appearance competitively with per-pixel Gaussian baselines.

Figure 4. Open-vocabulary novel-view segmentation with lifted LSeg features — entity-level token features yield clean, coherent semantic maps.

Table 2 · Class-agnostic instance segmentation (8 context views)

Type	Method	AP↑	AP₅₀↑	AP₂₅↑	PSNR↑	SSIM↑	LPIPS↓
Per-scene opt.	Gaussian Grouping	0.139	0.288	0.440	23.20	0.715	0.325
Per-scene opt.	ObjectGS	0.178	0.337	0.489	24.34	0.733	0.310
FF + opt.	IGGT + LUDVIG	0.122	0.265	0.442	22.75	0.712	0.323
Feed-forward	Ours	0.235	0.438	0.564	22.41	0.709	0.355

Best AP across all thresholds in a fully feed-forward manner, while competing per-scene methods require costly per-scene optimization.

Figure 5. Class-agnostic instance segmentation — group tokens partition the scene into coherent instances feed-forward, without per-scene optimization.

