Scenes as Objects, Not Primitives:
Instance-Structured 3D Tokenization from Unposed Views

A feed-forward framework that decomposes a scene into compact, object-centric 3D token groups — making instances a native interface of the 3D representation.

Mijin Yoo1*   In Cho1*   Subin Jeon2   Jiwoo Lee1   Eunbyung Park1   Seon Joo Kim1†
1Yonsei University   2Seoul National University
*Equal contribution   †Corresponding author
Teaser

Figure 1. Our model maps unposed multi-view images to instance-structured 3D token groups, which make instances a native interface of the representation — supporting novel-view synthesis, 3D instance segmentation, instance-level manipulation, and open-vocabulary retrieval.

Abstract

A scene is its objects, not its primitives

A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images — compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing — removing, translating, or inserting objects by operating on their groups — as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.

Method

Instance-structured token groups

A frozen 3D foundation model extracts multi-view features and pointmaps; a token-group decoder produces anchor tokens for local detail and group tokens that bind anchors into object instances.

Method overview

Figure 2. Overview of the 3D token-group framework: multi-view features and pointmaps are fused into context tokens, the image-anchor decoder decodes anchor tokens, and the anchor-grouping decoder produces group tokens that define instance-level assignments.

1. Multi-view feature encoding

A frozen 3D foundation model (VGGT) extracts per-view features and pointmaps from unposed RGB images; patchified RGB and pointmap cues enrich the context tokens.

2. Token-group decoding

An image-anchor transformer grounds anchor tokens in the multi-view context, then an anchor-grouping transformer updates group tokens that compete for anchor ownership via a softmax assignment — analogous to slot competition.
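The competition for anchor ownership can be sketched as a softmax taken over the group axis, so that each anchor splits one unit of ownership among the groups. This is a minimal illustration, not the model's decoder: the dot-product logits, the `temperature` parameter, and all function names are assumptions; the source only states that group tokens compete via a softmax assignment.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def assign_anchors(anchor_tokens, group_tokens, temperature=1.0):
    """Return a [num_anchors][num_groups] soft assignment matrix.

    Assumption: logits are scaled dot products between anchor and group
    tokens. The softmax runs over groups, so groups compete per anchor.
    """
    assignment = []
    for a in anchor_tokens:
        logits = [sum(ai * gi for ai, gi in zip(a, g)) / temperature
                  for g in group_tokens]
        assignment.append(softmax(logits))
    return assignment

# Toy example: 3 anchors, 2 groups, 2-dimensional tokens.
anchors = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.1]]
groups = [[1.0, 0.0], [0.0, 1.0]]
soft = assign_anchors(anchors, groups)

# A hard instance label per anchor is the argmax over groups.
hard = [row.index(max(row)) for row in soft]
```

Because every row sums to one, raising one group's score for an anchor necessarily lowers the others, which is the sense in which the groups compete.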

3. Gaussian reconstruction & grouping

Each anchor decodes into 3D Gaussians; each Gaussian inherits its parent's group assignment, yielding instance-level groups that can be independently rendered and manipulated.
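The inheritance rule amounts to copying each anchor's group id onto the Gaussians it produces. The sketch below assumes a known number of Gaussians per anchor and represents a Gaussian as a plain dictionary; both choices, and the function names, are hypothetical.

```python
def inherit_groups(anchor_group_ids, gaussians_per_anchor):
    """Expand per-anchor group ids into per-Gaussian group ids."""
    gaussian_group_ids = []
    for group_id, count in zip(anchor_group_ids, gaussians_per_anchor):
        gaussian_group_ids.extend([group_id] * count)
    return gaussian_group_ids

def select_group(gaussians, gaussian_group_ids, group_id):
    """Return only the Gaussians owned by one instance group."""
    return [g for g, gid in zip(gaussians, gaussian_group_ids)
            if gid == group_id]

# Toy example: 3 anchors, each decoding into 2 Gaussians.
anchor_group_ids = [0, 1, 0]
gaussians_per_anchor = [2, 2, 2]
gaussian_group_ids = inherit_groups(anchor_group_ids, gaussians_per_anchor)

gaussians = [{"center": (float(i), 0.0, 0.0)} for i in range(6)]
instance_0 = select_group(gaussians, gaussian_group_ids, 0)
```

Rendering `instance_0` alone would show a single instance, since no Gaussian from another group is included.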

4. Joint 2D supervision

Trained end-to-end with RGB rendering loss + 2D instance-mask grouping loss (Hungarian-matched BCE + Dice). No 3D annotations; instance structure emerges entirely from 2D supervision.
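A matched mask loss of this kind can be sketched as follows. For self-containment the example finds the minimum-cost matching by brute force over permutations, which reaches the same optimum as the Hungarian algorithm on small inputs; in practice a solver such as `scipy.optimize.linear_sum_assignment` would normally be used. The equal weighting of BCE and Dice, the flattened-list mask layout, and all names are assumptions.

```python
import itertools
import math

EPS = 1e-6

def bce(pred, target):
    """Mean binary cross-entropy between a soft mask and a binary mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def dice(pred, target):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + EPS) / (sum(pred) + sum(target) + EPS)

def matched_mask_loss(pred_masks, gt_masks):
    """Match each GT mask to a distinct predicted mask at minimum cost.

    Returns (mean matched loss, match), where match[t] is the index of
    the prediction paired with GT mask t.
    Assumes len(pred_masks) >= len(gt_masks).
    """
    cost = [[bce(p, t) + dice(p, t) for t in gt_masks] for p in pred_masks]
    best_total, best_match = math.inf, None
    for perm in itertools.permutations(range(len(pred_masks)), len(gt_masks)):
        total = sum(cost[p][t] for t, p in enumerate(perm))
        if total < best_total:
            best_total, best_match = total, perm
    return best_total / len(gt_masks), best_match

# Toy example: 3 rendered group masks, 2 ground-truth 2D instance masks.
pred = [[0.1, 0.2, 0.9, 0.8],
        [0.9, 0.7, 0.1, 0.2],
        [0.5, 0.5, 0.5, 0.5]]
gt = [[1, 1, 0, 0],
      [0, 0, 1, 1]]
loss, match = matched_mask_loss(pred, gt)
```

The matching step is what lets unordered group predictions be supervised by unordered 2D instance masks: the loss is computed only after the best pairing is found.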

5. Decomposed feature distillation

2D foundation features are lifted into a shared group-level embedding plus low-dimensional anchor-level residuals — compact semantics that support entity-level open-vocabulary retrieval.
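One way to picture the decomposition is to rebuild an anchor's feature as its group's shared embedding plus a low-dimensional residual mapped back to feature space. The linear up-projection used here is an assumption (the source does not say how residuals are combined with the group embedding), and the sizes in the storage example are toy values, not figures from the paper.

```python
def anchor_feature(group_embedding, residual, up_projection):
    """Rebuild a D-dim anchor feature from a shared group embedding
    plus an r-dim residual.

    up_projection is a hypothetical [D][r] matrix lifting the residual
    back to the feature dimension.
    """
    return [g + sum(w * r for w, r in zip(row, residual))
            for g, row in zip(group_embedding, up_projection)]

def stored_floats(num_groups, num_anchors, feat_dim, residual_dim):
    """Compare storage: decomposed (groups + residuals) vs. dense."""
    decomposed = num_groups * feat_dim + num_anchors * residual_dim
    dense = num_anchors * feat_dim
    return decomposed, dense

# Toy example: D = 4 feature dims, r = 2 residual dims.
group_embedding = [1.0, 0.0, 0.5, 0.0]
residual = [0.1, -0.2]
up_projection = [[1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.0],
                 [1.0, 1.0]]
feature = anchor_feature(group_embedding, residual, up_projection)

# Toy sizes only: 10 groups, 1000 anchors, 512-dim features, 8-dim residuals.
decomposed, dense = stored_floats(10, 1000, 512, 8)
```

With these toy sizes the decomposed form stores 13,120 floats against 512,000 for a dense per-anchor feature, which illustrates why keeping semantics at the group level is compact.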

Applications

One representation, four interfaces

Because instances are first-class units, downstream tasks reduce to operating on token groups — no post-hoc grouping or per-scene optimization.

Novel-View Synthesis

Anchor tokens decode into 3D Gaussians, rendering high-quality novel views in a single forward pass.

3D Instance Segmentation

Group tokens yield class-agnostic instance masks that surpass per-scene optimization baselines on AP.

Token Manipulation

Select a token group and apply an elementary op — render, remove, move, or insert an instance — with no masks or optimization.

Open-Vocab Retrieval

Lifted group features let a text query select complete, coherent instances — not a scattered subset of primitives.
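Entity-level retrieval can be sketched as one similarity comparison per instance rather than per primitive. The vectors below are hypothetical stand-ins for a text-encoder output and lifted group features, and cosine similarity is an assumed scoring rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_instance(text_embedding, group_features):
    """Return (group_id, score) of the best-matching instance.

    The loop runs once per group, so cost grows with the number of
    instances, not with the number of Gaussians.
    """
    scores = {gid: cosine(text_embedding, feat)
              for gid, feat in group_features.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example: 3 instance groups with 3-dimensional features.
group_features = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.2],
    2: [0.0, 0.2, 0.9],
}
query = [0.0, 1.0, 0.1]  # placeholder for an encoded text prompt
best_group, best_score = retrieve_instance(query, group_features)
```

Once a group id is selected, every Gaussian carrying that id is returned together, which is why the result is a complete instance rather than a scattered subset.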

Token manipulation

Figure 6. Instance-level token manipulation. Each token group represents a single instance, so edits — including object insertion — stay strictly localized to the targeted instance, leaving neighbors and background untouched.
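The elementary operations can be sketched on a scene stored as a mapping from group id to that group's Gaussians. This is an illustrative data layout, not the model's internal one: Gaussians are reduced to their centers, and the function names are hypothetical. A full implementation would also carry scale, rotation, opacity, and color, and a rotation edit would have to update orientation as well as position.

```python
def remove_instance(scene, group_id):
    """Drop one instance; every other group is passed through as-is."""
    return {gid: gs for gid, gs in scene.items() if gid != group_id}

def translate_instance(scene, group_id, offset):
    """Shift the centers of one instance by a 3D offset."""
    moved = [dict(g, center=tuple(c + o for c, o in zip(g["center"], offset)))
             for g in scene[group_id]]
    return {**scene, group_id: moved}

def insert_instance(scene, gaussians):
    """Add a new instance under a fresh group id; returns (scene, id)."""
    new_id = max(scene, default=-1) + 1
    return {**scene, new_id: list(gaussians)}, new_id

# Toy scene: group 0 has one Gaussian, group 1 has two.
scene = {
    0: [{"center": (0.0, 0.0, 0.0)}],
    1: [{"center": (1.0, 0.0, 0.0)}, {"center": (1.0, 1.0, 0.0)}],
}
moved = translate_instance(scene, 1, (0.0, 0.0, 2.0))
removed = remove_instance(scene, 0)
inserted, new_id = insert_instance(scene, [{"center": (5.0, 5.0, 5.0)}])
```

Each function touches only the targeted group and returns a new mapping, mirroring the localized behavior described in the caption: neighbors and background are never modified.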

Interactive

Toggle the tokens, compose the scene

The 3D Gaussians our feed-forward model predicts for a ScanNet test scene (scene0694_01), rendered as Gaussian splats. Switching a token off makes exactly that instance disappear.

Instance token groups

11 token groups, each toggling one instance.

32k Gaussians from a single forward pass, grouped into instance token groups: the same interface used for removal, transformation, and insertion.

Results

Compact units, strong numbers

Evaluated on ScanNet across reconstruction, feature lifting, and class-agnostic instance segmentation. Bold = best.

Table 1 · Reconstruction & feature lifting (2 context views)
Method   Src mIoU↑   Tgt mIoU↑   PSNR↑   SSIM↑   LPIPS↓   #Sem. units↓   Feat. size↓
LSM      0.527       0.512       24.24   0.821   0.222    131,072        67.1 M
Uni3R    0.540       0.558       25.53   0.873   0.138    131,072        8.4 M
C3G      0.542       0.513       23.89   0.770   0.285    2,048          1.0 M
Ours     0.661       0.657       25.28   0.771   0.238    <100           59.4 K

Best feature-lifting mIoU on both source and target views, while storing semantics in <100 instance tokens instead of 131K per-pixel Gaussians — orders of magnitude more compact.
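The compactness claim can be checked with simple arithmetic on the Table 1 values. The snippet assumes that "K" and "M" denote decimal thousands and millions, and it uses 100 as an upper bound for our number of semantic units, since the table reports only "<100".

```python
def parse_size(text):
    """Parse a size such as '67.1 M' or '59.4 K' into a float count.

    Assumption: K = 1e3 and M = 1e6 (decimal units).
    """
    value, unit = text.split()
    return float(value) * {"K": 1e3, "M": 1e6}[unit]

# Feature sizes copied from Table 1.
feature_size = {"LSM": "67.1 M", "Uni3R": "8.4 M", "C3G": "1.0 M",
                "Ours": "59.4 K"}

ours = parse_size(feature_size["Ours"])
ratios = {name: parse_size(size) / ours
          for name, size in feature_size.items() if name != "Ours"}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x larger feature storage than ours")

# Semantic units: 131,072 per-pixel Gaussians vs. fewer than 100 groups.
unit_ratio = 131_072 / 100  # a lower bound on the true ratio
```

Under these assumptions the feature storage of LSM, Uni3R, and C3G is about 1,130, 141, and 17 times larger than ours, and the per-pixel baselines use more than 1,300 times as many semantic units.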

Qualitative reconstruction

Figure 3. Qualitative reconstruction with 2 context views — our token groups recover geometry and appearance competitively with per-pixel Gaussian baselines.

Open-vocabulary segmentation

Figure 4. Open-vocabulary novel-view segmentation with lifted LSeg features — entity-level token features yield clean, coherent semantic maps.

Table 2 · Class-agnostic instance segmentation (8 context views)
Type             Method              AP↑     AP₅₀↑   AP₂₅↑   PSNR↑   SSIM↑   LPIPS↓
Per-scene opt.   Gaussian Grouping   0.139   0.288   0.440   23.20   0.715   0.325
Per-scene opt.   ObjectGS            0.178   0.337   0.489   24.34   0.733   0.310
FF + opt.        IGGT + LUDVIG       0.122   0.265   0.442   22.75   0.712   0.323
Feed-forward     Ours                0.235   0.438   0.564   22.41   0.709   0.355

Best AP at every threshold from a single feed-forward pass, whereas every competing method requires costly per-scene optimization.

Class-agnostic instance segmentation

Figure 5. Class-agnostic instance segmentation — group tokens partition the scene into coherent instances feed-forward, without per-scene optimization.

Citation

BibTeX

@inproceedings{yoo2026scenes,
  title     = {Scenes as Objects, Not Primitives: Instance-Structured
               3D Tokenization from Unposed Views},
  author    = {Yoo, Mijin and Cho, In and Jeon, Subin and Lee, Jiwoo
               and Park, Eunbyung and Kim, Seon Joo},
  booktitle = {Preprint},
  year      = {2026}
}