SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions
Abstract
The whole is greater than the sum of its parts—even in 3D-text contrastive learning. We introduce SCENEFORGE, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SCENEFORGE leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SCENEFORGE effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. Extensive experiments demonstrate that SCENEFORGE delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjectNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SCENEFORGE's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SCENEFORGE improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
The Challenge: Data Scarcity in 3D-Text Learning
Scaling contrastive learning to 3D is challenging due to the limited availability of large-scale 3D-text datasets, especially when compared to the vast resources in the 2D domain. This data scarcity makes it difficult for models to learn robust and generalizable representations of the complex 3D world. Our work introduces a new strategy to synthetically enrich data complexity and diversity, pushing the boundaries of what's possible in 3D multimodal learning.
Our Approach: Structured Scene Composition
We introduce SCENEFORGE, a framework that synthetically creates complex multi-object 3D scenes from individual point clouds. By composing objects using explicit spatial relations (like 'next to' or 'over') and generating coherent, LLM-refined captions, we virtually expand the training data. This model-agnostic pipeline can be plugged into any 3D-text contrastive learning model to enhance its training with richer, more diverse samples, teaching it to understand not just objects, but the relationships between them.
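The composition step described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the relation names, the `RELATION_OFFSETS` placement table, and the caption template are all assumptions made for the example, standing in for SCENEFORGE's spatial-relation vocabulary and LLM caption refinement.

```python
import numpy as np

# Hypothetical placement offsets for a few spatial relations,
# in normalized scene coordinates (illustrative values only).
RELATION_OFFSETS = {
    "next to": np.array([1.2, 0.0, 0.0]),
    "over":    np.array([0.0, 1.2, 0.0]),
    "behind":  np.array([0.0, 0.0, 1.2]),
}

def normalize(points: np.ndarray) -> np.ndarray:
    """Center a point cloud and scale it into the unit sphere."""
    points = points - points.mean(axis=0)
    return points / np.linalg.norm(points, axis=1).max()

def compose_scene(anchor: np.ndarray, other: np.ndarray, relation: str):
    """Place `other` relative to `anchor` and merge the two clouds.

    Returns the composed scene plus a raw caption template; in the real
    pipeline such a template would be refined into a fluent multi-object
    description by an LLM.
    """
    anchor, other = normalize(anchor), normalize(other)
    scene = np.concatenate([anchor, other + RELATION_OFFSETS[relation]], axis=0)
    caption = f"an object {relation} another object"
    return scene, caption

# Toy usage with random stand-in point clouds of 1024 points each.
chair = np.random.randn(1024, 3)
table = np.random.randn(1024, 3)
scene, caption = compose_scene(table, chair, "next to")
print(scene.shape, caption)
```

The composed scene-caption pair can then be fed to any 3D-text contrastive objective alongside single-object pairs, which is what makes the augmentation model-agnostic.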
Consistent Gains and Enhanced Reasoning
Our method delivers substantial performance gains across multiple tasks, including zero-shot classification, few-shot segmentation, and 3D question answering. The SF-Uni3D variant even outperforms costly ensemble methods. Qualitatively, models trained with SCENEFORGE show significantly improved spatial and compositional reasoning, successfully repositioning objects to match new textual descriptions where baseline models fail, demonstrating a deeper understanding of 3D space and language.
BibTeX
@article{sbrolli2025sceneforgeenhancing3dtextalignment,
  title={SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions},
  author={Cristian Sbrolli and Matteo Matteucci},
  year={2025},
  eprint={2509.15693},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.15693},
}