Online Language Splatting
1University of Delaware, 2Bosch Research North America Center for AI
Video Demonstration

Online, Dense Features
Our approach delivers dense, sharp language-aligned features for 3D representations in real time.
Faster Processing
Our plug-in modules operate at 45 FPS
Frames are processed roughly 216x faster: 0.8 seconds per frame versus LangSplat's 2.88 minutes (172.8 seconds) per frame.
Ours "Rug"
LangSplat "Rug"
Ours "Sofa"
LangSplat "Sofa"
Ours "Table"
LangSplat "Table"
Abstract
To interact seamlessly with humans and 3D environments, AI agents must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior research has made significant progress by integrating language features into geometrically detailed 3D scene representations based on 3D Gaussian Splatting (GS), these approaches depend on computationally intensive offline preprocessing of language features for each input image, limiting their adaptability to novel environments.
In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing computation speed, memory usage, rendering quality, and open-vocabulary capability.
Pipeline

Our pipeline integrates 3D Gaussian Splatting with SLAM, using 3D Gaussians as the sole mapping elements. Left: During training, raw images are processed through a High-Resolution CLIP embedding module, which generates high-resolution language features in real-time. These features are compressed via a two-stage CLIP compression module into low-dimensional maps for efficient optimization while preserving open-vocabulary capabilities. RGB and language parameters are optimized separately through disentangled optimization within the 3D Gaussian map to accommodate distinct requirements. Right: At inference, the rendered low-dimensional language map undergoes a two-stage decoding process to reconstruct the full CLIP feature map, enabling open-vocabulary queries to locate target objects, such as “stool”.
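As a rough illustration of the inference path described above, and not the authors' implementation: the rendered low-dimensional language map can be decoded back to CLIP space and queried with a text embedding via per-pixel cosine similarity. All dimensions, the linear decoder stages, and the random stand-in features below are hypothetical; the real system uses learned decoders and actual CLIP text embeddings.

```python
import numpy as np

# Hypothetical dimensions: rendered map is H x W x D_LOW; CLIP space is D_CLIP.
H, W, D_LOW, D_CLIP = 60, 80, 16, 512
rng = np.random.default_rng(0)

# Two-stage decoder, sketched here as two linear maps with a ReLU in between
# (the real decoders are learned, this is only a shape-correct stand-in).
W1 = rng.standard_normal((D_LOW, 64)) * 0.1    # stage 1: low-dim -> intermediate
W2 = rng.standard_normal((64, D_CLIP)) * 0.1   # stage 2: intermediate -> CLIP space

def decode(low_dim_map):
    """Reconstruct a full CLIP feature map from the rendered low-dim map."""
    h = np.maximum(low_dim_map @ W1, 0.0)
    return h @ W2

def query(clip_map, text_embedding):
    """Per-pixel cosine similarity against a text query embedding."""
    feats = clip_map / (np.linalg.norm(clip_map, axis=-1, keepdims=True) + 1e-8)
    text = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return feats @ text                          # H x W relevance map

low_dim_map = rng.standard_normal((H, W, D_LOW))  # stands in for the rendered map
text_emb = rng.standard_normal(D_CLIP)            # stands in for CLIP("stool")

relevance = query(decode(low_dim_map), text_emb)
mask = relevance > relevance.mean() + 2 * relevance.std()  # crude localization
```

Thresholding the relevance map is one simple way to localize the queried object; the actual query procedure may differ.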

Disentangled optimization preserves image quality without compromising the quality of the language features.
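The idea behind disentangled optimization can be sketched as maintaining independent update rules for the two parameter groups, so a gradient step on the language features never perturbs the RGB parameters. Everything below (group names, shapes, learning rates, plain gradient descent) is illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-Gaussian parameter groups (shapes and names are illustrative).
params = {
    "rgb":      rng.standard_normal((1000, 3)),   # appearance parameters
    "language": rng.standard_normal((1000, 16)),  # low-dim language features
}
# Independent learning rates: the two objectives have distinct requirements.
lr = {"rgb": 2.5e-3, "language": 1e-2}

def step(group, grads):
    """Apply a gradient step to one parameter group only; the other stays untouched."""
    params[group] -= lr[group] * grads

rgb_before = params["rgb"].copy()
lang_before = params["language"].copy()
step("language", np.ones_like(params["language"]))  # a language-loss step
```

After the language step, `params["rgb"]` is bitwise unchanged, which is the property the caption above refers to.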
Indoor Plant Language Query
3D Evaluation Comparison
Comprehensive evaluation across top-10 categories of Replica scenes
The table below summarizes the results, demonstrating the effectiveness of our approach in achieving high-quality language mapping.

Citation
If you find our work useful in your research, please consider citing our paper:
@inproceedings{onlinelang,
  title     = {Online Language Splatting},
  author    = {Saimouli Katragadda and Cho-Ying Wu and Yuliang Guo and Xinyu Huang and Guoquan Huang and Liu Ren},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}