Online Language Splatting
1University of Delaware, 2Bosch Research North America Center for AI
* Corresponding author
Video

Online, Dense Features
Our approach delivers dense, sharp language-aligned features for 3D representations in real time.
Faster Processing
Processes frames roughly 215x faster: 0.8 seconds/frame compared with LangSplat's 2.88 minutes/frame.
Ours "Rug"
LangSplat "Rug"
Ours "Sofa"
LangSplat "Sofa"
Ours "Table"
LangSplat "Table"
Abstract
To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations.
While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting, these approaches rely on computationally
intensive offline preprocessing of language features for each input image, limiting adaptability to new environments.
In this work, we introduce Online Language Splatting, the first framework to achieve near real-time, open-vocabulary
language mapping within a 3D Gaussian Splatting SLAM system without requiring pre-generated language features.
The key challenge we address is the efficiency of fusing high-dimensional language features into 3D representations
while maintaining a balance between speed, memory usage, rendering quality, and open-vocabulary capability.
Pipeline

Our pipeline integrates 3D Gaussian Splatting with SLAM, using 3D Gaussians as the sole mapping elements. Left: During training, raw images are processed by a High-Resolution CLIP embedding module, which generates high-resolution language features in real time. A two-stage CLIP compression module then compresses these features into low-dimensional maps, which keeps optimization efficient while preserving open-vocabulary capability. Within the 3D Gaussian map, RGB and language parameters are optimized separately (disentangled optimization) because the two have distinct requirements. Right: At inference, the rendered low-dimensional language map passes through a two-stage decoding process that reconstructs the full CLIP feature map, enabling open-vocabulary queries that locate target objects such as “stool”.
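The inference path in the caption above (two-stage decoding followed by an open-vocabulary query) can be illustrated with a minimal NumPy sketch. Everything named here is an assumption for illustration: the decoders are plain linear maps standing in for the learned networks, the function names `decode_two_stage`, `relevancy_map`, and `locate` are hypothetical, and the feature dimensions are arbitrary. It shows only the general idea of expanding a low-dimensional map and scoring each pixel against a text embedding by cosine similarity, not the released implementation.

```python
import numpy as np


def decode_two_stage(low_dim_map, stage2_weights, stage1_weights):
    """Expand an (H, W, d_low) map to (H, W, d_clip) in two steps.

    Assumption: each decoder is a single linear map. The real decoders
    are learned networks; this only mirrors the two-step structure.
    """
    h, w, _ = low_dim_map.shape
    flat = low_dim_map.reshape(h * w, -1)
    mid = flat @ stage2_weights   # (H*W, d_mid): undo the second compression
    full = mid @ stage1_weights   # (H*W, d_clip): undo the first compression
    return full.reshape(h, w, -1)


def relevancy_map(feature_map, text_embedding, eps=1e-8):
    """Per-pixel cosine similarity between features and a text embedding."""
    norms = np.linalg.norm(feature_map, axis=-1, keepdims=True)
    feats = feature_map / (norms + eps)
    text = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return feats @ text           # (H, W)


def locate(relevancy):
    """Return the (row, col) of the highest-scoring pixel."""
    row, col = np.unravel_index(np.argmax(relevancy), relevancy.shape)
    return int(row), int(col)


# Synthetic demo: random weights and features, purely illustrative sizes.
rng = np.random.default_rng(0)
low = rng.normal(size=(8, 8, 4))    # rendered low-dimensional language map
w2 = rng.normal(size=(4, 8))        # hypothetical second-stage decoder
w1 = rng.normal(size=(8, 16))       # hypothetical first-stage decoder
full = decode_two_stage(low, w2, w1)
query = full[2, 3]                  # stand-in for a CLIP text embedding
print(locate(relevancy_map(full, query)))
```

In a real system the query vector would come from a CLIP text encoder, and the relevancy map would typically be thresholded into a mask rather than reduced to a single pixel.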

Disentangled optimization preserves the image quality without affecting the language quality
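One way to picture keeping the two objectives separate is a toy single-pixel blend in which the photometric loss updates colors and blending weights, while the language loss updates only the language features. This is a sketch under stated assumptions, not the method as implemented: the function `disentangled_step`, the squared-error losses, the treatment of blending weights as constants in the language branch, and the learning rates are all hypothetical choices made for illustration.

```python
import numpy as np


def disentangled_step(weights, colors, lang, target_rgb, target_lang,
                      lr_rgb=0.1, lr_lang=0.1):
    """One gradient step on a toy blend of N Gaussians for a single pixel.

    weights: (N,)   blending weights (stand-in for geometry/opacity)
    colors:  (N, 3) per-Gaussian RGB
    lang:    (N, D) per-Gaussian low-dimensional language feature
    """
    # Photometric branch: squared error on the blended color.
    rgb_residual = weights @ colors - target_rgb          # (3,)
    grad_colors = 2.0 * np.outer(weights, rgb_residual)   # (N, 3)
    grad_weights = 2.0 * colors @ rgb_residual            # (N,)

    # Language branch: weights are treated as constants (assumption),
    # so this loss can only move the language features.
    lang_residual = weights @ lang - target_lang          # (D,)
    grad_lang = 2.0 * np.outer(weights, lang_residual)    # (N, D)

    new_weights = weights - lr_rgb * grad_weights
    new_colors = colors - lr_rgb * grad_colors
    new_lang = lang - lr_lang * grad_lang
    return new_weights, new_colors, new_lang
```

Because the language residual never reaches `weights` or `colors` in this sketch, a poor language fit cannot degrade the rendered image, which is the behavior the caption above describes.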
Indoor Plant Language Query
3D Evaluation Comparison
Comprehensive evaluation across top-10 categories of Replica scenes
The table below summarizes the results, demonstrating the effectiveness of our approach in achieving high-quality language mapping.

Citation
@article{onlinelang,
  title   = {Online Language Splatting},
  author  = {Saimouli Katragadda and Cho-Ying Wu and Yuliang Guo and Xinyu Huang and Guoquan Huang and Liu Ren},
  journal = {arXiv},
  year    = {2025}
}