Online Language Splatting

ICCV 2025

Online Language Splatting introduces a fully online system that integrates dense CLIP features with Gaussian Splatting, offering a potential form of physical memory for real-time human–machine interaction and world modeling.

¹University of Delaware, ²Bosch Research North America
*Project Lead
LangSLAM Comparison GIF

Online, Dense Features

Our approach delivers dense, sharp, language-aligned features for 3D representations in real time.

Faster Processing

The plug-in modules can operate at 45 FPS.

Frames are processed about 215x faster: 0.8 seconds/frame, compared with 2.88 minutes/frame for LangSplat.
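The speedup can be sanity-checked from the two per-frame timings quoted above. The following is a minimal sketch in Python; the variable names are illustrative, and because the quoted timings are rounded, the ratio only approximately matches the reported figure.

```python
# Per-frame timings as quoted on this page (rounded values).
langsplat_seconds_per_frame = 2.88 * 60   # 2.88 minutes converted to seconds
ours_seconds_per_frame = 0.8

# Ratio of the two per-frame costs.
speedup = langsplat_seconds_per_frame / ours_seconds_per_frame
print(f"approximate speedup: {speedup:.0f}x")

# 45 FPS for the plug-in modules corresponds to this per-frame time budget.
module_ms_per_frame = 1000.0 / 45
print(f"module time per frame: {module_ms_per_frame:.1f} ms")
```

With these rounded inputs the ratio comes out near 216, which is consistent with the reported 215x up to rounding of the underlying measurements.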

Ours "Rug"

LangSplat "Rug"

Ours "Sofa"

LangSplat "Sofa"

Ours "Table"

LangSplat "Table"

Abstract

For AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with spatial representations. Prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting. However, these approaches depend on computationally intensive offline preprocessing of language features for each input image, which limits their adaptability to new environments and their use in real-world deployment.

In this work, we introduce Online Language Splatting, the first framework to achieve near real-time, open-vocabulary language mapping within a 3D Gaussian Splatting SLAM system without requiring pre-generated language features. The main technical challenge is to fuse high-dimensional language features into 3D representations efficiently while balancing processing speed, memory usage, rendering quality, and open-vocabulary capability. The result is a step toward language-grounded 3D scene understanding that runs online.

Pipeline

Pipeline Diagram

Our pipeline integrates 3D Gaussian Splatting with SLAM, using 3D Gaussians as the sole mapping elements. Left: During training, raw images pass through a High-Resolution CLIP embedding module, which generates high-resolution language features in real time. A two-stage CLIP compression module then compresses these features into low-dimensional maps, which keeps optimization efficient while preserving open-vocabulary capability. Within the 3D Gaussian map, RGB and language parameters are optimized separately (disentangled optimization) to accommodate the distinct requirements of each modality. Right: At inference, the rendered low-dimensional language map goes through a two-stage decoding process that reconstructs the full CLIP feature map, so that open-vocabulary queries such as "stool" can locate target objects.
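The inference path described above (render a low-dimensional map, decode it in two stages, then query it with text) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the decoders are random linear maps standing in for the learned decoding modules, the dimensions are arbitrary toy sizes rather than the paper's, and the "text embedding" is copied from one pixel instead of coming from a CLIP text encoder. The function names `two_stage_decode` and `relevancy_mask` are hypothetical.

```python
import numpy as np

def two_stage_decode(low_dim_map, decoder_stage2, decoder_stage1):
    """Map an (H, W, d_low) rendered map back to (H, W, d_clip).

    Both decoders are plain linear maps here, standing in for learned networks.
    """
    mid = low_dim_map @ decoder_stage2   # (H, W, d_mid)
    return mid @ decoder_stage1          # (H, W, d_clip)

def relevancy_mask(feature_map, text_embedding, threshold=0.5):
    """Cosine similarity between every pixel feature and one text embedding."""
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    sim = f @ t                          # (H, W) similarity map
    return sim, sim > threshold

rng = np.random.default_rng(0)
H, W, d_low, d_mid, d_clip = 4, 6, 8, 16, 32   # toy sizes, not the paper's

low = rng.normal(size=(H, W, d_low))           # stand-in for the rendered map
dec2 = rng.normal(size=(d_low, d_mid))         # stand-in second-stage decoder
dec1 = rng.normal(size=(d_mid, d_clip))        # stand-in first-stage decoder
features = two_stage_decode(low, dec2, dec1)

# Pretend the text embedding for "stool" matches the feature at pixel (2, 3).
query = features[2, 3].copy()
sim, mask = relevancy_mask(features, query, threshold=0.99)
best_pixel = np.unravel_index(np.argmax(sim), sim.shape)
print("most relevant pixel:", best_pixel)
```

In a real system the query vector would come from a text encoder that shares an embedding space with the decoded features, and the threshold would be tuned or replaced by a relevancy score.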

Optimization Diagram

Disentangled optimization preserves image quality without degrading language quality.
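One simple way to picture separate optimization of the two modalities is to keep the RGB-side and language-side parameters in separate groups and let each loss update only its own group. The toy sketch below shows that idea with NumPy and hand-written gradients. It is an illustrative assumption, not the paper's optimizer: the parameter shapes, the `step` helper, and the mean-squared-error losses are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of toy Gaussians

# Two separate parameter groups attached to the same set of Gaussians.
params = {
    "rgb": rng.normal(size=(n, 3)),    # color-side parameters
    "lang": rng.normal(size=(n, 4)),   # language-side parameters
}
target_lang = rng.normal(size=(n, 4))  # stand-in language supervision

def mse(pred, target):
    """Mean squared error."""
    return float(np.mean((pred - target) ** 2))

def mse_grad(pred, target):
    """Gradient of the mean squared error with respect to pred."""
    return 2.0 * (pred - target) / pred.size

def step(params, group, target, lr=0.1):
    """Gradient step on one parameter group; the other group is untouched."""
    params[group] = params[group] - lr * mse_grad(params[group], target)

rgb_before = params["rgb"].copy()
lang_loss_before = mse(params["lang"], target_lang)

# Language-only updates: the language loss never reaches the RGB parameters.
for _ in range(100):
    step(params, "lang", target_lang)

lang_loss_after = mse(params["lang"], target_lang)
rgb_unchanged = np.array_equal(params["rgb"], rgb_before)
print(lang_loss_before, lang_loss_after, rgb_unchanged)
```

Because the language loss touches only the `lang` group, the color parameters stay exactly as they were while the language loss decreases.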

Indoor Plant Language Query

3D Evaluation Comparison

Citation

If you find our work useful in your research, please consider citing our paper:

@inproceedings{onlinelang,
    title = {Online Language Splatting},
    author = {Saimouli Katragadda and Cho-Ying Wu and Yuliang Guo and Xinyu Huang and Guoquan Huang and Liu Ren},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year = {2025}
}