Online Language Splatting

¹University of Delaware, ²Bosch Research North America Center for AI

* Corresponding author

Video

Online, Dense Features

Our approach delivers dense, sharp, language-aligned features for 3D representations in real time.

Faster Processing

Plug-in modules can operate at 45 FPS

Processes frames roughly 215x faster: 0.8 seconds/frame, compared to 2.88 minutes/frame for LangSplat.
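As a quick sanity check on the speedup figure, the two per-frame timings quoted above can be converted to the same unit and divided. This is only arithmetic on the numbers stated here; the variable names are illustrative.

```python
# Per-frame timings as quoted on this page.
LANGSPLAT_MINUTES_PER_FRAME = 2.88
OURS_SECONDS_PER_FRAME = 0.8

# Convert LangSplat's timing to seconds so both values share a unit.
langsplat_seconds = LANGSPLAT_MINUTES_PER_FRAME * 60.0

# Ratio of the two per-frame times.
speedup = langsplat_seconds / OURS_SECONDS_PER_FRAME

print(f"LangSplat: {langsplat_seconds:.1f} s/frame")
print(f"Speedup:   {speedup:.0f}x")
```

The ratio comes out at about 216, consistent with the quoted figure of roughly 215x once rounding of the reported timings is taken into account.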

Qualitative comparison, Ours vs. LangSplat, for the queries "Rug", "Sofa", and "Table".

Abstract

To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting, these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments.

In this work, we introduce Online Language Splatting, the first framework to achieve near real-time, open-vocabulary language mapping within a 3D Gaussian Splatting SLAM system without requiring pre-generated language features. The key challenge we address is the efficiency of fusing high-dimensional language features into 3D representations while maintaining a balance between speed, memory usage, rendering quality, and open-vocabulary capability.

Pipeline

Our pipeline integrates 3D Gaussian Splatting with SLAM, using 3D Gaussians as the sole mapping elements. Left: During training, raw images are processed by a High-Resolution CLIP embedding module, which generates high-resolution language features in real time. A two-stage CLIP compression module then compresses these features into low-dimensional maps, which keeps optimization efficient while preserving open-vocabulary capability. RGB and language parameters are optimized separately within the 3D Gaussian map (disentangled optimization) because the two have distinct requirements. Right: At inference, the rendered low-dimensional language map passes through a two-stage decoder that reconstructs the full CLIP feature map, enabling open-vocabulary queries to locate target objects, such as “stool”.
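The inference path described above (low-dimensional map, two-stage decoding, open-vocabulary query) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the real compression stages are learned modules operating on CLIP features, whereas here `ENC1` and `ENC2` are small hand-written linear projections with orthonormal rows (so the transpose serves as the decoder), the 4-dimensional vectors `RUG` and `SOFA` stand in for CLIP embeddings, and the query is scored by cosine similarity.

```python
import math

S = 1.0 / math.sqrt(2.0)

# Hypothetical stand-ins for the two compression stages (not the paper's
# learned modules). Rows are orthonormal, so the transpose decodes.
ENC1 = [[S, S, 0.0, 0.0], [0.0, 0.0, S, S], [S, -S, 0.0, 0.0]]  # 4 -> 3
ENC2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                        # 3 -> 2


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def compress(feature):
    """Stage 1 then stage 2: full feature -> low-dimensional code."""
    return matvec(ENC2, matvec(ENC1, feature))


def decompress(code):
    """Stage 2 decoder then stage 1 decoder: code -> full feature."""
    return matvec(transpose(ENC1), matvec(transpose(ENC2), code))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def query(low_dim_map, text_embedding):
    """Return ((row, col), score) of the pixel best matching the query."""
    best_pos, best_score = None, -2.0
    for r, row in enumerate(low_dim_map):
        for c, code in enumerate(row):
            score = cosine(decompress(code), text_embedding)
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Toy "language features": a 2x2 image with one sofa pixel.
RUG = [1.0, 1.0, 0.0, 0.0]
SOFA = [0.0, 0.0, 1.0, 1.0]
feature_map = [[RUG, RUG], [SOFA, RUG]]

# Training side: store only the compressed codes in the map.
low_dim_map = [[compress(f) for f in row] for row in feature_map]

# Inference side: decode each pixel and compare against the text query.
position, score = query(low_dim_map, SOFA)
print(position, round(score, 6))
```

In this toy setup the stored map holds 2 numbers per pixel instead of 4, and the query still resolves to the sofa pixel because the chosen features lie in the subspace the projections keep. With real CLIP features and learned compressors, reconstruction would be approximate rather than exact.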

Disentangled optimization preserves the image quality without affecting the language quality
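One way to picture why separating the two optimizations can protect image quality is the toy sketch below. It is an illustration under stated assumptions, not the method used in this work: a single "Gaussian" has a shared `opacity` plus separate `color` and `lang` parameters, both renders are simply `opacity * value`, and "disentangled" is taken to mean that the language loss updates only `lang` and sends no gradient to the shared parameter. The function `step` and all values are hypothetical.

```python
def step(opacity, color, lang, target_rgb, target_lang,
         lr=0.1, disentangled=True):
    """One gradient-descent step on a single toy 'Gaussian'.

    Toy rendering model (an assumption for illustration):
        rendered_rgb  = opacity * color
        rendered_lang = opacity * lang
    Losses are squared errors against the targets.
    """
    rgb_err = opacity * color - target_rgb
    lang_err = opacity * lang - target_lang

    # Each appearance parameter is driven only by its own loss.
    grad_color = 2.0 * rgb_err * opacity
    grad_lang = 2.0 * lang_err * opacity

    # The shared parameter always receives the RGB gradient.
    grad_opacity = 2.0 * rgb_err * color
    if not disentangled:
        # Joint optimization: the language loss also pulls on opacity.
        grad_opacity += 2.0 * lang_err * lang

    return (opacity - lr * grad_opacity,
            color - lr * grad_color,
            lang - lr * grad_lang)


# Start from a state where the RGB render is already perfect
# (0.5 * 0.8 == 0.4) but the language render is not.
start = dict(opacity=0.5, color=0.8, lang=1.0, target_rgb=0.4, target_lang=1.0)

separate = step(**start, disentangled=True)
joint = step(**start, disentangled=False)

print("disentangled RGB render:", separate[0] * separate[1])
print("joint RGB render:       ", joint[0] * joint[1])
```

In the disentangled case the language parameter moves toward its target while the RGB render is left untouched. In the joint case the language loss shifts the shared opacity and the previously correct RGB render drifts.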

Indoor Plant Language Query

3D Evaluation Comparison

Comprehensive evaluation across top-10 categories of Replica scenes

The table below summarizes the results, demonstrating the effectiveness of our approach in achieving high-quality language mapping.

Citation

@article{onlinelang,
    title = {Online Language Splatting},
    author = {Saimouli Katragadda and Cho-Ying Wu and Yuliang Guo and Xinyu Huang and Guoquan Huang and Liu Ren},
    journal = {arxiv},
    year = {2025}
}
