Online Language Splatting

¹University of Delaware, ²Bosch Research North America Center for AI

* Corresponding author

  • LangSLAM Comparison GIF

    Online, Dense Features

    Our approach delivers dense, sharp, language-aligned features for 3D representations in real time.

    Faster Processing

    Processes frames about 215x faster: 0.8 seconds per frame, compared to 2.88 minutes per frame for LangSplat.
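
As a quick check of the speedup figure, the two per-frame times can be put in the same unit and divided. The inputs below are the rounded values quoted in the text, so the ratio comes out near 216x, which agrees with the quoted 215x up to rounding of the per-frame times. The variable names are illustrative only.

```python
# Per-frame processing times quoted in the text (rounded figures).
ours_seconds_per_frame = 0.8
langsplat_minutes_per_frame = 2.88

# Convert LangSplat's time to seconds so both values share a unit.
langsplat_seconds_per_frame = langsplat_minutes_per_frame * 60.0

# Ratio of the two per-frame times.
speedup = langsplat_seconds_per_frame / ours_seconds_per_frame
print(f"LangSplat: {langsplat_seconds_per_frame:.1f} s/frame, speedup: {speedup:.0f}x")
```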

    Ours "Rug"

    LangSplat "Rug"

    Ours "Sofa"

    LangSplat "Sofa"

    Ours "Table"

    LangSplat "Table"

    Abstract

    To interact seamlessly with both humans and 3D environments, AI agents must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting, these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments.

    In this work, we introduce Online Language Splatting, the first framework to achieve near real-time, open-vocabulary language mapping within a 3D Gaussian Splatting SLAM system without requiring pre-generated language features. The key challenge we address is the efficiency of fusing high-dimensional language features into 3D representations while maintaining a balance between speed, memory usage, rendering quality, and open-vocabulary capability.

    Pipeline

    Pipeline Diagram

    Our pipeline integrates 3D Gaussian Splatting with SLAM, using 3D Gaussians as the sole mapping elements. Left: During training, raw images are processed through a High-Resolution CLIP embedding module, which generates high-resolution language features in real time. These features are compressed via a two-stage CLIP compression module into low-dimensional maps for efficient optimization while preserving open-vocabulary capabilities. RGB and language parameters are optimized separately through disentangled optimization within the 3D Gaussian map to accommodate their distinct requirements. Right: At inference, the rendered low-dimensional language map undergoes a two-stage decoding process to reconstruct the full CLIP feature map, enabling open-vocabulary queries to locate target objects, such as “stool”.
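
The final query step can be illustrated with a minimal sketch. It assumes the low-dimensional map has already been decoded into a per-pixel feature map, and it scores each pixel by cosine similarity against a text embedding, then thresholds the result. The function names, the toy 3-dimensional features, and the 0.5 threshold are all hypothetical stand-ins. The source does not state the exact relevance score it uses, so this is one common choice rather than the authors' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as matching nothing
    return dot / (norm_a * norm_b)


def query_feature_map(feature_map, text_embedding, threshold=0.5):
    """Return (similarity_map, mask) for a per-pixel feature map.

    feature_map: H x W nested list of D-dimensional feature vectors.
    text_embedding: D-dimensional embedding of the text query.
    threshold: hypothetical cutoff; a real system would tune it.
    """
    similarity_map = [
        [cosine_similarity(pixel, text_embedding) for pixel in row]
        for row in feature_map
    ]
    mask = [[score >= threshold for score in row] for row in similarity_map]
    return similarity_map, mask


# Toy 2 x 2 map with 3-dimensional features standing in for decoded CLIP features.
toy_feature_map = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
toy_text_embedding = [1.0, 0.0, 0.0]  # stand-in for the embedding of "stool"

similarity_map, mask = query_feature_map(toy_feature_map, toy_text_embedding)
print(mask)  # [[True, True], [False, False]]
```

In this toy input, only the top row of pixels points in roughly the same direction as the query embedding, so only those pixels are selected.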

    Pipeline Diagram

    Disentangled optimization preserves image quality without affecting language quality.
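
A toy example can show why separating the parameters helps. The source does not specify which Gaussian parameters are disentangled, so the sketch below uses single scalars as hypothetical stand-ins. When one shared parameter receives gradients from both an RGB loss and a language loss with different targets, it settles on a compromise. When each loss updates its own parameter, both targets are reached.

```python
def grad_squared_error(param, target):
    """Gradient of (param - target)^2 with respect to param."""
    return 2.0 * (param - target)


# Hypothetical targets: the two losses want different values.
rgb_target, language_target = 1.0, 3.0
learning_rate = 0.1
num_steps = 200

# Entangled: a single shared parameter receives gradients from both losses.
shared_param = 0.0
for _ in range(num_steps):
    total_grad = (grad_squared_error(shared_param, rgb_target)
                  + grad_squared_error(shared_param, language_target))
    shared_param -= learning_rate * total_grad

# Disentangled: each loss updates only its own parameter.
rgb_param, language_param = 0.0, 0.0
for _ in range(num_steps):
    rgb_param -= learning_rate * grad_squared_error(rgb_param, rgb_target)
    language_param -= learning_rate * grad_squared_error(language_param, language_target)

print(f"shared: {shared_param:.3f}")
print(f"separate: rgb={rgb_param:.3f}, language={language_param:.3f}")
```

The shared parameter converges to the midpoint of the two targets and satisfies neither loss, while the separate parameters each converge to their own target.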

    Indoor Plant Language Query

    3D Evaluation Comparison

    Comprehensive evaluation across the top-10 categories of Replica scenes

    The table below summarizes the results, demonstrating the effectiveness of our approach in achieving high-quality language mapping.

    Table Results

    Citation

    @article{onlinelang,
        title = {Online Language Splatting},
        author = {Saimouli Katragadda and Cho-Ying Wu and Yuliang Guo and Xinyu Huang and Guoquan Huang and Liu Ren},
        journal = {arXiv},
        year = {2025}
    }