RAG for Robotics: Chunking, Embeddings, and Retrieval

Problem context. Robotics students quickly run into a reality check with large language models (LLMs): your robot may have access to a lot of knowledge (PDF manuals, maps, datasheets, mission logs), but the LLM has a limited context window. Paste an entire PDF into the prompt and you’ll overflow that window, pay more in latency and cost, and still get answers that miss key details.

Retrieval-Augmented Generation (RAG) is the standard fix: store your documents in a searchable memory, retrieve only the most relevant passages, then ask the LLM to answer using those passages. In robotics terms, RAG is like building a perception-to-action pipeline: you don’t process the entire world at full resolution every frame—you gate computation with attention.

RAG in one line: Query $\rightarrow$ Retrieve relevant chunks $\rightarrow$ Generate answer grounded in retrieved text.

The Minimal RAG Pipeline

A practical RAG system has these stages:

1) Ingest: load documents (PDFs, logs, maps) and convert them to text.
2) Chunk: split the text into retrieval-sized pieces.
3) Embed: map each chunk to a vector.
4) Index: store vectors in a searchable structure (brute force, or a vector index at scale).
5) Retrieve: embed the query and fetch the top-k most similar chunks.
6) Generate: prompt the LLM to answer using only the retrieved chunks.

The two most important knobs (and easiest to mess up) are: (1) chunking and (2) retrieval/indexing. If chunking is bad, retrieval is bad. If retrieval is bad, generation hallucinates.

Why Chunking Matters (More Than People Think)

Chunking is your “state representation” for retrieval. The LLM can only answer using what you put in front of it, so if the relevant fact is split across chunks or buried in a huge chunk, the model won’t reliably use it.

Suppose your robot has a maintenance manual PDF and you ask: “What torque spec should I use for the wheel hub bolts?” If the torque table is separated from the wheel hub section by a page break and your chunking splits them, retrieval might fetch the wrong chunk, and the LLM will guess.

Chunking Types

Here are five common chunking styles: what each means, when it works, and its failure modes.

1) Fixed-Size Chunking (Fast, but Can Break Meaning)

Fixed-size chunking means: every chunk is ~N tokens/words (e.g., 200–400 words), often with overlap. This is the “quick baseline.”

# Fixed-size-ish chunking with overlap (simple baseline)
def fixed_word_chunks(words, chunk_size=200, overlap=40):
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+chunk_size]
        chunks.append(" ".join(chunk))
        i += max(1, chunk_size - overlap)
    return chunks

with open("manual.txt", "r", encoding="utf-8") as f:
    words = f.read().split()
chunks = fixed_word_chunks(words, chunk_size=220, overlap=50)
print(len(chunks), "chunks")
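To see the overlap behavior concretely, here's a toy run on synthetic word tokens (the function is repeated so the snippet runs standalone):

```python
def fixed_word_chunks(words, chunk_size=200, overlap=40):
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + chunk_size]))
        i += max(1, chunk_size - overlap)
    return chunks

# 500 synthetic words: w0, w1, ..., w499
words = [f"w{j}" for j in range(500)]
chunks = fixed_word_chunks(words, chunk_size=200, overlap=40)

print(len(chunks))           # 4 chunks (stride = 200 - 40 = 160)
print(chunks[1].split()[0])  # "w160": chunk 1 re-covers the last 40 words of chunk 0
```

The overlap is what saves you when a fact straddles a boundary: the straddling words appear in two chunks, so at least one chunk contains the fact whole.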

Robotics analogy: fixed-size chunking is like uniform downsampling of a point cloud. It's fast, but it ignores structure (edges, planes, objects).

2) Semantic Chunking (Meaning-Preserving, but Costly)

Semantic chunking tries to keep “ideas” together. A common trick: split into sentences, embed each sentence, then merge adjacent sentences while they’re “similar enough” (above a threshold).

# Semantic-ish chunking sketch: merge adjacent sentences if embeddings are similar
# (Uses sentence-transformers; install: pip install sentence-transformers)

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, model_name="all-MiniLM-L6-v2", sim_threshold=0.75, max_sents=10):
    model = SentenceTransformer(model_name)
    embs = model.encode(sentences, normalize_embeddings=True)

    chunks = []
    cur = [sentences[0]]
    cur_emb = embs[0]

    for i in range(1, len(sentences)):
        sim = float(np.dot(cur_emb, embs[i]))  # cosine since normalized
        if sim >= sim_threshold and len(cur) < max_sents:
            cur.append(sentences[i])
            # update chunk embedding (simple average)
            cur_emb = (cur_emb * (len(cur)-1) + embs[i]) / len(cur)
            cur_emb = cur_emb / np.linalg.norm(cur_emb)
        else:
            chunks.append(" ".join(cur))
            cur = [sentences[i]]
            cur_emb = embs[i]

    chunks.append(" ".join(cur))
    return chunks

The rule is simple: keep appending sentences while they stay similar enough to the running chunk embedding; when similarity drops below the threshold, start a new chunk.
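The merge rule is easier to see on toy vectors than through a real model. Below, the same loop runs on hand-made 2-D "embeddings" (no model download needed): the first three sentences share a direction and merge; the fourth points elsewhere and starts a new chunk.

```python
import numpy as np

def merge_by_similarity(sentences, embs, sim_threshold=0.75, max_sents=10):
    # embs: unit-normalized vectors, one per sentence
    chunks, cur, cur_emb = [], [sentences[0]], embs[0]
    for i in range(1, len(sentences)):
        sim = float(np.dot(cur_emb, embs[i]))  # cosine, since normalized
        if sim >= sim_threshold and len(cur) < max_sents:
            cur.append(sentences[i])
            cur_emb = (cur_emb * (len(cur) - 1) + embs[i]) / len(cur)
            cur_emb = cur_emb / np.linalg.norm(cur_emb)
        else:
            chunks.append(" ".join(cur))
            cur, cur_emb = [sentences[i]], embs[i]
    chunks.append(" ".join(cur))
    return chunks

sents = ["Torque the bolts.", "Use 45 Nm.", "Check twice.", "The map has 3 floors."]
embs = np.array([[1.0, 0.0], [0.98, 0.199], [0.95, 0.312], [0.0, 1.0]])
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)

merged = merge_by_similarity(sents, embs)
print(merged)  # two chunks: the three similar sentences, then the outlier
```

The sentences and vectors here are invented for illustration; with real text you'd get `embs` from a sentence embedding model as in the sketch above.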

3) Recursive Chunking (Good Default in Practice)

Recursive chunking tries coarse splits first (paragraphs/sections), then recursively splits the pieces that are too big. It often gives a nice balance: more structure-aware than fixed-size, cheaper than pure semantic merging.

# Simple recursive splitter: try double-newline, then newline, then sentence-ish, then hard cut
import re

def recursive_split(text, max_chars=1200):
    seps = ["\n\n", "\n", ". "]
    pieces = [text]

    for sep in seps:
        new_pieces = []
        for p in pieces:
            if len(p) <= max_chars:
                new_pieces.append(p)
            else:
                new_pieces.extend([x.strip() for x in p.split(sep) if x.strip()])
        pieces = new_pieces

    # hard cut for leftovers
    final = []
    for p in pieces:
        if len(p) <= max_chars:
            final.append(p)
        else:
            for i in range(0, len(p), max_chars):
                final.append(p[i:i+max_chars])
    return final
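A quick sanity check on synthetic text (the function is repeated so the snippet runs standalone): the guarantee to verify is that every final piece respects max_chars.

```python
def recursive_split(text, max_chars=1200):
    seps = ["\n\n", "\n", ". "]
    pieces = [text]
    for sep in seps:
        new_pieces = []
        for p in pieces:
            if len(p) <= max_chars:
                new_pieces.append(p)
            else:
                new_pieces.extend(x.strip() for x in p.split(sep) if x.strip())
        pieces = new_pieces
    # hard cut for leftovers
    final = []
    for p in pieces:
        if len(p) <= max_chars:
            final.append(p)
        else:
            final.extend(p[i:i + max_chars] for i in range(0, len(p), max_chars))
    return final

doc = ("Section one talks about wheel hubs. It is fairly long. " * 3
       + "\n\nSection two is short.")
parts = recursive_split(doc, max_chars=80)
print(all(len(p) <= 80 for p in parts))  # True: long paragraph fell through to sentence splits
```

Note the order of `seps` encodes a preference: paragraph boundaries first, then lines, then sentences; the hard cut is a last resort for pathological runs with no separators.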

4) Structural Chunking (Exploit Document Structure)

Structural chunking uses document headings like “Introduction”, “Method”, “Results”, etc. It is excellent for well-formatted documents, because you get human-readable chunks like “Intro chunk” or “Method chunk 2”.

# Structural chunking sketch: split on ALL-CAPS or numbered section headings
import re

def heading_chunks(text):
    # After PDF-to-text conversion, detect headings by ALL CAPS or "1." / "1.2" numbering
    pattern = r"\n(?=(?:\d+\.\d+|\d+\.|[A-Z][A-Z\s]{6,})\n)"
    parts = [p.strip() for p in re.split(pattern, text) if p.strip()]
    return parts
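Here's the regex in action on a small synthetic plain-text extract (real PDF-to-text output is messier; the heading heuristic will need tuning per document family):

```python
import re

def heading_chunks(text):
    # Split before lines that look like numbered sections or long ALL-CAPS headings
    pattern = r"\n(?=(?:\d+\.\d+|\d+\.|[A-Z][A-Z\s]{6,})\n)"
    return [p.strip() for p in re.split(pattern, text) if p.strip()]

doc = ("INTRODUCTION\nThis robot has four wheels.\n"
       "MAINTENANCE SCHEDULE\nTorque hub bolts to 45 Nm.\n")
sections = heading_chunks(doc)
print(len(sections))                    # 2
print(sections[1].split("\n")[0])       # MAINTENANCE SCHEDULE
```

Each chunk keeps its heading as the first line, which doubles as useful metadata at retrieval time.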

5) LLM-Based Chunking (High Semantic Accuracy, But Stochastic)

You can ask an LLM to segment a document into coherent chunks, but you pay extra tokens, and you may get slightly different chunk boundaries on each run.

# Prompt idea (pseudo): ask an LLM to produce JSON chunks with titles and text.
# Use temperature=0 for more deterministic behavior.

SYSTEM: "You are a document segmenter. Return JSON only."
USER:
"Split the following text into coherent chunks suitable for retrieval.
Each chunk: {title, content}. Keep content < 250 words. Preserve tables as one chunk when possible.
TEXT:
... (paste text) ..."
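Whichever API you call, validate the response before indexing it; a minimal parser for the JSON shape requested above (the `raw` string here stands in for a real LLM response):

```python
import json

def parse_llm_chunks(raw):
    """Validate the segmenter's JSON output; raise if the shape is wrong."""
    data = json.loads(raw)
    chunks = []
    for item in data["chunks"]:
        if not item.get("title") or not item.get("content"):
            raise ValueError(f"malformed chunk: {item!r}")
        chunks.append(item)
    return chunks

# Simulated LLM response
raw = '{"chunks": [{"title": "Wheel hub torque", "content": "Torque hub bolts to 45 Nm."}]}'
parsed = parse_llm_chunks(raw)
print(parsed[0]["title"])  # Wheel hub torque
```

Because LLM output is stochastic, treat parsing failures as a normal case: retry, or fall back to recursive chunking.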

Embedding: Turning Chunks into Vectors

After chunking, we embed each chunk into a vector $\mathbf{e} \in \mathbb{R}^d$. Typical retrieval compares a query embedding $\mathbf{q}$ to chunk embeddings $\mathbf{e}_i$ via dot product (often equivalent to cosine similarity if vectors are normalized):

$$ s_i = \mathbf{q}^\top \mathbf{e}_i \quad\text{(dot product)} $$

The system returns the top-$k$ chunks with the largest scores $s_i$.

# Embed chunks and a query with sentence-transformers
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
query = "What is the torque spec for the wheel hub bolts?"
q = model.encode([query], normalize_embeddings=True)[0]

scores = chunk_embeddings @ q  # dot product
topk = scores.argsort()[-5:][::-1]
for idx in topk:
    print(scores[idx], chunks[idx][:200], "...\n")
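One practical note: `argsort` sorts all N scores (O(N log N)) just to read off the top few. For large corpora, `np.argpartition` finds the top-k in O(N) and only sorts the k winners:

```python
import numpy as np

def top_k_indices(sims, k):
    # Partition so the k largest land (unordered) in the last k slots, then sort just those
    part = np.argpartition(sims, -k)[-k:]
    return part[np.argsort(sims[part])[::-1]]

sims = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(top_k_indices(sims, 3))  # [1 3 4]
```

For a few thousand chunks the difference is negligible; it matters once a robot has accumulated months of logs.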

Vector Databases and Indexing

If you have a small dataset, you can brute-force compute $s_i$ for all chunks. But with large PDFs, many documents, or long-running robots that accumulate logs, brute force gets slow. So we use a vector index.

IVF Flat (Inverted File index + flat storage). Intuition: learn nlist coarse centroids (typically via k-means), which partition embedding space into Voronoi cells (one cell per centroid). Each centroid owns an inverted list of vectors assigned to it.

At query time, we first find the nearest centroids to the query (using the coarse quantizer), then compute exact similarities only within those selected lists. This reduces the number of candidate vectors we score compared to brute force.
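Before reaching for a library, the two-stage idea can be sketched in plain NumPy (toy k-means, made-up sizes). With `nprobe` equal to `nlist` this degenerates to exact search; smaller `nprobe` trades recall for speed:

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_kmeans(X, nlist, iters=10):
    # A few Lloyd iterations; real systems use faiss or scikit-learn for this
    C = X[rng.choice(len(X), nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmax(X @ C.T, axis=1)        # nearest centroid (cosine, vectors normalized)
        for c in range(nlist):
            if np.any(assign == c):
                C[c] = X[assign == c].mean(axis=0)
        C /= np.linalg.norm(C, axis=1, keepdims=True)
    assign = np.argmax(X @ C.T, axis=1)            # final inverted-list assignment
    return C, assign

def ivf_search(qv, X, C, assign, nprobe=2):
    probe = np.argsort(C @ qv)[-nprobe:]           # coarse step: nearest cells
    cand = np.flatnonzero(np.isin(assign, probe))  # candidates in those cells only
    return cand[np.argmax(X[cand] @ qv)]           # fine step: exact scoring of candidates

X = rng.normal(size=(200, 8)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
C, assign = tiny_kmeans(X, nlist=8)

qv = X[17]                                         # query identical to a stored vector
best = ivf_search(qv, X, C, assign, nprobe=8)      # probing every cell -> exact search
print(best)  # 17
```

With `nprobe` below `nlist`, the true nearest neighbor can live in an unprobed cell; that missed-neighbor case is exactly the recall loss you tune `nprobe` against.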

# FAISS IVF Flat example (fast ANN search)
# pip install faiss-cpu

import faiss
import numpy as np

# Suppose we have embeddings as float32 matrix: (N, d)
X = np.asarray(chunk_embeddings, dtype=np.float32)
N, d = X.shape

nlist = 256          # number of coarse clusters; keep well below N so k-means has enough training points
quantizer = faiss.IndexFlatIP(d)   # inner product (dot product)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

# Train on data (required for IVF)
index.train(X)
index.add(X)

# Query
k = 5
q = np.asarray([q], dtype=np.float32)
index.nprobe = 8     # how many clusters to probe (speed/recall tradeoff)
scores, ids = index.search(q, k)

for score, idx in zip(scores[0], ids[0]):
    print(score, chunks[idx][:200], "...\n")

Robotics analogy: IVF is like doing a coarse global localization first (which “region” of the map am I in?), then doing fine pose estimation inside that region.

Generation: Making the LLM Use Retrieved Evidence

The generation step is where many RAG systems fail: they retrieve good chunks, then the prompt doesn’t force grounding. A robust prompt style is:

# Prompt template idea (works with most LLM APIs)
def make_prompt(question, retrieved_chunks):
    context = "\n\n".join([f"[Chunk {i}] {c}" for i, c in enumerate(retrieved_chunks)])
    return f"""
You are a robotics assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer (include chunk numbers as citations like [Chunk 2]):
""".strip()

Putting It Together: A Tiny RAG Loop

# End-to-end skeleton (retrieval + prompt). The LLM call is pseudocode.

def rag_answer(question, index, chunks, embed_model, top_k=5):
    q = embed_model.encode([question], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, top_k)

    retrieved = [chunks[i] for i in ids[0] if i != -1]
    prompt = make_prompt(question, retrieved)

    # llm_response = llm.generate(prompt)   # replace with your LLM call
    # return llm_response
    return prompt, retrieved, scores[0], ids[0]

# Example usage:
# prompt, retrieved, scores, ids = rag_answer("What is the torque spec ...", index, chunks, model)

Robotics Example: “VR Glasses See a Building” + RAG

Here’s a robotics-flavored system design that matches how students actually build these:

1) The headset camera captures the current view.
2) A vision model recognizes the building and outputs a label.
3) That label (plus the user’s question) becomes the retrieval query against your indexed documents about the building.
4) The LLM generates a grounded summary from the retrieved chunks.

The key point: don’t ask the LLM to “know” the world. Let perception determine what we’re looking at; let RAG supply facts; let the LLM do language + summarization.
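This division of labor can be sketched as glue code. Everything below is a hypothetical stub: `detect_building` stands in for a real vision model, and the `facts_by_building` dict (with invented facts) stands in for a real FAISS-backed retriever.

```python
# Hypothetical glue: perception outputs a label, RAG supplies the facts, the LLM phrases them.
facts_by_building = {
    "clock_tower": ["Built in 1921.", "Houses the campus carillon."],
    "robotics_lab": ["Contains the UR5 arms.", "Badge access required."],
}

def detect_building(image):
    # Stand-in for a vision model (e.g., a CLIP classifier or detector)
    return "clock_tower"

def describe_building(image, top_k=2):
    label = detect_building(image)                     # perception: what am I looking at?
    chunks = facts_by_building.get(label, [])[:top_k]  # retrieval: facts for that label
    # generation (pseudo): llm.generate(make_prompt(f"Describe {label}", chunks))
    return f"{label}: " + " ".join(chunks)

print(describe_building(image=None))
```

Note that the LLM never has to "know" the clock tower exists; if perception outputs an unknown label, retrieval returns nothing and the grounding prompt should make the model say it doesn't know.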

Common Failure Modes (And How Robotics Students Should Debug Them)

- Retrieval returns the wrong chunks. Usually a chunking problem: print the retrieved chunks for a few questions with known answers and check whether the relevant fact was split across chunks (the torque-table case above). Fix chunk size or overlap, or switch to structural/recursive chunking.
- Good chunks, hallucinated answers. The prompt isn't forcing grounding: require chunk citations and an explicit "I don't know" path, as in the template above.
- IVF search misses answers that brute force finds. Recall is too low: increase nprobe (at the cost of speed), and check that the index was trained on representative data.

Practical Defaults (If You Just Want It to Work)

- Chunking: recursive splitting, roughly 200–300 words (or ~1200 characters) per chunk, with modest overlap.
- Embeddings: a small sentence-transformers model such as all-MiniLM-L6-v2, with normalized embeddings so dot product equals cosine similarity.
- Retrieval: brute-force top-5 until it's too slow, then FAISS IVF Flat with nprobe tuned for recall.
- Generation: a grounding prompt that restricts the LLM to retrieved context and asks for chunk citations.
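For reference, the defaults used throughout this article, collected in one place (values taken from the examples above; nothing here is sacred, tune per document type):

```python
# Starting configuration; every value mirrors an example earlier in this article
DEFAULTS = {
    "chunking": "recursive",
    "max_chars": 1200,                  # per chunk, before the hard cut
    "embed_model": "all-MiniLM-L6-v2",
    "normalize_embeddings": True,       # so dot product == cosine similarity
    "top_k": 5,
    "index": "flat until slow, then IVF Flat",
    "nprobe": 8,                        # raise for recall, lower for speed
}
print(DEFAULTS["top_k"])  # 5
```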