Problem context. Robotics students quickly run into a reality check with large language models (LLMs): your robot may have a lot of knowledge (PDF manuals, maps, datasheets, mission logs), but the LLM has a limited context window. If you try to paste an entire PDF into the prompt, you’ll overflow the context window, pay more in latency and cost, and still get answers that miss key details.
Retrieval-Augmented Generation (RAG) is the standard fix: store your documents in a searchable memory, retrieve only the most relevant passages, then ask the LLM to answer using those passages. In robotics terms, RAG is like building a perception-to-action pipeline: you don’t process the entire world at full resolution every frame—you gate computation with attention.
RAG in one line: Query $\rightarrow$ Retrieve relevant chunks $\rightarrow$ Generate answer grounded in retrieved text.
A practical RAG system has these stages: load and parse the documents, chunk them, embed each chunk, index the embeddings, retrieve the top-matching chunks for a query, and generate an answer grounded in those chunks.
The two most important knobs (and easiest to mess up) are: (1) chunking and (2) retrieval/indexing. If chunking is bad, retrieval is bad. If retrieval is bad, generation hallucinates.
Chunking is your “state representation” for retrieval. The LLM can only answer using what you put in front of it, so if the relevant fact is split across chunks or buried in a huge chunk, the model won’t reliably use it.
Suppose your robot has a maintenance manual PDF and you ask: “What torque spec should I use for the wheel hub bolts?” If the torque table is separated from the wheel hub section by a page break and your chunking splits them, retrieval might fetch the wrong chunk, and the LLM will guess.
There are five common chunking styles. Here’s what each one means, when it works, and its failure modes.
Fixed-size chunking means: every chunk is ~N tokens/words (e.g., 200–400 words), often with overlap. This is the “quick baseline.”
# Fixed-size-ish chunking with overlap (simple baseline)
def fixed_word_chunks(words, chunk_size=200, overlap=40):
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+chunk_size]
        chunks.append(" ".join(chunk))
        i += max(1, chunk_size - overlap)  # slide forward, keeping `overlap` words of context
    return chunks
with open("manual.txt", "r", encoding="utf-8") as f:
    text = f.read()
words = text.split()
chunks = fixed_word_chunks(words, chunk_size=220, overlap=50)
print(len(chunks), "chunks")
Robotics analogy: fixed-size chunking is like uniform downsampling of a point cloud. It’s fast, but it ignores structure (edges, planes, objects).
Semantic chunking tries to keep “ideas” together. A common trick: split into sentences, embed each sentence, then merge adjacent sentences while they’re “similar enough” (above a threshold).
# Semantic-ish chunking sketch: merge adjacent sentences if embeddings are similar
# (Uses sentence-transformers; install: pip install sentence-transformers)
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, model_name="all-MiniLM-L6-v2", sim_threshold=0.75, max_sents=10):
    model = SentenceTransformer(model_name)
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks = []
    cur = [sentences[0]]
    cur_emb = embs[0]
    for i in range(1, len(sentences)):
        sim = float(np.dot(cur_emb, embs[i]))  # cosine similarity, since embeddings are normalized
        if sim >= sim_threshold and len(cur) < max_sents:
            cur.append(sentences[i])
            # update the running chunk embedding (simple average, re-normalized)
            cur_emb = (cur_emb * (len(cur) - 1) + embs[i]) / len(cur)
            cur_emb = cur_emb / np.linalg.norm(cur_emb)
        else:
            chunks.append(" ".join(cur))
            cur = [sentences[i]]
            cur_emb = embs[i]
    chunks.append(" ".join(cur))
    return chunks
This is the type your notes describe: “Semantic then … if they are similar add them in … threshold.”
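Note that semantic_chunks expects a list of sentences as input. A minimal regex splitter can supply one; this is a naive sketch (it will mis-split abbreviations like "Fig. 3"), not a production sentence segmenter:

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., !, or ? when followed by whitespace
    # and a capital letter. Known limitation: abbreviations get split too.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

print(split_sentences("Tighten the bolts. Use 45 Nm torque. Check again."))
```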
Recursive chunking tries coarse splits first (paragraphs/sections), then recursively splits the pieces that are too big. It often gives a nice balance: more structure-aware than fixed-size, cheaper than pure semantic merging.
# Simple recursive splitter: try double-newline, then newline, then sentence-ish, then hard cut
def recursive_split(text, max_chars=1200):
    seps = ["\n\n", "\n", ". "]  # note: splitting on ". " drops the period from each piece
    pieces = [text]
    for sep in seps:
        new_pieces = []
        for p in pieces:
            if len(p) <= max_chars:
                new_pieces.append(p)  # small enough: keep as-is
            else:
                new_pieces.extend([x.strip() for x in p.split(sep) if x.strip()])
        pieces = new_pieces
    # hard character cut for any leftover oversized pieces
    final = []
    for p in pieces:
        if len(p) <= max_chars:
            final.append(p)
        else:
            for i in range(0, len(p), max_chars):
                final.append(p[i:i+max_chars])
    return final
Structural chunking uses headings like “Introduction”, “Method”, “Results”, etc. This is excellent for well-formatted documents, because you get human-readable chunks like: “Intro chunk”, “Method chunk 2”, and so on (exactly what your notes mention).
# Structural chunking sketch: split plain text (e.g., converted from a PDF)
# on heading-like lines: ALL-CAPS runs or numbered sections such as "2." / "2.1"
import re

def heading_chunks(text):
    pattern = r"\n(?=(?:\d+\.\d+|\d+\.|[A-Z][A-Z\s]{6,})\n)"
    parts = [p.strip() for p in re.split(pattern, text) if p.strip()]
    return parts
LLM-based chunking: you can ask an LLM to segment a document into coherent chunks, but you pay extra tokens and you may get slightly different chunk boundaries each run.
# Prompt idea (pseudo): ask an LLM to produce JSON chunks with titles and text.
# Use temperature=0 for more deterministic behavior.
SYSTEM: "You are a document segmenter. Return JSON only."
USER:
"Split the following text into coherent chunks suitable for retrieval.
Each chunk: {title, content}. Keep content < 250 words. Preserve tables as one chunk when possible.
TEXT:
... (paste text) ..."
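Whatever model you call (the LLM call itself is yours to fill in), you still need to parse and validate the JSON it returns, since LLM output is not guaranteed to comply with the prompt. A defensive sketch:

```python
import json

def parse_chunk_json(raw):
    # Parse the LLM's JSON output into a list of {"title", "content"} dicts.
    # Validate defensively: keep only well-formed entries, drop the rest.
    data = json.loads(raw)
    chunks = []
    for item in data:
        if isinstance(item, dict) and "title" in item and "content" in item:
            chunks.append({"title": str(item["title"]), "content": str(item["content"])})
    return chunks

# Example response (hypothetical model output):
raw = '[{"title": "Wheel hub", "content": "Torque the hub bolts to 45 Nm."}]'
print(parse_chunk_json(raw))
```

In practice you would also wrap `json.loads` in a try/except and retry the LLM call on malformed output.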
After chunking, we embed each chunk into a vector $\mathbf{e} \in \mathbb{R}^d$. Typical retrieval compares a query embedding $\mathbf{q}$ to chunk embeddings $\mathbf{e}_i$ via dot product (often equivalent to cosine similarity if vectors are normalized):
$$ s_i = \mathbf{q}^\top \mathbf{e}_i \quad\text{(dot product)} $$

The system returns the top-$k$ chunks with the largest scores $s_i$. Your notes literally say: “embedding query dotproduct … top get chunks.”
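A quick NumPy check of that parenthetical claim: after normalizing two vectors to unit length, their dot product equals the cosine similarity of the raw vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=3)

# Cosine similarity of the raw vectors
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after normalizing each vector to unit length
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = a_hat @ b_hat

assert np.isclose(cos, dot)  # identical up to floating-point error
```

This is why embedding libraries offer a normalize option: once vectors are unit-length, cheap inner-product search is cosine search.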
# Embed chunks and a query with sentence-transformers
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

query = "What is the torque spec for the wheel hub bolts?"
q = model.encode([query], normalize_embeddings=True)[0]

scores = chunk_embeddings @ q  # dot product against every chunk
topk = scores.argsort()[-5:][::-1]  # indices of the 5 highest scores
for idx in topk:
    print(scores[idx], chunks[idx][:200], "...\n")
If you have a small dataset, you can brute-force compute $s_i$ for all chunks. But with large PDFs, many documents, or long-running robots that accumulate logs, brute force gets slow. So we use a vector index.
IVF Flat (Inverted File index + flat storage). Intuition: learn nlist coarse centroids (typically via k-means), which partition embedding space into Voronoi cells (one cell per centroid). Each centroid owns an inverted list of vectors assigned to it.
At query time, we first find the nearest centroids to the query (using the coarse quantizer), then compute exact similarities only within those selected lists. This reduces the number of candidate vectors we score compared to brute force.
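To make that mechanism concrete, here is a toy NumPy-only IVF-Flat: a few k-means (Lloyd) iterations build the coarse centroids and inverted lists, and search probes only the nearest cells. This is an illustration under the assumption of unit-normalized embeddings (so inner product equals cosine), not a replacement for FAISS:

```python
import numpy as np

def build_ivf(X, nlist=4, iters=10, seed=0):
    # Coarse quantizer: a few Lloyd (k-means) iterations over the vectors.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmax(X @ centroids.T, axis=1)  # nearest centroid by dot product
        for c in range(nlist):
            members = X[assign == c]
            if len(members):
                m = members.mean(axis=0)
                centroids[c] = m / np.linalg.norm(m)
    assign = np.argmax(X @ centroids.T, axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(nlist)}  # inverted lists
    return centroids, lists

def ivf_search(q, centroids, lists, X, k=3, nprobe=2):
    # Probe the nprobe closest cells, then score exactly within them.
    probe = np.argsort(centroids @ q)[-nprobe:]
    cand = np.concatenate([lists[c] for c in probe if len(lists[c])])
    scores = X[cand] @ q
    order = np.argsort(scores)[-k:][::-1]
    return cand[order], scores[order]

# Demo on synthetic unit vectors: querying with a stored vector should return it first.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
centroids, lists = build_ivf(X)
ids, scores = ivf_search(X[0], centroids, lists, X)
print(ids[0], scores[0])
```

Only the vectors in the probed cells get scored, which is exactly the speedup (and the recall risk, if the true match lives in an unprobed cell).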
# FAISS IVF Flat example (fast ANN search)
# pip install faiss-cpu
import faiss
import numpy as np

# Suppose we have embeddings as a float32 matrix of shape (N, d)
X = np.asarray(chunk_embeddings, dtype=np.float32)
N, d = X.shape

nlist = 256  # number of coarse clusters
quantizer = faiss.IndexFlatIP(d)  # inner product (dot product)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

# Train on data (required for IVF: learns the coarse centroids).
# k-means needs enough data; FAISS warns if you train on too few points per centroid.
index.train(X)
index.add(X)

# Query
k = 5
qv = np.asarray([q], dtype=np.float32)  # shape (1, d)
index.nprobe = 8  # how many clusters to probe (speed/recall tradeoff)
scores, ids = index.search(qv, k)
for score, idx in zip(scores[0], ids[0]):
    print(score, chunks[idx][:200], "...\n")
Robotics analogy: IVF is like doing a coarse global localization first (which “region” of the map am I in?), then doing fine pose estimation inside that region.
The generation step is where many RAG systems fail: they retrieve good chunks, then the prompt doesn’t force grounding. A robust prompt style is:
# Prompt template idea (works with most LLM APIs)
def make_prompt(question, retrieved_chunks):
    context = "\n\n".join([f"[Chunk {i}] {c}" for i, c in enumerate(retrieved_chunks)])
    return f"""
You are a robotics assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer (include chunk numbers as citations like [Chunk 2]):
""".strip()
# End-to-end skeleton (retrieval + prompt). The LLM call is pseudocode.
def rag_answer(question, index, chunks, embed_model, top_k=5):
    q = embed_model.encode([question], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, top_k)
    retrieved = [chunks[i] for i in ids[0] if i != -1]  # FAISS returns -1 for "no result"
    prompt = make_prompt(question, retrieved)
    # llm_response = llm.generate(prompt)  # replace with your LLM call
    # return llm_response
    return prompt, retrieved, scores[0], ids[0]
# Example usage:
# prompt, retrieved, scores, ids = rag_answer("What is the torque spec ...", index, chunks, model)
Here’s a robotics-flavored system design that matches how students actually build these: the perception stack identifies what the robot is looking at (say, a wheel hub), that label plus the user’s question forms the retrieval query, RAG fetches the relevant manual passages, and the LLM composes a grounded, cited answer the robot can act on.

The key point: don’t ask the LLM to “know” the world. Let perception determine what we’re looking at; let RAG supply facts; let the LLM do language + summarization.
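That division of labor can be wired up in a few lines. In this sketch, `retrieve` and `generate` are hypothetical stand-ins for your vector-index search and LLM call; the stubs in the usage example exist only so the flow can run:

```python
def robot_qa(detected_label, question, retrieve, generate, top_k=3):
    # Perception supplies what we're looking at; RAG supplies the facts;
    # the LLM only phrases an answer grounded in the retrieved text.
    query = f"{detected_label}: {question}"          # fold the perception label into the query
    chunks = retrieve(query, top_k)                  # your retriever (e.g., FAISS top-k)
    context = "\n\n".join(f"[Chunk {i}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using ONLY the context below. If the answer is missing, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                          # your LLM call

# Stub wiring, just to show the data flow end to end:
answer = robot_qa(
    "wheel hub", "What is the torque spec?",
    retrieve=lambda query, k: ["Torque the hub bolts to 45 Nm."],  # stub retriever
    generate=lambda prompt: prompt,                                # stub LLM: echo the prompt
)
print(answer)
```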
Finally, remember the IVF knobs that set the speed/recall tradeoff: nlist (how many clusters you build) and nprobe (how many you search at query time).