Problem context. Robotics students quickly run into a reality check with large language models (LLMs): your robot may have a lot of knowledge (PDF manuals, maps, datasheets, mission logs), but the LLM has a limited context window. If you try to paste an entire PDF into the prompt, you’ll overflow the context window, pay more in latency and cost, and still get answers that miss key details.
Retrieval-Augmented Generation (RAG) is the standard fix: store your documents in a searchable memory, retrieve only the most relevant passages, then ask the LLM to answer using those passages. In robotics terms, RAG is like building a perception-to-action pipeline: you don’t process the entire world at full resolution every frame—you gate computation with attention.
RAG in one line: Query $\rightarrow$ Retrieve relevant chunks $\rightarrow$ Generate answer grounded in retrieved text.
A practical RAG system has these stages: load and parse the documents, chunk them, embed each chunk, index the embeddings, retrieve the top-matching chunks for a query, and generate an answer grounded in those chunks.
The two most important knobs (and easiest to mess up) are: (1) chunking and (2) retrieval/indexing. If chunking is bad, retrieval is bad. If retrieval is bad, generation hallucinates.
Chunking is your “state representation” for retrieval. The LLM can only answer using what you put in front of it, so if the relevant fact is split across chunks or buried in a huge chunk, the model won’t reliably use it.
Suppose your robot has a maintenance manual PDF and you ask: “What torque spec should I use for the wheel hub bolts?” If the torque table is separated from the wheel hub section by a page break and your chunking splits them, retrieval might fetch the wrong chunk, and the LLM will guess.
There are five common chunking styles. Here’s what each one means, when it works, and its failure modes.
Fixed-size chunking means: every chunk is ~N tokens/words (e.g., 200–400 words), often with overlap. This is the “quick baseline.”
# Fixed-size-ish chunking with overlap (simple baseline)
def fixed_word_chunks(words, chunk_size=200, overlap=40):
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+chunk_size]
        chunks.append(" ".join(chunk))
        i += max(1, chunk_size - overlap)  # slide forward, keeping `overlap` words of context
    return chunks
with open("manual.txt", "r", encoding="utf-8") as f:
    text = f.read()
words = text.split()
chunks = fixed_word_chunks(words, chunk_size=220, overlap=50)
print(len(chunks), "chunks")
Robotics analogy: fixed-size chunking is like uniform downsampling of a point cloud. It’s fast, but it ignores structure (edges, planes, objects).
Semantic chunking tries to keep “ideas” together. A common trick: split into sentences, embed each sentence, then merge adjacent sentences while they’re “similar enough” (above a threshold).
# Semantic-ish chunking sketch: merge adjacent sentences if embeddings are similar
# (Uses sentence-transformers; install: pip install sentence-transformers)
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, model_name="all-MiniLM-L6-v2", sim_threshold=0.75, max_sents=10):
    model = SentenceTransformer(model_name)
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks = []
    cur = [sentences[0]]
    cur_emb = embs[0]
    for i in range(1, len(sentences)):
        sim = float(np.dot(cur_emb, embs[i]))  # cosine similarity, since embeddings are normalized
        if sim >= sim_threshold and len(cur) < max_sents:
            cur.append(sentences[i])
            # update the running chunk embedding (simple average, re-normalized)
            cur_emb = (cur_emb * (len(cur) - 1) + embs[i]) / len(cur)
            cur_emb = cur_emb / np.linalg.norm(cur_emb)
        else:
            chunks.append(" ".join(cur))
            cur = [sentences[i]]
            cur_emb = embs[i]
    chunks.append(" ".join(cur))
    return chunks
This is the type your notes describe: “Semantic then … if they are similar add them in … threshold.”
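Note that semantic_chunks expects a list of sentences as input. A minimal regex splitter can supply one; this is a naive sketch (it will mis-split abbreviations like "Fig. 3"), not a production sentence segmenter:

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., !, or ? when followed by whitespace
    # and a capital letter. Known limitation: abbreviations get split too.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

print(split_sentences("Tighten the bolts. Use 45 Nm torque. Check again."))
```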
Recursive chunking tries coarse splits first (paragraphs/sections), then recursively splits the pieces that are too big. It often gives a nice balance: more structure-aware than fixed-size, cheaper than pure semantic merging.
# Simple recursive splitter: try double-newline, then newline, then sentence-ish, then hard cut
def recursive_split(text, max_chars=1200):
    seps = ["\n\n", "\n", ". "]  # note: splitting on ". " drops the period from each piece
    pieces = [text]
    for sep in seps:
        new_pieces = []
        for p in pieces:
            if len(p) <= max_chars:
                new_pieces.append(p)  # small enough: keep as-is
            else:
                new_pieces.extend([x.strip() for x in p.split(sep) if x.strip()])
        pieces = new_pieces
    # hard character cut for any leftover oversized pieces
    final = []
    for p in pieces:
        if len(p) <= max_chars:
            final.append(p)
        else:
            for i in range(0, len(p), max_chars):
                final.append(p[i:i+max_chars])
    return final
Structural chunking uses headings like “Introduction”, “Method”, “Results”, etc. This is excellent for well-formatted documents, because you get human-readable chunks like: “Intro chunk”, “Method chunk 2”, and so on (exactly what your notes mention).
# Structural chunking sketch: split plain text (e.g., converted from a PDF)
# on heading-like lines: ALL-CAPS runs or numbered sections such as "2." / "2.1"
import re

def heading_chunks(text):
    pattern = r"\n(?=(?:\d+\.\d+|\d+\.|[A-Z][A-Z\s]{6,})\n)"
    parts = [p.strip() for p in re.split(pattern, text) if p.strip()]
    return parts
LLM-based chunking: you can ask an LLM to segment a document into coherent chunks, but you pay extra tokens and you may get slightly different chunk boundaries each run.
# Prompt idea (pseudo): ask an LLM to produce JSON chunks with titles and text.
# Use temperature=0 for more deterministic behavior.
SYSTEM: "You are a document segmenter. Return JSON only."
USER:
"Split the following text into coherent chunks suitable for retrieval.
Each chunk: {title, content}. Keep content < 250 words. Preserve tables as one chunk when possible.
TEXT:
... (paste text) ..."
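Whatever model you call (the LLM call itself is yours to fill in), you still need to parse and validate the JSON it returns, since LLM output is not guaranteed to comply with the prompt. A defensive sketch:

```python
import json

def parse_chunk_json(raw):
    # Parse the LLM's JSON output into a list of {"title", "content"} dicts.
    # Validate defensively: keep only well-formed entries, drop the rest.
    data = json.loads(raw)
    chunks = []
    for item in data:
        if isinstance(item, dict) and "title" in item and "content" in item:
            chunks.append({"title": str(item["title"]), "content": str(item["content"])})
    return chunks

# Example response (hypothetical model output):
raw = '[{"title": "Wheel hub", "content": "Torque the hub bolts to 45 Nm."}]'
print(parse_chunk_json(raw))
```

In practice you would also wrap `json.loads` in a try/except and retry the LLM call on malformed output.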
After chunking, we embed each chunk into a vector $\mathbf{e} \in \mathbb{R}^d$. Typical retrieval compares a query embedding $\mathbf{q}$ to chunk embeddings $\mathbf{e}_i$ via dot product (often equivalent to cosine similarity if vectors are normalized):
$$ s_i = \mathbf{q}^\top \mathbf{e}_i \quad\text{(dot product)} $$

The system returns the top-$k$ chunks with the largest scores $s_i$. Your notes literally say: “embedding query dotproduct … top get chunks.”
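A quick NumPy check of that parenthetical claim: after normalizing two vectors to unit length, their dot product equals the cosine similarity of the raw vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=3)

# Cosine similarity of the raw vectors
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after normalizing each vector to unit length
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = a_hat @ b_hat

assert np.isclose(cos, dot)  # identical up to floating-point error
```

This is why embedding libraries offer a normalize option: once vectors are unit-length, cheap inner-product search is cosine search.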
# Embed chunks and a query with sentence-transformers
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

query = "What is the torque spec for the wheel hub bolts?"
q = model.encode([query], normalize_embeddings=True)[0]

scores = chunk_embeddings @ q  # dot product against every chunk
topk = scores.argsort()[-5:][::-1]  # indices of the 5 highest scores
for idx in topk:
    print(scores[idx], chunks[idx][:200], "...\n")
If you have a small dataset, you can brute-force compute $s_i$ for all chunks. But with large PDFs, many documents, or long-running robots that accumulate logs, brute force gets slow. So we use a vector index.
IVF Flat (Inverted File index + flat storage). Intuition: learn nlist coarse centroids (typically via k-means), which partition embedding space into Voronoi cells (one cell per centroid). Each centroid owns an inverted list of vectors assigned to it.
At query time, we first find the nearest centroids to the query (using the coarse quantizer), then compute exact similarities only within those selected lists. This reduces the number of candidate vectors we score compared to brute force.
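To make that mechanism concrete, here is a toy NumPy-only IVF-Flat: a few k-means (Lloyd) iterations build the coarse centroids and inverted lists, and search probes only the nearest cells. This is an illustration under the assumption of unit-normalized embeddings (so inner product equals cosine), not a replacement for FAISS:

```python
import numpy as np

def build_ivf(X, nlist=4, iters=10, seed=0):
    # Coarse quantizer: a few Lloyd (k-means) iterations over the vectors.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmax(X @ centroids.T, axis=1)  # nearest centroid by dot product
        for c in range(nlist):
            members = X[assign == c]
            if len(members):
                m = members.mean(axis=0)
                centroids[c] = m / np.linalg.norm(m)
    assign = np.argmax(X @ centroids.T, axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(nlist)}  # inverted lists
    return centroids, lists

def ivf_search(q, centroids, lists, X, k=3, nprobe=2):
    # Probe the nprobe closest cells, then score exactly within them.
    probe = np.argsort(centroids @ q)[-nprobe:]
    cand = np.concatenate([lists[c] for c in probe if len(lists[c])])
    scores = X[cand] @ q
    order = np.argsort(scores)[-k:][::-1]
    return cand[order], scores[order]

# Demo on synthetic unit vectors: querying with a stored vector should return it first.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
centroids, lists = build_ivf(X)
ids, scores = ivf_search(X[0], centroids, lists, X)
print(ids[0], scores[0])
```

Only the vectors in the probed cells get scored, which is exactly the speedup (and the recall risk, if the true match lives in an unprobed cell).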
# FAISS IVF Flat example (fast ANN search)
# pip install faiss-cpu
import faiss
import numpy as np

# Suppose we have embeddings as a float32 matrix of shape (N, d)
X = np.asarray(chunk_embeddings, dtype=np.float32)
N, d = X.shape

nlist = 256  # number of coarse clusters
quantizer = faiss.IndexFlatIP(d)  # inner product (dot product)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

# Train on data (required for IVF: learns the coarse centroids).
# k-means needs enough data; FAISS warns if you train on too few points per centroid.
index.train(X)
index.add(X)

# Query
k = 5
qv = np.asarray([q], dtype=np.float32)  # shape (1, d)
index.nprobe = 8  # how many clusters to probe (speed/recall tradeoff)
scores, ids = index.search(qv, k)
for score, idx in zip(scores[0], ids[0]):
    print(score, chunks[idx][:200], "...\n")
Robotics analogy: IVF is like doing a coarse global localization first (which “region” of the map am I in?), then doing fine pose estimation inside that region.
The generation step is where many RAG systems fail: they retrieve good chunks, then the prompt doesn’t force grounding. A robust prompt style is:
# Prompt template idea (works with most LLM APIs)
def make_prompt(question, retrieved_chunks):
    context = "\n\n".join([f"[Chunk {i}] {c}" for i, c in enumerate(retrieved_chunks)])
    return f"""
You are a robotics assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer (include chunk numbers as citations like [Chunk 2]):
""".strip()
# End-to-end skeleton (retrieval + prompt). The LLM call is pseudocode.
def rag_answer(question, index, chunks, embed_model, top_k=5):
    q = embed_model.encode([question], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, top_k)
    retrieved = [chunks[i] for i in ids[0] if i != -1]  # FAISS returns -1 for "no result"
    prompt = make_prompt(question, retrieved)
    # llm_response = llm.generate(prompt)  # replace with your LLM call
    # return llm_response
    return prompt, retrieved, scores[0], ids[0]
# Example usage:
# prompt, retrieved, scores, ids = rag_answer("What is the torque spec ...", index, chunks, model)
Here’s a robotics-flavored system design that matches how students actually build these: the perception stack identifies what the robot is looking at (say, a wheel hub), that label plus the user’s question forms the retrieval query, RAG fetches the relevant manual passages, and the LLM composes a grounded, cited answer the robot can act on.

The key point: don’t ask the LLM to “know” the world. Let perception determine what we’re looking at; let RAG supply facts; let the LLM do language + summarization.
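That division of labor can be wired up in a few lines. In this sketch, `retrieve` and `generate` are hypothetical stand-ins for your vector-index search and LLM call; the stubs in the usage example exist only so the flow can run:

```python
def robot_qa(detected_label, question, retrieve, generate, top_k=3):
    # Perception supplies what we're looking at; RAG supplies the facts;
    # the LLM only phrases an answer grounded in the retrieved text.
    query = f"{detected_label}: {question}"          # fold the perception label into the query
    chunks = retrieve(query, top_k)                  # your retriever (e.g., FAISS top-k)
    context = "\n\n".join(f"[Chunk {i}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using ONLY the context below. If the answer is missing, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                          # your LLM call

# Stub wiring, just to show the data flow end to end:
answer = robot_qa(
    "wheel hub", "What is the torque spec?",
    retrieve=lambda query, k: ["Torque the hub bolts to 45 Nm."],  # stub retriever
    generate=lambda prompt: prompt,                                # stub LLM: echo the prompt
)
print(answer)
```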
Finally, remember the IVF knobs that set the speed/recall tradeoff: nlist (how many clusters you build) and nprobe (how many you search at query time).