The Problem
LLMs confidently fabricate facts when asked about domain-specific or recent information they weren’t trained on. Providing the entire knowledge base in every prompt is impractical — it exceeds context limits and wastes tokens. The model needs the right documents at the right time, not all documents all the time.
The Solution
RAG separates retrieval from generation. An Embedder converts the user’s question into a dense vector. A Retriever finds the top-K most semantically similar document chunks. A PromptBuilder formats those chunks as context. Only then does the LLM generate a response — grounded in retrieved evidence rather than parametric memory. In Go, a RAGPipeline wires these three steps into one Answer() call.
Structure
Question Enters the Pipeline
The caller passes a natural-language question to RAGPipeline.Answer(). The pipeline owns the full retrieval-generation flow; callers don't interact with the embedder or retriever directly.
```mermaid
sequenceDiagram
    participant User
    participant Pipeline as RAGPipeline
    participant Embedder
    participant Retriever as InMemoryRetriever
    participant Builder as PromptBuilder
    participant LLM
    User->>Pipeline: Answer(question)
    Pipeline->>Embedder: Embed(question)
    Embedder-->>Pipeline: queryVector
    Pipeline->>Retriever: Retrieve(question, topK)
    Retriever-->>Pipeline: []Chunk (ranked)
    Pipeline->>Builder: buildPrompt(question, chunks)
    Builder-->>Pipeline: groundedPrompt
    Pipeline->>LLM: Generate(groundedPrompt)
    LLM-->>User: Answer (with citations)
```
Implementation
```go
package main

import "context"

// Chunk is a retrieved document fragment with its source reference.
type Chunk struct {
	ID      string
	Content string
	Source  string
	Score   float32 // cosine similarity, higher is better
}

// Embedder converts a text string into a dense vector.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// Retriever fetches the top-K most relevant chunks for a query string.
type Retriever interface {
	Retrieve(ctx context.Context, query string, topK int) ([]Chunk, error)
}

// Document is a source document indexed in the retriever.
type Document struct {
	ID        string
	Content   string
	Source    string
	Embedding []float32
}
```
Real-World Analogy
A lawyer preparing a brief: rather than memorizing every case ever decided, they search the legal database for the three most relevant precedents and cite them directly. The closing argument is grounded in retrieved evidence, not recollection. RAG does the same for LLMs.
Pros and Cons
| Pros | Cons |
|---|---|
| Dramatically reduces hallucination on factual questions | Retrieval quality determines answer quality — garbage in, garbage out |
| Knowledge can be updated without retraining the model | Embedding every document adds upfront indexing cost |
| Retrieved chunks provide a citation trail for auditability | Top-K may miss relevant context if the query is ambiguous |
| Retriever interface makes swapping vector stores a one-line change | Cosine similarity on stub embeddings returns meaningless results in tests |
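The cosine-similarity scoring mentioned in the table's last row is the ranking function an in-memory retriever would typically use. A minimal sketch, guarding against the zero-vector case:

```go
package main

import "math"

// cosine returns the cosine similarity of two equal-length vectors.
// Zero vectors score 0 rather than producing NaN.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}
```

As the table warns, this math is only as good as the embeddings: on stub vectors the scores are syntactically valid but semantically meaningless.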
Best Practices
- Chunk documents at semantic boundaries (paragraphs, sections) not fixed byte sizes — retrieval quality depends heavily on chunk coherence.
- Store `Source` in every `Chunk` and include it in the prompt so the LLM can cite its references.
- Set `topK` conservatively (3–5) — more chunks don't always mean better answers, and they consume context budget fast.
- Cache embeddings for documents that don't change; re-embedding on every query is wasteful.
- Use a real embedding model in tests against a small fixture corpus, not a stub — RAG bugs are usually retrieval bugs, not generation bugs.
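The first practice above, chunking at semantic boundaries, can be as simple as splitting on blank lines. This hypothetical helper is one way to do it; `Chunk` is repeated from the Implementation section so the sketch compiles on its own:

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk repeats the article's struct so this sketch compiles alone.
type Chunk struct {
	ID, Content, Source string
	Score               float32
}

// chunkByParagraph splits a document on blank lines, producing one
// chunk per paragraph rather than fixed-size byte windows, and keeps
// the source on every chunk so it can be cited later.
func chunkByParagraph(docID, content, source string) []Chunk {
	var chunks []Chunk
	for i, para := range strings.Split(content, "\n\n") {
		para = strings.TrimSpace(para)
		if para == "" {
			continue
		}
		chunks = append(chunks, Chunk{
			ID:      fmt.Sprintf("%s-%d", docID, i),
			Content: para,
			Source:  source,
		})
	}
	return chunks
}
```

A production chunker would also respect section headings and cap chunk length, but paragraph splits are a reasonable baseline.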
When to Use
- Domain-specific Q&A over private documentation, codebases, or knowledge bases.
- Any application where the model must cite sources or avoid fabrication.
- Systems where the knowledge base changes frequently (news, support tickets, API docs).
When NOT to Use
- General knowledge questions where the base model’s training data is sufficient.
- Real-time applications where retrieval latency is unacceptable.
- Tiny knowledge bases — a few documents can simply be included in the system prompt.