The Problem
LLMs confidently fabricate facts when asked about domain-specific or recent information they weren’t trained on. Providing the entire knowledge base in every prompt is impractical — it exceeds context limits and wastes tokens. The model needs the right documents at the right time, not all documents all the time.
The Solution
RAG separates retrieval from generation. An Embedder converts the user’s question into a dense vector. A Retriever finds the top-K most semantically similar document chunks. A PromptBuilder formats those chunks as context. Only then does the LLM generate a response — grounded in retrieved evidence rather than parametric memory. In Go, a RAGPipeline wires these three steps into one Answer() call.
Structure
Question Enters the Pipeline
The caller passes a natural-language question to RAGPipeline.Answer(). The pipeline owns the full retrieval-generation flow; callers don't interact with the embedder or retriever directly.
```mermaid
sequenceDiagram
    participant User
    participant Pipeline as RAGPipeline
    participant Embedder
    participant Retriever as InMemoryRetriever
    participant Builder as PromptBuilder
    participant LLM
    User->>Pipeline: Answer(question)
    Pipeline->>Embedder: Embed(question)
    Embedder-->>Pipeline: queryVector
    Pipeline->>Retriever: Retrieve(question, topK)
    Retriever-->>Pipeline: []Chunk (ranked)
    Pipeline->>Builder: buildPrompt(question, chunks)
    Builder-->>Pipeline: groundedPrompt
    Pipeline->>LLM: Generate(groundedPrompt)
    LLM-->>User: Answer (with citations)
```
Implementation
```go
package main

import "context"

// Chunk is a retrieved document fragment with its source reference.
type Chunk struct {
	ID      string
	Content string
	Source  string
	Score   float32 // cosine similarity, higher is better
}

// Embedder converts a text string into a dense vector.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// Retriever fetches the top-K most relevant chunks for a query string.
type Retriever interface {
	Retrieve(ctx context.Context, query string, topK int) ([]Chunk, error)
}

// Document is a source document indexed in the retriever.
type Document struct {
	ID        string
	Content   string
	Source    string
	Embedding []float32
}
```
Real-World Analogy
A lawyer preparing a brief: rather than memorizing every case ever decided, they search the legal database for the three most relevant precedents and cite them directly. The closing argument is grounded in retrieved evidence, not recollection. RAG does the same for LLMs.
Pros and Cons
| Pros | Cons |
|---|---|
| Dramatically reduces hallucination on factual questions | Retrieval quality determines answer quality — garbage in, garbage out |
| Knowledge can be updated without retraining the model | Embedding every document adds upfront indexing cost |
| Retrieved chunks provide a citation trail for auditability | Top-K may miss relevant context if the query is ambiguous |
| Retriever interface makes swapping vector stores a one-line change | Cosine similarity on stub embeddings returns meaningless results in tests |
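The cosine-similarity scoring mentioned in the table's last row is the ranking function an in-memory retriever would typically use. A minimal sketch, guarding against the zero-vector case:

```go
package main

import "math"

// cosine returns the cosine similarity of two equal-length vectors.
// Zero vectors score 0 rather than producing NaN.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}
```

As the table warns, this math is only as good as the embeddings: on stub vectors the scores are syntactically valid but semantically meaningless.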
Best Practices
- Chunk documents at semantic boundaries (paragraphs, sections) not fixed byte sizes — retrieval quality depends heavily on chunk coherence.
- Store `Source` in every `Chunk` and include it in the prompt so the LLM can cite its references.
- Set `topK` conservatively (3–5) — more chunks don't always mean better answers, and they consume context budget fast.
- Cache embeddings for documents that don't change; re-embedding on every query is wasteful.
- Use a real embedding model in tests against a small fixture corpus, not a stub — RAG bugs are usually retrieval bugs, not generation bugs.
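The first practice above, chunking at semantic boundaries, can be as simple as splitting on blank lines. This hypothetical helper is one way to do it; `Chunk` is repeated from the Implementation section so the sketch compiles on its own:

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk repeats the article's struct so this sketch compiles alone.
type Chunk struct {
	ID, Content, Source string
	Score               float32
}

// chunkByParagraph splits a document on blank lines, producing one
// chunk per paragraph rather than fixed-size byte windows, and keeps
// the source on every chunk so it can be cited later.
func chunkByParagraph(docID, content, source string) []Chunk {
	var chunks []Chunk
	for i, para := range strings.Split(content, "\n\n") {
		para = strings.TrimSpace(para)
		if para == "" {
			continue
		}
		chunks = append(chunks, Chunk{
			ID:      fmt.Sprintf("%s-%d", docID, i),
			Content: para,
			Source:  source,
		})
	}
	return chunks
}
```

A production chunker would also respect section headings and cap chunk length, but paragraph splits are a reasonable baseline.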
When to Use
- Domain-specific Q&A over private documentation, codebases, or knowledge bases.
- Any application where the model must cite sources or avoid fabrication.
- Systems where the knowledge base changes frequently (news, support tickets, API docs).
When NOT to Use
- General knowledge questions where the base model’s training data is sufficient.
- Real-time applications where retrieval latency is unacceptable.
- Tiny knowledge bases — a few documents can simply be included in the system prompt.