Ask a general language model about your organisation's 2026 funding rules and it will either say it does not know, or — worse — invent a plausible answer. Retrieval-augmented generation, or RAG, is the standard cure for both.
01 The two problems
Large language models have two structural weaknesses. First, their knowledge is frozen at the moment training stopped, so anything newer is invisible. Second, they are trained to produce fluent text, not true text, so when they lack a fact they often hallucinate one with total confidence.
For casual chat this is tolerable. For a help desk, a legal assistant or a school knowledge base, it is disqualifying. RAG addresses both without retraining the model.
02 The core idea
The insight is simple: do not ask the model to recall a fact from memory — give it the relevant text and ask it to answer from that. It is the difference between an exam from memory and an open-book exam. The model's reasoning stays; the facts come from a trusted source you control.
03 How retrieval works
To fetch the right passage, RAG relies on embeddings — numerical fingerprints of meaning. Every chunk of your documents is converted to a vector and stored. When a question arrives, it is embedded too, and the system finds the chunks whose vectors sit closest in meaning, not just those sharing keywords.
Similarity scores between a user question and stored chunks. "Forgot login" ranks high despite sharing no words with "reset password" — because their meanings are close.
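As a rough illustration of "closest in meaning", the sketch below ranks stored chunks by cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are hand-written stand-ins, not output from any real embedding model (which would produce hundreds or thousands of dimensions), and the chunk titles and the `top_k` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written vectors standing in for real embeddings.
CHUNK_VECTORS = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Applying for travel funding": [0.1, 0.9, 0.2],
    "Cafeteria opening hours": [0.0, 0.2, 0.9],
}


def top_k(query_vector, chunk_vectors, k=2):
    """Return the k (score, chunk) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in chunk_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Made-up stand-in for the embedding of "I forgot my login".
QUERY_VECTOR = [0.8, 0.2, 0.1]

ranked = top_k(QUERY_VECTOR, CHUNK_VECTORS)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

With these made-up numbers the password chunk ranks first even though the query text shares no words with it, because the ranking only ever looks at the vectors.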
04 The RAG pipeline
In production the flow is a short, fixed pipeline. The question is embedded, the closest chunks are retrieved, those chunks are pasted into the prompt alongside the question, and the model generates an answer grounded in them — usually with citations back to the source.
Only the retrieval store changes when your documents change; the model is never retrained. Update a policy file and, as soon as it has been re-embedded and stored, the next answer reflects it.
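The four steps can be sketched in a few lines. Everything named here is an assumption made for illustration: `toy_embed` is a keyword counter standing in for a real embedding model, `echo_model` stands in for a language model call and simply returns the prompt it receives, and the documents are invented. The point is the shape of the flow, not a production design.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Store = List[Tuple[Vector, str]]

VOCAB = ["funding", "deadline", "password"]  # hypothetical toy vocabulary


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: counts vocabulary words."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_prompt(question: str, chunks: List[str]) -> str:
    """Paste the retrieved chunks into the prompt next to the question."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered sources below and cite them by "
        "number. If the sources do not contain the answer, say you do "
        "not know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def answer(
    question: str,
    store: Store,
    embed: Callable[[str], Vector],
    generate: Callable[[str], str],
    k: int = 1,
) -> str:
    query = embed(question)                                   # 1. embed
    ranked = sorted(store, key=lambda item: dot(query, item[0]), reverse=True)
    chunks = [text for _, text in ranked[:k]]                 # 2. retrieve
    prompt = build_prompt(question, chunks)                   # 3. augment
    return generate(prompt)                                   # 4. generate


def echo_model(prompt: str) -> str:
    """Stand-in for a language model call: returns the prompt it was given."""
    return prompt


documents = [
    "The funding deadline is 1 March.",          # invented example text
    "Reset your password from the login page.",  # invented example text
]
store = [(toy_embed(doc), doc) for doc in documents]
first = answer("When is the funding deadline?", store, toy_embed, echo_model)

# Changing a document only touches the store; the model stand-in is unchanged.
updated = "The funding deadline is 15 April."
store[0] = (toy_embed(updated), updated)
second = answer("When is the funding deadline?", store, toy_embed, echo_model)
```

Because `echo_model` returns the prompt, `first` and `second` show exactly what a real model would be given: the second call carries the updated deadline without any change to the generation step.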
05 Why it beats fine-tuning
A common instinct is to "train the model on our data" via fine-tuning. For factual knowledge, RAG usually wins. It is cheaper, it updates instantly when documents change, it can cite its sources, and it keeps sensitive data in a database you control rather than baked irretrievably into model weights. Fine-tuning still has its place — for tone, format and skills — but not for facts that change.
06 Common pitfalls
RAG is not magic. If the retrieval step fetches the wrong passage, the model will faithfully answer from the wrong context. Chunks that are too large bury the answer; too small and they lose context. And the model can still ignore the provided text if the prompt is sloppy. Good RAG is mostly good retrieval and chunking, with the language model as the final, fluent step.
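One common way to manage the chunk-size trade-off is to split text into fixed-size windows that overlap slightly, so a sentence cut at a boundary still appears whole in a neighbouring chunk. Overlap is not something the article prescribes; the sketch below is one possible approach, and the `chunk_words` helper with its `size` and `overlap` defaults is hypothetical. Suitable values depend on the documents and the embedding model.

```python
from typing import List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    words = text.split()
    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, size=4, overlap=1)
for piece in pieces:
    print(piece)
```

Here each chunk repeats the last word of the previous one. Raising `size` keeps more context per chunk at the cost of diluting the answer; lowering it does the opposite.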
What to remember
- LLMs have frozen knowledge and hallucinate when unsure; RAG addresses both.
- RAG retrieves relevant text first, then answers from it — an open-book exam.
- Retrieval uses embeddings to match meaning, not just keywords.
- The pipeline is: embed → retrieve → augment → generate.
- For changing facts, RAG beats fine-tuning on cost, freshness and citations.
- Quality depends mostly on retrieval and chunking, not the model.
