PEFT and RAG: two ways to make a model yours
There are two questions you’ll keep hitting once you move past prompting a base model: how do I make it behave the way I want, and how do I make it know things it doesn’t? These are different problems with different answers, and conflating them is a common and costly mistake. The answer to the first is usually PEFT (parameter-efficient fine-tuning). The answer to the second is usually RAG (retrieval-augmented generation).
The one-line distinction
Keep this in your head for the whole post:
PEFT changes the model’s weights to change its behavior. RAG changes the model’s context to change what it knows right now.
Fine-tuning bakes new behavior into the parameters. Retrieval injects new facts into the prompt at runtime. One edits the brain; the other hands it a briefing document before it answers. Almost every “should I fine-tune or do RAG?” argument dissolves once you ask: am I missing behavior, or am I missing knowledge?
What PEFT is
A modern LLM has billions of parameters. Full fine-tuning updates all of them on your data — which works, but is expensive (you need enough GPU memory to hold the model, its gradients, and optimizer state) and produces a full-size copy of the model for every task you train.
Parameter-efficient fine-tuning is the family of techniques that get most of the benefit by training a tiny fraction of the parameters and freezing the rest. The base model stays untouched; you train a small set of new weights that steer it.
The dominant method is LoRA (Low-Rank Adaptation). The intuition: when you fine-tune, the change to a big weight matrix turns out to be approximately low-rank, meaning it can be approximated by the product of two much smaller matrices. So instead of updating a 4096×4096 matrix (about 16.8M numbers), you train two skinny matrices like 4096×8 and 8×4096 (about 65K numbers in total) and add their product to the frozen original. You’re training well under 1% of the parameters and often getting close to full-fine-tune quality.
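To make the arithmetic concrete, here is a minimal sketch in plain Python. The 4096 and 8 come from the paragraph above; the toy 4x4 matrices, their values, and the helper name matmul are illustrative assumptions, not part of any library.

```python
# Parameter counts for one 4096x4096 weight matrix adapted with rank-8 factors.
d, r = 4096, 8
full_params = d * d            # numbers updated by full fine-tuning
lora_params = d * r + r * d    # numbers trained by LoRA (two skinny matrices)
fraction = lora_params / full_params

print(full_params, lora_params, f"{fraction:.2%}")  # 16777216 65536 0.39%


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy-sized illustration of the update itself (4x4 weights, rank 1).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # frozen base
B = [[0.5], [0.0], [0.0], [0.0]]   # 4x1, trainable
A = [[0.0, 2.0, 0.0, 0.0]]         # 1x4, trainable

delta = matmul(B, A)               # 4x4 update built from only 8 trained numbers
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

Note that W itself is never modified; only B and A would receive gradients. Real LoRA implementations also scale the product by alpha / r, which this sketch leaves out for brevity.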
A few variants worth knowing by name:
- LoRA — the baseline; small low-rank adapters added to frozen weights.
- QLoRA — quantize the frozen base model to 4-bit, then train LoRA adapters on top. This is what lets people fine-tune a large model on a single consumer GPU.
- Prefix / prompt tuning — instead of adapter matrices, train a small set of virtual “soft prompt” tokens prepended to every input. Even fewer parameters, though usually less expressive than LoRA.
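A rough back-of-the-envelope calculation shows why 4-bit quantization matters for QLoRA. The 7-billion-parameter model size is a hypothetical example, and the function name weight_memory_gb is made up for this sketch. It counts the frozen weights only; activations, adapter gradients, optimizer state, and quantization metadata all add to the real footprint.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # hypothetical 7-billion-parameter base model

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the frozen weights shrink to a quarter of their 16-bit size, which is the difference between needing a data-center card and fitting on a consumer one.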
The practical payoffs: adapters are tiny (megabytes, not gigabytes), so you can keep many of them and hot-swap per task on one base model; training is cheap enough to do on modest hardware; and because the base weights are frozen, you are far less likely to catastrophically forget the model’s general abilities than with aggressive full fine-tuning, and you can always get the original model back by removing the adapter.
What RAG is
RAG doesn’t touch the model at all. It changes what goes into the prompt.
The pipeline, end to end:
- Index (offline). Take your documents, split them into chunks, run each chunk through an embedding model to get a vector, and store those vectors in a vector database.
- Retrieve (at query time). Embed the user’s question with the same model, find the chunks whose vectors are nearest to it, and pull back the top-k most relevant ones.
- Augment. Paste those chunks into the prompt as context, alongside the question and an instruction like “answer using only the context below.”
- Generate. The LLM answers, grounded in the retrieved text.
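The first three steps can be sketched in a few lines of plain Python. Everything here is a toy stand-in: embed is a bag-of-words counter rather than a real embedding model, index is an in-memory list rather than a vector database, and the documents are invented. The generate step is left out because it depends on whichever LLM you call.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Index (offline): embed every chunk once and store it with its text.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Standard shipping takes three to five business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, k=2):
    """Retrieve (query time): return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Augment: paste numbered chunks into the prompt so answers can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


question = "Are refunds available after purchase?"
top = retrieve(question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping the toy embed for a real embedding model and the list for a vector store changes the quality of retrieval, not the shape of the pipeline.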
The model never learned your data. It’s reading it, fresh, every single time it answers. That one fact explains every strength and weakness of RAG:
- Update a document and re-embed it, and the system “knows” the new version on the next query — no retraining.
- You can show which chunks produced an answer, so you get citations and auditability mostly for free.
- It reduces hallucination by giving the model real source text instead of asking it to recall from weights, though the model can still misread or ignore what it was given.
- But it’s only as good as its retrieval. If the right chunk isn’t fetched, the model can’t use it — “garbage retrieved, garbage generated.” And every query pays the latency and token cost of stuffing context into the prompt.
When to use which
This is the part that actually matters. Map your problem to the right tool.
Reach for PEFT (fine-tuning) when the gap is behavior:
- You need a specific format, tone, or style consistently — always emit valid JSON in your schema, always write in your brand voice, always follow a fixed reasoning structure.
- You’re teaching a skill or task, not facts — classification, structured extraction, a domain’s way of reasoning, function-calling conventions.
- You want to internalize a narrow domain’s language — legal, medical, a specific codebase’s idioms — so the model stops sounding generic.
- You need lower latency / fewer tokens at inference — the behavior is baked in, so you don’t spend prompt budget re-explaining it every call.
Reach for RAG when the gap is knowledge:
- The information changes often — prices, policies, docs, tickets, this week’s data. You can’t retrain every time a wiki page changes.
- You need factual accuracy with citations — “where did this answer come from?” must be answerable.
- The knowledge base is large — far more than fits in a context window or could be memorized reliably.
- The data is private or per-user and you want it gated by access control at query time, not frozen into shared weights.
A quick decision test:
If the failure is “it answered in the wrong way,” fine-tune. If the failure is “it answered with the wrong facts,” use RAG.
And the honest default: try RAG (and good prompting) first. It’s usually cheaper to build, needs no training run or training GPUs, and is far easier to update and debug. Fine-tuning is the move when prompting and retrieval have hit a ceiling on behavior you can’t prompt your way out of.
When to use both
The two aren’t rivals — they target different axes, so the strongest systems combine them. Fine-tune the model so it reliably follows your format and reasons in your domain’s style, and bolt on RAG so it answers from current, citable facts. A support assistant is the canonical example: PEFT teaches it your tone, your escalation rules, and your structured-response format; RAG feeds it the live help-center articles and the customer’s account state. Neither alone gets you there — behavior without knowledge is fluent and wrong; knowledge without behavior is right and off-brand.
How to build PEFT
The mechanics with LoRA/QLoRA, at the level you need to start:
- Build the dataset. This is 80% of the work. You need input→output examples that demonstrate the behavior you want — typically a few hundred to a few thousand high-quality, consistent examples. Quality and consistency beat volume; a thousand clean examples outperform ten thousand noisy ones.
- Pick a base model and rank. Choose an open base model that fits your hardware. The key LoRA knob is rank (r): higher rank means more trainable capacity (8–16 is a common starting range). Set the LoRA alpha (scaling) and choose which layers to target (attention projections are typical).
- Train. With QLoRA, load the base model in 4-bit, attach adapters, and train for a small number of epochs (often 1–3; too many and it overfits and forgets). Watch a held-out validation set. Tools like Hugging Face peft + transformers, or Axolotl/Unsloth, do the heavy lifting.
- Evaluate. Test on examples it never saw, and specifically check it hasn’t regressed on general ability while learning the new behavior.
- Serve. Either merge the adapter into the base weights for a standalone model, or keep adapters separate and hot-swap them per task at inference — one base model, many cheap adapters.
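The two serving options in the last step compute the same thing, which a toy example makes easy to see. This is a sketch in plain Python with invented 2x2 weights and two hypothetical adapters named support_tone and json_output; real LoRA also applies an alpha / r scaling factor, omitted here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


base_W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (toy 2x2)

# Hypothetical rank-1 adapters, each stored as a (B, A) pair.
adapters = {
    "support_tone": ([[1.0], [0.0]], [[0.0, 0.5]]),
    "json_output": ([[0.0], [1.0]], [[2.0, 0.0]]),
}


def forward_hot_swap(x, adapter_name):
    """Keep the adapter separate: y = x @ W + (x @ B) @ A."""
    B, A = adapters[adapter_name]
    return add(matmul(x, base_W), matmul(matmul(x, B), A))


def merge(adapter_name):
    """Fold the adapter into a standalone weight matrix: W' = W + B @ A."""
    B, A = adapters[adapter_name]
    return add(base_W, matmul(B, A))


x = [[1.0, 2.0]]  # one input row vector

for name in adapters:
    swapped = forward_hot_swap(x, name)
    merged = matmul(x, merge(name))
    print(name, swapped, merged)
```

Merging gives you a single weight matrix with no extra work per request; hot-swapping keeps one copy of base_W in memory and picks the small (B, A) pair per request.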
The recurring failure mode is the dataset, not the algorithm. If the behavior is inconsistent in your examples, it’ll be inconsistent in the model.
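One cheap way to catch inconsistency before training is to lint the dataset mechanically. The sketch below assumes a hypothetical task where every output must be a JSON object with category and priority keys; the examples, the schema, and the function name check_example are all invented for illustration.

```python
import json

REQUIRED_KEYS = {"category", "priority"}  # hypothetical output schema

examples = [
    {"input": "My card was charged twice.",
     "output": '{"category": "billing", "priority": "high"}'},
    {"input": "How do I reset my password?",
     "output": '{"category": "account", "priority": "low"}'},
    {"input": "App crashes on launch.",
     "output": "category: bug, priority: high"},   # not JSON at all
    {"input": "Where is my invoice?",
     "output": '{"category": "billing"}'},         # missing a required key
]


def check_example(example):
    """Return a list of problems found in one training example's output."""
    try:
        parsed = json.loads(example["output"])
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return ["output is not a JSON object"]
    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    return []


report = {i: check_example(ex) for i, ex in enumerate(examples)}
bad = {i: problems for i, problems in report.items() if problems}
print(bad)
```

A check like this only covers format. Consistency of tone or reasoning still needs a human read of a sample, but fixing the mechanical problems first makes that read much faster.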
How to build RAG
The mechanics, and where the quality actually lives:
- Chunk well. Split documents into pieces that are self-contained but not too big: small enough to be specific, large enough to carry context. Respect structure (headings, paragraphs) rather than cutting at arbitrary character counts, and overlap chunks slightly so you don’t sever an idea mid-sentence. Bad chunking is one of the most common reasons RAG underperforms.
- Embed and store. Run chunks through an embedding model and load the vectors into a vector store (pgvector, Pinecone, Weaviate, Qdrant, FAISS — many options). Store the source text and metadata alongside each vector.
- Retrieve well. Top-k vector search is the baseline. The upgrades that matter in practice:
  - Hybrid search: combine semantic (vector) search with keyword (BM25) search, so exact terms like error codes and product names aren’t lost.
  - Reranking: fetch a larger candidate set, then use a cross-encoder reranker to reorder by true relevance before sending the top few to the model. This is often the single highest-leverage improvement.
  - Metadata filtering: constrain retrieval by date, source, or access permissions before ranking.
- Assemble the prompt. Insert the retrieved chunks with a clear instruction to answer only from the provided context, and say “I don’t know” if the answer isn’t there. Mind the context budget — more chunks isn’t always better.
- Generate and cite. Return the answer with references back to the source chunks so a human can verify it.
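The chunking advice in the first step can be sketched as a small function. This is one possible approach, not a standard: it splits on blank lines, packs whole paragraphs into chunks up to a character budget, and repeats the last paragraph of each chunk at the start of the next. The function name chunk_paragraphs and the parameters max_chars and overlap are made up for this example.

```python
def chunk_paragraphs(text, max_chars=200, overlap=1):
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of each chunk are repeated at the start of
    the next one. A paragraph is never split, so a chunk can exceed max_chars
    when a single paragraph (plus its overlap) is longer than the budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))        # close the full chunk
            current = current[-overlap:] if overlap else []
            candidate = current + [para]               # start next with overlap
        current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "para one.\n\npara two.\n\npara three."
pieces = chunk_paragraphs(doc, max_chars=22, overlap=1)
for piece in pieces:
    print(repr(piece))
```

Here "para two." appears in both chunks, so a question that depends on the boundary between the second and third paragraphs can still be answered from a single chunk.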
The lesson mirrors PEFT’s: the model is rarely the bottleneck. In RAG, retrieval quality is the system. Most “RAG doesn’t work” complaints are really “my retrieval is fetching the wrong chunks.”
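For the hybrid search upgrade, one common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion. The text above does not prescribe a fusion method, so treat this as one option among several. The document ids and both input rankings are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1, so items ranked well by several retrievers win.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic search results
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because it works on ranks rather than raw scores, this approach sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list is also a natural candidate set to hand to a reranker.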
The takeaway
PEFT and RAG answer two different questions. PEFT efficiently retrains a small slice of a model’s weights to change how it behaves — format, tone, skill, domain reasoning — without the cost of full fine-tuning. RAG leaves the model alone and injects current, relevant facts into its context at query time, with citations and no retraining.
Use PEFT when the gap is behavior; use RAG when the gap is knowledge; reach for RAG first because it’s cheaper and easier to keep correct; and combine them when you need a model that both acts like yours and knows what’s true right now. And in both, the model is the easy part — the dataset (for PEFT) and the retrieval (for RAG) are where the real work, and the real quality, live.