When teams first reach for fine-tuning, they're usually answering the wrong question. The right question is rarely "how do I teach a model new facts?" — it's almost always "how do I keep the model honest about facts that change?"
The shape of the problem
If your knowledge base updates once a year, fine-tuning is fine. If it updates more than once a quarter, you'll spend the rest of your career re-training. Retrieval skips that loop entirely: index the new document, ship the change in seconds.
    # A minimal retrieval loop
    q_emb = embed(query)
    hits = index.search(q_emb, k=8)
    context = rerank(hits, query)[:3]
    prompt = template.format(context=context, q=query)
    answer = llm.complete(prompt)
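The loop above leaves `embed`, `index`, `rerank`, `template`, and `llm` undefined. As a rough, self-contained sketch of the same flow, the version below swaps in toy parts that are all assumptions for illustration: a bag-of-words `embed`, a hypothetical in-memory `ToyIndex`, and a hypothetical `build_prompt` helper. It skips the re-ranker and the model call entirely. The point is only to show that adding one document changes what gets retrieved, with no re-training step.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyIndex:
    """Hypothetical in-memory index; a real system would use a vector store."""
    def __init__(self):
        self.docs = []  # list of (text, embedding) pairs

    def add(self, text):
        self.docs.append((text, embed(text)))

    def search(self, q_emb, k=8):
        ranked = sorted(self.docs, key=lambda d: cosine(q_emb, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_prompt(index, query, k=8, top=3):
    """Embed the query, search, keep the top hits, and fill a prompt."""
    hits = index.search(embed(query), k=k)
    context = hits[:top]  # no separate re-ranker in this sketch
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query

index = ToyIndex()
index.add("The refund window is 30 days from delivery.")
index.add("Support is available on weekdays.")
before = build_prompt(index, "what is the refund window", top=1)

# The "update": index one new document, nothing else changes.
index.add("As of this quarter the refund window is 60 days from delivery.")
after = build_prompt(index, "what is the refund window for this quarter", top=1)
```

Here `before` carries the 30-day document and `after` carries the 60-day one. Whitespace tokenization and raw word counts are far too crude for production; they just keep the example dependency-free.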
What you give up
Latency, mostly — retrieval adds 80–200ms before the model starts generating. You also surrender some elegance: now you have an index, a re-ranker, and a prompt template to maintain. The tradeoff is almost always worth it. Almost.
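Whether the retrieval overhead matters for your workload is easiest to settle by measuring it. The sketch below times each stage separately with `time.perf_counter`. The `fake_retrieve` and `fake_generate` functions are hypothetical stand-ins that sleep to simulate work, and the sleep durations are arbitrary, not measurements.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0

def fake_retrieve(query):
    # Hypothetical stand-in for embed + search + rerank; sleeps to simulate work.
    time.sleep(0.05)
    return ["doc about " + query]

def fake_generate(prompt):
    # Hypothetical stand-in for the model call.
    time.sleep(0.01)
    return "answer"

context, retrieval_ms = timed(fake_retrieve, "refund window")
answer, generation_ms = timed(fake_generate, "\n".join(context))
print(f"retrieval: {retrieval_ms:.0f} ms, generation: {generation_ms:.0f} ms")
```

Wrapping your real retrieval and generation calls in `timed` gives a per-request breakdown you can compare against your latency budget.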
Three places this breaks: (1) the corpus has no clean chunks, as with long-form contracts; (2) the queries don't look like the documents, the vocabulary-mismatch problem that search engines famously suffer from; (3) you need the model to generalize, not recite, and retrieval can't help you with that.
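For case (1), one common fallback, offered here as an assumption rather than a recommendation from the text, is to cut the document into fixed-size windows that overlap, so a clause split at one boundary still appears whole in a neighboring chunk. The `window_chunks` helper and its `size` and `overlap` parameters are hypothetical names for illustration.

```python
def window_chunks(words, size=50, overlap=10):
    """Split a list of words into fixed-size windows sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Tiny demonstration with 12 placeholder words.
words = [f"w{i}" for i in range(12)]
chunks = window_chunks(words, size=5, overlap=2)
```

Each chunk here starts with the last two words of the one before it. Overlap reduces boundary damage but does not remove it: a contract clause longer than `size` still ends up spread across chunks, which is why this case counts as a failure mode at all.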