01 Semantic Search Doesn't Understand Time
RAG works by finding documents that are semantically similar to your question, much like typing keywords into Google and receiving the most relevant matching articles. But it has no concept of recurrence or time-ordering. Ask it for a formulary position and it might surface the Q2 version when you need Q4, not because it made a logical error, but because semantic similarity alone doesn't account for dates.
Pharma contract operations data is fundamentally time-ordered and relational. A formulary from Q3 means something different than the same formulary from Q4. A chargeback error that recurs across three consecutive quarters means something different than a one-time anomaly. RAG architecture was not built to reason across these distinctions, and no amount of prompt engineering fixes that.
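To make the Q2-versus-Q4 problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Doc` record, the `retrieve` helper, the sample formulary text, and the token-overlap score, which is only a toy stand-in for embedding similarity. It shows that ranking on similarity alone can surface the wrong quarter, and that an explicit period filter has to be applied outside the similarity step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    year: int
    quarter: int  # 1-4


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of lowercase tokens."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(query, docs, year=None, quarter=None):
    """Rank docs by similarity, optionally restricted to one reporting period."""
    candidates = [
        d for d in docs
        if (year is None or d.year == year)
        and (quarter is None or d.quarter == quarter)
    ]
    return sorted(candidates, key=lambda d: overlap_score(query, d.text), reverse=True)


# Hypothetical documents: the same formulary entry in two different quarters.
docs = [
    Doc("formulary position for drug A tier 2", 2025, 2),
    Doc("formulary position for drug A tier 3 preferred", 2025, 4),
]

query = "formulary position for drug A"

# Similarity alone: the Q2 document happens to score higher and is ranked first.
unfiltered = retrieve(query, docs)

# Period filter applied before ranking: only the Q4 document is eligible.
filtered = retrieve(query, docs, year=2025, quarter=4)
```

The filter only helps when the requested period is known and every document carries reliable period metadata. It does nothing for questions about recurrence across quarters, which require comparing records over time rather than retrieving one of them.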
02 Models Don't Always Believe What They Retrieve
RAG assumes that when a model is handed retrieved evidence, it will use that evidence. In practice, this assumption breaks down, especially for facts that change frequently. Language models develop strong priors during pretraining. When retrieved information conflicts with prior knowledge, models often do one of two things: they selectively accept the parts of the retrieved evidence that align with what they already "know," or they set the conflicting evidence aside entirely. This is a form of confirmation bias.[1]
Formulary data, tier positions, and prior authorization and step therapy requirements are particularly vulnerable. The out-of-the-box model used in a POC may have been trained on a publicly available formulary, such as a 2024 Express Scripts Medicare PDP. When it is handed the 2026 Express Scripts formulary and the data looks different, retrieved context rarely wins cleanly. The result is answers that look authoritative but are quietly wrong.
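One way to catch this kind of quiet disagreement is to compare the model's answer, field by field, against the record that was actually retrieved. The sketch below assumes the answer has already been parsed into structured fields. The `find_conflicts` helper, the field names, and the values are all hypothetical, and this is an illustration of the idea rather than a complete validation layer.

```python
def find_conflicts(answer: dict, retrieved: dict) -> dict:
    """Return fields where the model's answer disagrees with the retrieved record.

    Each conflicting field maps to a (model_value, retrieved_value) pair.
    Fields present in only one of the two dicts are ignored.
    """
    return {
        field: (answer[field], retrieved[field])
        for field in answer
        if field in retrieved and answer[field] != retrieved[field]
    }


# Hypothetical current-year record supplied to the model as retrieved context.
retrieved_record = {"tier": 3, "prior_auth": True, "step_therapy": False}

# Hypothetical model answer that fell back on an older tier position.
model_answer = {"tier": 2, "prior_auth": True}

conflicts = find_conflicts(model_answer, retrieved_record)
if conflicts:
    # In a real system this might block the answer or route it for review.
    print("Answer disagrees with retrieved evidence:", conflicts)
```

A check like this only covers fields that can be extracted and compared mechanically. Free-text claims need a different approach.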
03 Multi-Hop Reasoning Breaks
Even when retrieval works, reasoning is a separate problem. Most contract operations questions aren't single lookups — they require chaining multiple pieces of information. Agents may pull data from ERP systems, revenue management systems, contract management systems, and formulary documentation simultaneously.
This chaining is called multi-hop reasoning, and research consistently shows it is where RAG collapses.[2] In roughly 80% of documented failures, the correct evidence was already in the retrieved context — the model simply couldn't reason across it correctly.
Think of it like a game of telephone: each handoff between a sub-question and the next introduces noise, and one bad step poisons every step that follows. This is an architecture problem. The more hops, the greater the compounding risk. Without rigorous validation logic at every step, the system is quietly wrong in ways that are hard to audit and expensive to catch.
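The compounding effect can be sketched with simple arithmetic, under the simplifying assumption that every hop succeeds independently with the same probability. The second half of the sketch shows one possible shape for per-step validation, where a chain stops at the first hop whose output fails its check. The function names, the lookup tables, and the 10% rebate rate are hypothetical and exist only for illustration.

```python
from typing import Any, Callable


def chain_success_probability(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every hop is correct, assuming independent hops."""
    return per_hop_accuracy ** hops


# Under this assumption, 95% per-hop accuracy over four hops is about 81%.
four_hop = chain_success_probability(0.95, 4)

# A hop is a (step, validator) pair: the step transforms the value,
# and the validator decides whether the result is safe to pass on.
Hop = tuple[Callable[[Any], Any], Callable[[Any], bool]]


def run_validated_chain(start: Any, hops: list[Hop]) -> Any:
    """Run each step in order and stop at the first invalid intermediate result."""
    value = start
    for index, (step, is_valid) in enumerate(hops, start=1):
        value = step(value)
        if not is_valid(value):
            raise ValueError(f"hop {index} produced an invalid result: {value!r}")
    return value


# Hypothetical lookups standing in for contract and pricing systems.
contracts = {"ACME-2025": "NDC-0001"}
prices = {"NDC-0001": 120.0}

hops: list[Hop] = [
    (lambda contract_id: contracts.get(contract_id), lambda ndc: ndc is not None),
    (lambda ndc: prices.get(ndc), lambda price: price is not None and price > 0),
    (lambda price: price * 0.10, lambda rebate: rebate >= 0),  # assumed 10% rebate
]

rebate = run_validated_chain("ACME-2025", hops)
```

Calling `run_validated_chain("UNKNOWN", hops)` raises a `ValueError` at the first hop instead of passing `None` downstream. Failing early at the hop that broke is what makes an error auditable rather than buried in the final answer.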
— What This Means
Pharmaceutical contract operations sits atop a tremendous pool of value — and risk. The margin for error is zero. When evaluating agentic solutions, the right questions aren't about which tool to buy. They are about validation architecture, operational memory, and whether the system was actually built for this domain.
Most weren't. And those gaps don't get patched in later.
[1] Kortukov et al., "Studying Large Language Model Behaviors Under Context-Memory Conflicts With Real Documents," arXiv:2404.16032
[2] Zarrinkia et al., "The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA," arXiv:2603.14045