I was stress-testing a RAG system built for regulated industries — financial services and life sciences. The grounding was fine. No hallucinations. What I found were subtler failures — the kind that only surface when analysts run the same query twice, compare citations across sessions, and need to explain to a regulator exactly which document an answer came from.

In regulated environments, that’s the standard. And the system wasn’t meeting it.


Bug 1: The Embedding Variance Problem

My retrieval pipeline is hybrid: BM25 + TF-IDF + dense embeddings fused via Reciprocal Rank Fusion (RRF). The BM25 and TF-IDF lanes are deterministic by construction — exact string matching, frequency counts. The dense lane isn’t.

I’m running Ollama locally with nomic-embed-text. I discovered that calling the embedding API twice with identical input text returns slightly different vectors — different enough that the floating-point ordering of candidates shifts.

The cascade is subtle but real: same query → slightly different dense vector → different dense lane rankings → different RRF fusion candidates → different cross-encoder candidates → different final passages. The analyst gets different citations on the same question without understanding why.

The fix required two cooperating caches. For corpus chunks, I added a disk-persisted cache keyed by a SHA-256 fingerprint of all chunk texts plus the model name — vectors load from disk on match, re-embed on mismatch. For query embeddings, I added an in-memory dict on the retriever instance — same query string returns the same vector within a session.

Then I found one more: even with identical vectors, the RRF fusion sort was non-deterministic for tied scores. A float comparison between two 0.016667 values has no guaranteed stable order. I added a secondary sort key — chunk ID — so ties always break the same way.

These changes are not just dev scaffolding. The cache enforces a contract my application controls — the embedding for a given text is a constant, immutable fact within this system, regardless of what the provider does underneath. Even cloud providers can shift behavior across model updates or backend changes. The consistency bug you can reproduce in dev is the one you can actually fix. The one that only appears in prod is the one that erodes analyst trust before you understand what’s happening.


Bug 2: The Gate That Always Said Yes

I had a function called is_sufficient() in my search agent. Its job was to evaluate whether retrieved evidence was good enough to warrant synthesis — a gate between retrieval and LLM call.

def is_sufficient(self, evidence: Evidence) -> bool:
    return evidence.total_retrieved >= 1 and evidence.passages[0].score > self.min_score

The problem: retrieve() had already filtered out every passage below min_score. By the time is_sufficient() ran, every passage in the evidence object already had a score above min_score by definition. The check always returned True whenever any passage was retrieved at all.

A function that always returns True isn’t a gate. It’s a comment.

The fix: introduce a second threshold — EVIDENCE_QUALITY_THRESHOLD — set higher than the noise filter (MIN_SCORE * 3 by default). The two thresholds have distinct jobs: one filters garbage out of the index, the other certifies that what survived is actually worth synthesizing. Conflating them was the root of the bug.

That fixed the signal. But the gate was in the wrong place.


Bug 3: The Synthesis Layer Had No Gate at All

Even with is_sufficient() fixed, synthesis had no pre-check of its own. The agent could call synthesize() directly — bypassing the gate entirely — and synthesis would proceed on whatever evidence it received, no matter how weak.

This matters because a confident-sounding answer with a confidence: low flag buried in the metadata is as dangerous as a hallucination in a compliance context. A compliance officer under deadline pressure may not check the metadata before acting on the answer.

The right fix isn’t a softer answer. It’s no answer.

I added a hard gate at the top of synthesize(). Before any LLM call, check the top passage score:

if top_score < EVIDENCE_QUALITY_THRESHOLD:
    return SynthesisResult(
        answer=f"Evidence quality too low to synthesize (top score {top_score:.4f} < "
               f"threshold {EVIDENCE_QUALITY_THRESHOLD:.4f}). "
               "Try a more specific query or check corpus coverage.",
        grounded=False,         # audit signal: this answer is not grounded in retrieved evidence
        confidence="insufficient",
    )

No tokens spent. No LLM call. The analyst gets a diagnostic message — exact scores, what to try next — rather than a misleading answer they might act on.

Bug 2 was a failure of validation logic — checking a value that had already been pre-filtered. Bug 3 was a failure of architectural enforcement — the synthesis layer had no gate of its own, so the validation fix in Bug 2 could be bypassed entirely. Same theme, two different layers of failure.


Bug 4: The Score Scale Mismatch

My hybrid retriever uses RRF fusion scores in the range of roughly 0.001 to 0.05. When I enable the cross-encoder reranker (ms-marco-MiniLM-L-6-v2), scores become raw logits — unbounded values roughly between -10 and +10.

Thresholds calibrated for RRF scores (EVIDENCE_QUALITY_THRESHOLD = 0.003) are meaningless against cross-encoder logits. A logit of 0.003 is essentially zero on that scale.

I didn’t normalize scores to a common scale — this is configuration as documentation, and that choice is worth explaining. A normalization layer would have to know which scoring mode is active at threshold-check time, and update correctly every time a new reranker is added. The retrieval backend in this system is configurable — normalizing across modes adds maintenance burden for a problem that only surfaces when you switch backends, and in production you don’t. The right fix is to make the calibration requirement visible at configuration time, not hide it behind a normalization that could silently be wrong for the next backend you try.

So the .env.example now has this:

EVIDENCE_QUALITY_THRESHOLD=0.003  # Min top-passage score for synthesis to proceed
                                   # BM25/TF-IDF: ~0.003; cross-encoder: 0.0 to 2.0

That comment is load-bearing. Whoever deploys this system sees the calibration requirement before they configure it, not after they’ve been confused by the behavior.


Four Gaps in the Trust Layer

These four bugs have a pattern. None of them broke the system in a way that would show up in a demo. The system retrieved passages, called synthesis, returned answers with citations. From the outside, it looked correct.

What was broken was the contract with the analyst.

In a regulated environment, an analyst may need to explain to a regulator exactly which passage an answer came from, on a specific date. When the system returns different passages for the same query on different days, the analyst can’t tell if the answer changed because the corpus changed or because the system is unreliable. They stop trusting it.

The retrieval engine is your contract with the analyst. The LLM is just the renderer.

Once the retrieval layer is deterministic and the evidence gate is doing real work, switching from local Ollama to Claude in production is literally a .env change. That design only pays off if what’s underneath it is solid. These bugs were the work of making it solid.