[{"content":"I was building a GPU recommendation engine — one that maps workload descriptions to specific configurations, primary recommendations, and cost ranges — and kept hitting the same wall: getting the recommendations right meant going deep on every constraint that determines whether a deployment actually works. Not whether it\u0026rsquo;s affordable. Whether it works at all.\nVRAM has to fit the full training state, not just the model weights. Training data has to be where the GPUs are. The interconnect has to support the parallelism strategy. None of that shows up in a $/hr comparison. Here are the five calculations that come before it.\n1. VRAM Fit: The Hard Constraint Before Everything Else The first thing I had to establish was whether the model fits on the GPU at all. If VRAM is exhausted, training crashes — not degrades, crashes. The full memory envelope is what matters: model weights plus optimizer state plus activations plus the KV cache for inference.\nModel weights are the baseline. The per-parameter byte cost depends entirely on your training mode:\nMode Bytes per parameter Full fine-tuning (mixed precision: BF16 compute + FP32 optimizer states, Adam) 18 bytes LoRA (BF16 base frozen + adapter trained in BF16 + Adam optimizer) 2 bytes base + ~6–12 GB adapter overhead QLoRA (NF4 quantized base: 0.5 bytes/param + LoRA adapter + paged Adam overhead) ~0.5 bytes base + ~5–15 GB adapter overhead BF16 inference (weights only, no optimizer) 2 bytes A 70B parameter model under full fine-tuning requires roughly 1.26 TB of VRAM for the full training state. In practice, this is distributed across multiple GPUs using ZeRO-3 (DeepSpeed) or FSDP, which shard optimizer states and gradients across devices — a single GPU no longer holds the full 1.26 TB, but the aggregate VRAM requirement across the cluster is unchanged. 
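As a quick sanity check on that arithmetic, here is a minimal sketch. The function name and the flat adapter-overhead default are illustrative assumptions, not the recommendation engine itself:

```python
# Rough training-state VRAM estimator for the modes in the table above.
# Bytes-per-parameter figures follow the table: full fine-tuning 18 B/param,
# LoRA 2 B/param frozen base, QLoRA 0.5 B/param base, BF16 inference 2 B/param.
GB = 1e9

def training_vram_gb(params: float, mode: str, adapter_overhead_gb: float = 10.0) -> float:
    """Estimate aggregate VRAM (GB) for weights plus optimizer state.

    `adapter_overhead_gb` stands in for the ~5-15 GB LoRA/QLoRA adapter and
    paged-optimizer overhead; activations and KV cache are excluded.
    """
    bytes_per_param = {
        "full_ft": 18.0,    # BF16 weights + gradients + FP32 Adam states
        "lora": 2.0,        # frozen BF16 base; adapter counted separately
        "qlora": 0.5,       # NF4-quantized base
        "inference": 2.0,   # BF16 weights only, no optimizer
    }[mode]
    base = params * bytes_per_param / GB
    extra = adapter_overhead_gb if mode in ("lora", "qlora") else 0.0
    return base + extra

print(training_vram_gb(70e9, "full_ft"))  # 1,260 GB aggregate, i.e. ~1.26 TB
print(training_vram_gb(70e9, "qlora"))    # 35 GB base + ~10 GB adapter overhead
```

The QLoRA result is what collapses the procurement question from a multi-node cluster to a single node.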
The number still determines how many nodes you need; it\u0026rsquo;s just not the per-GPU ceiling when you use modern sharding.\nUnder QLoRA, the NF4-quantized base model weighs about 35 GB, with LoRA adapter parameters and paged optimizer adding roughly 5–15 GB. Real-world peak VRAM during training typically lands between 45–65 GB depending on LoRA rank and batch size — comfortably within a single A100 80GB node in most configurations. That\u0026rsquo;s the difference between a multi-node cluster and a single node.\nKV cache for inference adds memory that scales with sequence length, not model size:\nKV cache = 2 × num_layers × num_kv_heads × head_dim × seq_len × 2 bytes Modern large models use Grouped Query Attention (GQA), where num_kv_heads is much smaller than the total attention head count — LLaMA-70B uses 8 KV heads vs. 64 query heads, reducing KV cache by 8×. For a 70B model at 8K context with 80 layers, 8 KV heads, and 128 head_dim, that\u0026rsquo;s roughly 2.5 GB per sequence in flight. At a batch of 10 concurrent requests, that\u0026rsquo;s 25 GB of KV cache before a single weight is loaded.\nActivations assume gradient checkpointing. Without it, activation memory can exceed weight memory on long sequences.\nThe VRAM calculation isn\u0026rsquo;t hard math. It\u0026rsquo;s knowing which formula applies to which mode. Skip it and the $/hr number becomes irrelevant at job launch time.\n2. Quantization Changes the Hardware Requirement Entirely Quantization isn\u0026rsquo;t a fine-tuning detail. It\u0026rsquo;s a hardware selection variable. The same model at different precision levels requires completely different GPU configurations.\nPrecision Bytes per parameter 70B model footprint FP32 4 280 GB BF16 2 140 GB INT8 1 70 GB INT4 0.5 35 GB Going from FP32 to INT4 on a 70B model reduces the weight footprint by 8×. 
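The KV-cache formula and the precision table can both be checked numerically. A short sketch, assuming LLaMA-70B-like shapes (80 layers, 8 KV heads under GQA, head_dim 128):

```python
GB = 1e9

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Leading 2 accounts for the K and V tensors; bytes_per_elem=2 assumes FP16/BF16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    # Weights-only footprint at a given precision, in decimal GB.
    return params * bytes_per_param / GB

per_seq = kv_cache_bytes(80, 8, 128, 8192)
print(per_seq / GB)       # ~2.7 GB decimal (exactly 2.5 GiB) per sequence in flight
print(10 * per_seq / GB)  # KV cache for 10 concurrent requests

print(weight_footprint_gb(70e9, 4))    # FP32: 280 GB
print(weight_footprint_gb(70e9, 0.5))  # INT4: 35 GB
```

Without GQA (64 KV heads instead of 8), the same call returns 8x the cache, which is why the attention layout matters as much as the context length.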
That\u0026rsquo;s the difference between a four-card A100 configuration (4 × 80 GB = 320 GB total VRAM) and a single H100 (80 GB) — for the exact same model.\nThe catch is precision loss. INT4 (via GPTQ or GGUF) can introduce measurable accuracy loss on tasks that require numerical precision — financial calculations, structured output with tight constraints, legal reasoning where exact phrasing matters. From what I found in my research, INT8 generally holds up well for inference with minor accuracy degradation. For training, BF16 is the standard — it has the same exponent range as FP32 without the memory cost.\nQuantization and hardware selection turned out to be the same decision.\n3. Multi-Node Is Not Just More GPUs Once the model doesn\u0026rsquo;t fit on a single node, the architecture changes — and not just because there are more GPUs.\nThe threshold is simple: if required VRAM exceeds single-node capacity, you need multi-node. The ceiling division tells you how many nodes:\nnodes_required = ceil(total_vram_required / vram_per_node) But that calculation only tells you how many nodes — not whether they can communicate fast enough to make distributed training work.\nInside a single node, GPUs communicate over NVLink — NVIDIA\u0026rsquo;s proprietary interconnect running at 900 GB/s on H100 NVLink 4.0 (600 GB/s on A100 NVLink 3.0). Tensor parallelism (splitting a single layer across multiple GPUs) is viable at this bandwidth. You can shard individual weight matrices horizontally across GPUs and the synchronization cost is low enough to be worth it.\nAcross nodes, the story changes. You\u0026rsquo;re on InfiniBand or Ethernet. InfiniBand HDR gives you about 200 Gb/s (25 GB/s) per port — more than an order of magnitude slower than NVLink. At this bandwidth, tensor parallelism across nodes is usually a net loss. The synchronization overhead exceeds the compute benefit. 
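The node-count side of this is mechanical. A sketch of the ceiling division above, using the full fine-tuning footprint from section 1 (the 8-GPU, 80 GB node shape is an assumption for illustration):

```python
import math

def nodes_required(total_vram_gb: float, gpus_per_node: int = 8,
                   vram_per_gpu_gb: float = 80.0) -> int:
    # Ceiling division: a partially filled node still counts as a whole node.
    return math.ceil(total_vram_gb / (gpus_per_node * vram_per_gpu_gb))

# 70B full fine-tuning: ~1,260 GB aggregate across 8x 80 GB nodes
print(nodes_required(1260))                   # 2 nodes (16 GPUs minimum)
# The QLoRA case from section 1 stays single-node
print(nodes_required(47))                     # 1 node
```

At 640 GB per node, the full fine-tuning footprint needs two nodes, and the cross-node bandwidth gap is then what decides the parallelism strategy.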
You switch to pipeline parallelism instead — each node holds complete model layers, and the forward pass flows through nodes sequentially.\nIf your model requires cross-node tensor parallelism, InfiniBand isn\u0026rsquo;t optional — it\u0026rsquo;s a correctness requirement, not a performance preference.\nThe mechanism that synchronizes multi-GPU training is NCCL — NVIDIA Collective Communications Library. NCCL handles AllReduce operations: after each backward pass, every GPU has computed its local gradients, and NCCL aggregates them across all GPUs so each one updates with the full gradient. The abstraction is clean. The failure modes aren\u0026rsquo;t.\nNCCL misconfiguration doesn\u0026rsquo;t crash training. It silently degrades throughput — sometimes to 20% of expected performance — because the collective operations serialize where they should be parallel. The symptom looks like slow hardware. From what I found in my research, the usual cause is a wrong NCCL_SOCKET_IFNAME environment variable pointing at the management network instead of the high-speed fabric, or a topology the auto-detection logic didn\u0026rsquo;t handle correctly. If multi-node training is slower than single-node extrapolation would predict, NCCL environment variables are the first place to look.\nWhat I\u0026rsquo;d want answered before signing a multi-node contract:\nGPUs per node, by GPU type Interconnect type and generation (InfiniBand HDR/NDR, or Ethernet) Whether InfiniBand is enabled per-node or cluster-wide NVLink topology within a node 4. The Egress Line Item You Didn\u0026rsquo;t Quote Moving to a new GPU provider doesn\u0026rsquo;t just mean moving your workload. It means moving your data. 
If your training dataset lives in AWS S3 and you\u0026rsquo;re training on a bare metal GPU provider, you\u0026rsquo;re paying egress on every training run that reads from S3.\nThe formula is mechanical:\nmonthly_egress_cost = dataset_size_GB × training_runs_per_month × egress_rate_per_GB AWS egress is tiered: first 100 GB/month free, then $0.09/GB up to 10 TB, dropping to $0.085/GB and lower at higher volumes. GCP and Azure are similar. In practice, $0.09/GB is the operative rate for iterative training experiments. A 500 GB dataset running 20 training experiments per month is $900/month in egress — before a single GPU-hour. Caching training data locally on the GPU provider after the first pull, or using a same-cloud provider, eliminates this cost entirely. The formula tells you whether it\u0026rsquo;s worth solving.\nThe egress number is exact arithmetic. It was the last thing I added to the recommendation tool — and nearly the last thing I would have thought to include.\nBeyond egress cost, migration friction has a qualitative dimension. Not all cloud dependencies detach cleanly:\nHigh friction — services with logic baked in, not just data stored. SageMaker endpoints embed training pipelines that aren\u0026rsquo;t portable. EKS/GKE workloads carry IAM policies, cluster autoscalers, and networking rules that reference cloud-specific primitives. Moving these isn\u0026rsquo;t a copy operation — it\u0026rsquo;s a rewrite.\nMedium friction — data dependencies where the data is portable but moving it takes time and money. S3 training data with a fixed egress cost. CloudWatch dashboards that need to be rebuilt elsewhere. These have a price you can calculate.\nLow friction — ECR container images, exported model weights, raw datasets in open formats. These move freely.\nUnderstanding which category each dependency falls into tells you the real migration cost — both the one-time move and the ongoing egress that doesn\u0026rsquo;t go away after you\u0026rsquo;ve moved.\n5. 
TCO Has Four Rows, Not One True total cost of ownership for GPU infrastructure is four numbers:\nComponent Precision Compute Exact — GPU-hours × hourly rate Storage Exact — GB-months × storage rate Egress Exact — dataset size × runs × egress rate Managed services premium Estimated — SageMaker carries roughly 30% overhead vs self-managed Compute is the number everyone starts with. GPU-hours is the right unit — calculated from the training compute formula:\ngpu_hours = (6 × parameters × dataset_tokens × epochs) / (gpu_flops × 3600) This is derived from the standard estimate that training a transformer requires approximately 6 floating-point operations per parameter per token (forward pass + backward pass); dividing total training FLOPs by GPU throughput gives seconds, and the 3600 converts seconds to hours. GPU FLOPS are published in the vendor spec sheet — H100 SXM5 delivers 989 TFLOPS of BF16 tensor core throughput (dense; 1,979 TFLOPS with structured sparsity — vendors sometimes quote the sparsity figure, so verify which number you\u0026rsquo;re comparing against). The math is exact — verifiable against your engineering team\u0026rsquo;s own estimates.\nStorage is exact given a provider\u0026rsquo;s storage rate. It\u0026rsquo;s also easy to get wrong: comparing one provider\u0026rsquo;s compute rates against another\u0026rsquo;s storage rates, or skipping storage entirely. At 500 GB of training data plus checkpoints plus output artifacts, storage is not a rounding error.\nEgress is exactly the formula from the previous section. It\u0026rsquo;s often the surprise line item — paid on every training run for the life of the contract if your data stays in the source cloud.\nManaged services is the one honest estimate. SageMaker, Azure ML, and Vertex AI abstract away cluster management, autoscaling, and experiment tracking — at roughly 30% over bare metal compute. The point is to see it explicitly, not buried in a blended rate.\nA bare metal provider at $2.50/hr looks cheap until you add $900/month in egress and the cost of managing your own cluster. 
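The compute formula and the four rows reduce to a few lines. A sketch under stated assumptions (989 TFLOPS dense BF16, illustrative rates and dataset sizes); note the explicit seconds-to-hours conversion:

```python
def gpu_hours(params: float, tokens: float, epochs: int, flops_per_sec: float) -> float:
    # ~6 FLOPs per parameter per token (forward + backward), converted to hours
    return 6 * params * tokens * epochs / flops_per_sec / 3600

def tco(hours: float, rate: float, storage_month: float,
        dataset_gb: float, runs: int, egress_rate: float, cached: bool = True) -> float:
    # Compute + storage + egress; egress is paid once if cached, once per run if not.
    compute = hours * rate
    egress = dataset_gb * egress_rate * (1 if cached else runs)
    return compute + storage_month + egress

h = gpu_hours(70e9, 1e9, 3, 989e12)
print(round(h))                                                # 354 GPU-hours
print(round(tco(h, 2.80, 28, 500, 20, 0.09)))                  # 1064 (data cached)
print(round(tco(h, 2.80, 28, 500, 20, 0.09, cached=False)))    # 1919 (re-pulled per run)
```

Every row except the managed-services premium is exact arithmetic, which is what makes the $/hr comparison meaningful once all four are on the table.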
SageMaker at $3.50/hr looks expensive until you account for what you\u0026rsquo;re not building. Neither is universally correct. Both are visible when you run all four rows.\nExample: All Five Calculations on a Real Scenario Setup: Fine-tuning LLaMA-3 70B on a 500 GB proprietary corpus. Training mode: QLoRA. Two candidates — a bare metal H100 80GB provider at $2.80/hr and AWS SageMaker. Current data location: AWS S3.\nCalculation 1 — VRAM fit\nNF4-quantized base: 70B × 0.5 bytes = 35 GB. LoRA adapter + paged Adam: ~12 GB. Peak: ~47 GB. One H100 80GB clears it with 33 GB to spare. No further VRAM analysis needed.\nFor reference: full fine-tuning on the same model (18 bytes/param × 70B = 1,260 GB aggregate) would require two eight-GPU H100 nodes under ZeRO-3 sharding — a completely different procurement.\nCalculation 2 — Quantization decision\nQLoRA (NF4) locks in a single-node configuration. If you later serve the trained adapter merged into the base model at BF16 for inference: 140 GB → two 80GB GPUs. That\u0026rsquo;s a separate hardware decision for the serving layer.\nCalculation 3 — Multi-node\n47 GB peak \u0026lt; 80 GB node capacity. No multi-node. No InfiniBand questions to ask. No NCCL configuration. If the mode were full fine-tuning: ceil(1,260 / 80) = 16 GPUs minimum — two nodes, InfiniBand required, NCCL setup non-trivial. The quantization decision from step 2 made this a non-issue.\nCalculation 4 — Egress\nTraining data lives in S3; bare metal provider is not AWS.\nOne-time data pull: 500 GB × $0.09 = $45 Per-run if re-reading from S3: $45/run 20 training experiments, each pulling fresh: $900 total Cache the dataset on the provider after the first pull: $45 total Whether you cache is a workflow decision that changes the egress line by $855. 
Write it down.\nCalculation 5 — TCO\nGPU-hours: (6 × 70B params × 1B training tokens × 3 epochs) ÷ 989 TFLOPS ≈ 1.27M GPU-seconds ≈ 354 GPU-hours\nComponent Bare metal H100 SageMaker Compute $991 (354 hr × $2.80) $1,288 (+30%) Storage (corpus + checkpoints + artifacts) $28/month $28/month Egress — data cached after first pull $45 $0 (same-cloud) Egress — 20 runs, pulling from S3 each time $900 $0 (same-cloud) Total — data cached $1,064 $1,316 Total — 20 runs uncached $1,919 $1,316 Bare metal wins by $252 if you pull data once and cache it on the provider. SageMaker wins by $603 if you run 20 experiments pulling fresh from S3 each time.\nNeither answer is universally correct. Both are calculable before you sign anything.\nWhat This Changes Working through these to build the recommendation engine made clear why $/hr comparisons fall short — the number is real, but it doesn\u0026rsquo;t carry information about whether the configuration works, what the data movement costs, or whether InfiniBand is available for the multi-node case.\nBefore I could compare an H100 at $2.80/hr against an A100 at $1.60/hr, I needed to know whether my model fit in a single A100 node, what egress looked like, and whether InfiniBand was available. That\u0026rsquo;s what these five calculations gave me.\nThese calculations tell you whether a configuration works. They don\u0026rsquo;t tell you which configurations to put on the table in the first place — that\u0026rsquo;s the problem I built the tool to solve.\nI\u0026rsquo;ve been building GTM intelligence tools — the GPU Advisor is one of them. It runs these calculations against your actual workload profile and generates a report with actionable recommendations. Happy to show you a live demo. 
Reach out on LinkedIn.\n","permalink":"https://aitechy226.github.io/posts/gpu-infrastructure-five-calculations/","summary":"\u003cp\u003eI was building a GPU recommendation engine — one that maps workload descriptions to specific configurations, primary recommendations, and cost ranges — and kept hitting the same wall: getting the recommendations right meant going deep on every constraint that determines whether a deployment actually works. Not whether it\u0026rsquo;s affordable. Whether it works at all.\u003c/p\u003e\n\u003cp\u003eVRAM has to fit the full training state, not just the model weights. Training data has to be where the GPUs are. The interconnect has to support the parallelism strategy. None of that shows up in a $/hr comparison. Here are the five calculations that come before it.\u003c/p\u003e","title":"GPU Infrastructure: The Five Calculations That Actually Matter"},{"content":"I was stress-testing a RAG system built for regulated industries — financial services and life sciences. The grounding was fine. No hallucinations. What I found were subtler failures — the kind that only surface when analysts run the same query twice, compare citations across sessions, and need to explain to a regulator exactly which document an answer came from.\nIn regulated environments, that\u0026rsquo;s the standard. And the system wasn\u0026rsquo;t meeting it.\nBug 1: The Embedding Variance Problem My retrieval pipeline is hybrid: BM25 + TF-IDF + dense embeddings fused via Reciprocal Rank Fusion (RRF). The BM25 and TF-IDF lanes are deterministic by construction — exact string matching, frequency counts. The dense lane isn\u0026rsquo;t.\nI\u0026rsquo;m running Ollama locally with nomic-embed-text. 
I discovered that calling the embedding API twice with identical input text returns slightly different vectors — different enough that the floating-point ordering of candidates shifts.\nThe cascade is subtle but real: same query → slightly different dense vector → different dense lane rankings → different RRF fusion candidates → different cross-encoder candidates → different final passages. The analyst gets different citations on the same question without understanding why.\nThe fix required two cooperating caches. For corpus chunks, I added a disk-persisted cache keyed by a SHA-256 fingerprint of all chunk texts plus the model name — vectors load from disk on match, re-embed on mismatch. For query embeddings, I added an in-memory dict on the retriever instance — same query string returns the same vector within a session.\nThen I found one more: even with identical vectors, the RRF fusion sort was non-deterministic for tied scores. A float comparison between two 0.016667 values has no guaranteed stable order. I added a secondary sort key — chunk ID — so ties always break the same way.\nThese changes are not just dev scaffolding. The cache enforces a contract my application controls — the embedding for a given text is a constant, immutable fact within this system, regardless of what the provider does underneath. Even cloud providers can shift behavior across model updates or backend changes. The consistency bug you can reproduce in dev is the one you can actually fix. The one that only appears in prod is the one that erodes analyst trust before you understand what\u0026rsquo;s happening.\nBug 2: The Gate That Always Said Yes I had a function called is_sufficient() in my search agent. 
Its job was to evaluate whether retrieved evidence was good enough to warrant synthesis — a gate between retrieval and LLM call.\ndef is_sufficient(self, evidence: Evidence) -\u0026gt; bool: return evidence.total_retrieved \u0026gt;= 1 and evidence.passages[0].score \u0026gt; self.min_score The problem: retrieve() had already filtered out every passage below min_score. By the time is_sufficient() ran, every passage in the evidence object already had a score above min_score by definition. The check always returned True whenever any passage was retrieved at all.\nA function that always returns True isn\u0026rsquo;t a gate. It\u0026rsquo;s a comment.\nThe fix: introduce a second threshold — EVIDENCE_QUALITY_THRESHOLD — set higher than the noise filter (MIN_SCORE * 3 by default). The two thresholds have distinct jobs: one filters garbage out of the index, the other certifies that what survived is actually worth synthesizing. Conflating them was the root of the bug.\nThat fixed the signal. But the gate was in the wrong place.\nBug 3: The Synthesis Layer Had No Gate at All Even with is_sufficient() fixed, synthesis had no pre-check of its own. The agent could call synthesize() directly — bypassing the gate entirely — and synthesis would proceed on whatever evidence it received, no matter how weak.\nThis matters because a confident-sounding answer with a confidence: low flag buried in the metadata is as dangerous as a hallucination in a compliance context. A compliance officer under deadline pressure may not check the metadata before acting on the answer.\nThe right fix isn\u0026rsquo;t a softer answer. It\u0026rsquo;s no answer.\nI added a hard gate at the top of synthesize(). Before any LLM call, check the top passage score:\nif top_score \u0026lt; EVIDENCE_QUALITY_THRESHOLD: return SynthesisResult( answer=f\u0026#34;Evidence quality too low to synthesize (top score {top_score:.4f} \u0026lt; \u0026#34; f\u0026#34;threshold {EVIDENCE_QUALITY_THRESHOLD:.4f}). 
\u0026#34; \u0026#34;Try a more specific query or check corpus coverage.\u0026#34;, grounded=False, # audit signal: this answer is not grounded in retrieved evidence confidence=\u0026#34;insufficient\u0026#34;, ) No tokens spent. No LLM call. The analyst gets a diagnostic message — exact scores, what to try next — rather than a misleading answer they might act on.\nBug 2 was a failure of validation logic — checking a value that had already been pre-filtered. Bug 3 was a failure of architectural enforcement — the synthesis layer had no gate of its own, so the validation fix in Bug 2 could be bypassed entirely. Same theme, two different layers of failure.\nBug 4: The Score Scale Mismatch My hybrid retriever uses RRF fusion scores in the range of roughly 0.001 to 0.05. When I enable the cross-encoder reranker (ms-marco-MiniLM-L-6-v2), scores become raw logits — unbounded values roughly between -10 and +10.\nThresholds calibrated for RRF scores (EVIDENCE_QUALITY_THRESHOLD = 0.003) are meaningless against cross-encoder logits. A logit of 0.003 is essentially zero on that scale.\nI didn\u0026rsquo;t normalize scores to a common scale — this is configuration as documentation, and that choice is worth explaining. A normalization layer would have to know which scoring mode is active at threshold-check time, and update correctly every time a new reranker is added. The retrieval backend in this system is configurable — normalizing across modes adds maintenance burden for a problem that only surfaces when you switch backends, and in production you don\u0026rsquo;t. The right fix is to make the calibration requirement visible at configuration time, not hide it behind a normalization that could silently be wrong for the next backend you try.\nSo the .env.example now has this:\nEVIDENCE_QUALITY_THRESHOLD=0.003 # Min top-passage score for synthesis to proceed # BM25/TF-IDF: ~0.003; cross-encoder: 0.0 to 2.0 That comment is load-bearing. 
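To make the calibration contract concrete, here is a minimal sketch of a startup loader that consumes the variable and feeds the gate; the names are hypothetical, not the system under discussion:

```python
import os

# Hypothetical startup loader. The default matches the RRF-fusion score scale
# (~0.001 to 0.05); anyone switching to cross-encoder logits must re-set the
# env var, exactly as the .env.example comment warns.
EVIDENCE_QUALITY_THRESHOLD = float(os.getenv("EVIDENCE_QUALITY_THRESHOLD", "0.003"))

def gate_synthesis(top_score: float,
                   threshold: float = EVIDENCE_QUALITY_THRESHOLD) -> bool:
    """True if the top passage clears the quality bar; False means no LLM call."""
    return top_score >= threshold
```

Because the threshold lives in one environment variable rather than being normalized per backend, the deployment-time decision stays visible in exactly one place.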
Whoever deploys this system sees the calibration requirement before they configure it, not after they\u0026rsquo;ve been confused by the behavior.\nFour Gaps in the Trust Layer These four bugs have a pattern. None of them broke the system in a way that would show up in a demo. The system retrieved passages, called synthesis, returned answers with citations. From the outside, it looked correct.\nWhat was broken was the contract with the analyst.\nIn a regulated environment, an analyst may need to explain to a regulator exactly which passage an answer came from, on a specific date. When the system returns different passages for the same query on different days, the analyst can\u0026rsquo;t tell if the answer changed because the corpus changed or because the system is unreliable. They stop trusting it.\nThe retrieval engine is your contract with the analyst. The LLM is just the renderer.\nOnce the retrieval layer is deterministic and the evidence gate is doing real work, switching from local Ollama to Claude in production is literally a .env change. That design only pays off if what\u0026rsquo;s underneath it is solid. These bugs were the work of making it solid.\n","permalink":"https://aitechy226.github.io/posts/enterprise-rag-trust-layer/","summary":"\u003cp\u003eI was stress-testing a RAG system built for regulated industries — financial services and life sciences. The grounding was fine. No hallucinations. What I found were subtler failures — the kind that only surface when analysts run the same query twice, compare citations across sessions, and need to explain to a regulator exactly which document an answer came from.\u003c/p\u003e\n\u003cp\u003eIn regulated environments, that\u0026rsquo;s the standard. 
And the system wasn\u0026rsquo;t meeting it.\u003c/p\u003e","title":"The Trust Layer: What Separates Good RAG from Enterprise RAG"},{"content":"Over the last 18 months I have been in a lot of conversations about AI PCs — with enterprises evaluating fleet upgrades, with device vendors making the case for their hardware, and with IT leaders trying to figure out what their employees actually need.\nThe consistent signal: everybody agrees AI PCs matter. Purchases are happening — Windows 10 end-of-support has accelerated that — but the buying is cautious and uneven. Two reasons come up every time: the AI landscape is moving fast enough that enterprises are not confident their requirements will look the same in 12 months, and they do not have a reliable way to evaluate what they are being sold. Most are not even sure what the right criteria should be.\nThat second problem is what this post is about.\n\u0026ldquo;AI PC\u0026rdquo; Is Doing Too Much Work as a Category The term means completely different things depending on who you are buying for. Most enterprises are buying for several different people at once.\nFor knowledge workers, the AI PC is an efficiency question: does the device stay out of the way? Does the NPU keep the CPU free during a Teams call? Does battery life hold when all the ambient AI features are running? They are not running local models. The hardware\u0026rsquo;s job is to handle that background AI load without degrading the experience.\nFor developers building AI applications, it is a capability question. The hardware question is whether the machine can actually serve inference at the scale their application demands: low time to first token, enough memory to load the model cleanly, enough capacity to handle concurrent requests without stalling. 
A multi-agent pipeline doing RAG across several simultaneous queries, or an agentic workflow fanning out to multiple sub-calls, will expose headroom limits that a single sequential request never will.\nThen there is a group in between: teams with sensitive data who cannot send it to a cloud API. A legal team that wants document summarization without shipping contracts to an external endpoint. A finance team running local search over internal reports. They need local inference capability but they are not developers. They are knowledge workers with a data privacy constraint.\nThree groups. Three different hardware profiles. One procurement decision.\nAnd real users do not fit cleanly into any one of them. The analyst who lives in Teams and occasionally needs to run a heavy workload is a real person in every enterprise. When the evaluation framework is built around clean personas, it misses the people who actually use the hardware.\nThe Stakes Make This Hard to Get Wrong Enterprise hardware is not bought annually. A fleet decision today locks the organization into that hardware for at least three years, sometimes four or five.\nThe AI-enabled premium — the difference between a standard enterprise laptop and a Copilot+ or higher-spec machine — typically runs $300 to $800 per device. That delta often bundles higher RAM, storage, and display specs alongside the NPU, so not all of it is purely AI-driven. Across a 10,000-seat fleet, that is $3M to $8M in incremental spend before support contracts and deployment costs, locked in for three years.\nMultiple user profiles with different needs, a commitment of that size, and a multi-year lock-in — that is exactly the kind of decision that makes intelligent people hesitate. Not because they do not understand the technology. Because they do not have the right instrument to evaluate it.\nThe Vendor Metrics Are Real. They Are Just Answering a Different Question. 
I want to be precise about this because it is easy to make it sound like vendor criticism. That is not the point.\nWhen I say vendor here, I am mostly talking about device vendors — Dell, HP, Lenovo. These are the conversations I have actually been in. A device vendor walks into an enterprise with a comparison deck: here is our Copilot+ configuration, here is what the NPU delivers, here is how it stacks up against the previous generation. Those NPU claims — the TOPS ratings — originate with the chip makers: Qualcomm, Intel, AMD. The device OEMs repackage them into their sales narrative. The numbers are accurate at every step. The benchmarks ran. Everyone in the chain is doing exactly what they should do: making the case for their hardware in a competitive market.\nThe problem is structural. Those benchmarks are designed to differentiate a product. They are not designed to answer whether a specific organization\u0026rsquo;s specific workload mix will benefit from the premium it is being asked to pay. The vendor knows their device. They do not know your workload.\nIn practice, the conversation always hit the same wall. The vendor deck had impressive numbers. The IT leader had no way to connect those numbers to the actual question: will this hardware deliver measurably better outcomes for my people, at this price, for the next three years?\nThat question does not have a vendor-supplied answer. It requires a different kind of measurement entirely.\nWhat I Started Building I came at this originally from the developer side — personally interested in evaluating devices for running LLMs locally, often experimenting with different quantization levels. I wanted to know whether a given machine could actually handle a local inference workload. Not just load a model, but serve it under the kind of concurrent load an agentic application generates. 
The tool I built measures that: parallel request throughput, latency distribution, where memory pressure appears, how performance holds under sustained load.\nWhile that tool answers the developer question well, I know enterprises need to answer a different question: whether the AI-enabled premium would pay off for the thousands of knowledge workers who would never run a local model, but whose daily experience would depend on whether the NPU actually delivered on its promise during a Teams call.\nTokens per second does not tell you anything about CPU headroom during a video call. CPU headroom during a video call does not tell you anything about how an agent pipeline behaves when three requests arrive simultaneously. The tools I have been building are designed around that reality — one for each use case, not a one-size-fits-all benchmark.\nThe developer-side tool is production-grade and running today on Apple Silicon. The knowledge worker fleet evaluation tool is designed and in progress.\nThe next post goes into the developer-side tool in detail — the design decisions, what the implementation revealed, and where the obvious approach turned out to measure the wrong thing.\nI focus on building production-grade AI systems, from agentic pipelines to inference infrastructure. If you are working through an AI PC evaluation as an IT leader or a developer benchmarking hardware for local inference, I am happy to share what I have built and the criteria I have settled on. Reach out on LinkedIn.\n","permalink":"https://aitechy226.github.io/posts/benchmarking-ai-devices/","summary":"\u003cp\u003eOver the last 18 months I have been in a lot of conversations about AI PCs — with enterprises evaluating fleet upgrades, with device vendors making the case for their hardware, and with IT leaders trying to figure out what their employees actually need.\u003c/p\u003e\n\u003cp\u003eThe consistent signal: everybody agrees AI PCs matter. 
Purchases are happening — Windows 10 end-of-support has accelerated that — but the buying is cautious and uneven. Two reasons come up every time: the AI landscape is moving fast enough that enterprises are not confident their requirements will look the same in 12 months, and they do not have a reliable way to evaluate what they are being sold. Most are not even sure what the right criteria should be.\u003c/p\u003e","title":"The AI PC Buying Problem Every Enterprise Needs to Solve"},{"content":" MCP in Production \u0026middot; Part 1 of 2 Part 2: Authentication, Observability, and Operational Design → Most MCP client examples open a session, call a tool, and close the session. That pattern is fine for demos. It breaks in production in ways that aren\u0026rsquo;t obvious until you\u0026rsquo;re staring at a hung process or a spike in latency.\nThis is Part 1 of a two-part series on what it takes to run an MCP client reliably. I\u0026rsquo;ll cover the transport layer: sessions, pooling, dead connection recovery, timeouts, and the heartbeat. Part 2 covers the system layer: authentication, observability, and operational design.\nThe context is a KYC Onboarding Orchestrator I built that calls four MCP servers — Moody\u0026rsquo;s entity data, sanctions/PEP screening, CRM, and document generation — for every onboarding case. Each case makes 8–10 tool calls. Every decision below was made in response to something that actually broke.\nPer-Call Connections Don\u0026rsquo;t Scale I started with the naive design: open a fresh session per tool call, run the tool, close the session. Simple, stateless, nothing to manage.\nThe problem: every call pays full session establishment cost — TCP connect, MCP initialize handshake, then the actual tool call. Inside a Docker network that\u0026rsquo;s a few milliseconds. Against a real Moody\u0026rsquo;s or Refinitiv endpoint over the internet, it\u0026rsquo;s 3–4 RTTs of latency on every single call, multiplied across 8–10 calls per case. 
There\u0026rsquo;s also a server-side cost — every initialize allocates session state that is immediately discarded. I wasn\u0026rsquo;t prototyping anymore, so I needed to fix this.\nThe fix is obvious: keep sessions open and reuse them.\nDecision 1: A Pool, Not a Single Session My first instinct was one persistent session per server. That fixes the overhead problem but creates a new one: the session becomes a single point of failure. If it dies while a tool call is in flight, everything blocks until the dead session is detected and replaced.\nA pool tolerates one dead session without impacting callers on other sessions. I went with a pool of two per server — enough to survive one failure without blocking, without flooding a small server with idle connections.\nOne design detail I\u0026rsquo;m glad I got right upfront: the pool fill uses a partial-failure policy. If the server is reachable but only opens one session successfully, the caller gets that one session and proceeds. A pool of one is still a pool. Failing hard when you could degrade gracefully is the wrong call.\nRound-robin selection across the pool also means a dead session gets discovered within at most two calls — not after some arbitrary delay.\nDecision 2: Evict the Session, Not the Server When a server restarts, the pool holds stale session objects pointing at dead TCP connections. The next call fails. My first implementation evicted the entire pool entry for that server — which worked fine with one session, but caused a problem under a pool.\nIf two sessions to the same server die simultaneously, two concurrent callers both try to evict. If eviction removes the whole server entry, the second caller finds nothing and may open duplicate sessions. 
The correct fix is to evict the specific dead session and guard against the second caller — if the session is already gone by the time the second caller tries, just return cleanly.\nOne thing that bit me here: the transport layer fires its own cancellation when a connection dies, and asyncio.CancelledError is a BaseException, not an Exception. My original eviction handler caught Exception — which meant the cancellation escaped the handler and the session was never removed from the pool. Adding asyncio.CancelledError to the catch fixed it. Obvious in hindsight, not obvious when you\u0026rsquo;re looking at a session that should be evicted but isn\u0026rsquo;t.\nDecision 3: Timeout + Exponential Backoff Two more failure modes I hit that needed explicit fixes:\nIn-flight hang. When docker stop kills a container, the network namespace stays alive but nothing is listening. A tool call in flight at that moment waits for a TCP response that will never arrive. Without a timeout, the asyncio task blocks indefinitely and the case stays stuck. A per-call timeout bounds every tool call — when it fires, the session is evicted and the retry cycle begins.\nThundering herd on restart. I originally used a fixed retry delay. Under concurrency, every failed caller wakes up at exactly the same moment and hammers the recovering server simultaneously. The fix is exponential backoff with jitter:\ndelay = min(BASE * 2^attempt + random(0, 1), MAX) The jitter — that random(0, 1) — is the part that actually matters. It desynchronizes callers so the recovering server sees a trickle rather than a burst. I expose the base and max as environment variables so they can be tuned per deployment without touching code.\nDecision 4: The Heartbeat Probe Must Test the Real Code Path The reactive eviction handles failures after they hit. 
I added a background heartbeat to catch them proactively — a task that pings every session in the pool on a fixed interval and evicts any that fail before real traffic reaches them.\nThe probe design mattered more than I expected. My first version used list_tools() as the ping. It seemed reasonable — if the server can respond to a discovery call, it\u0026rsquo;s alive. But list_tools() is never called in production. My graph nodes call tools by hardcoded name. I was testing an operation that didn\u0026rsquo;t exist in the real flow.\nI replaced it with a dedicated ping tool on each server that returns \u0026quot;pong\u0026quot; immediately with no I/O. The heartbeat calls it through the exact same function every production tool call uses. If that path is broken, the session gets evicted. Testing the wrong path gives you false confidence.\nThe ping also needs its own timeout. When a container is gracefully stopped, Docker keeps the network namespace alive but stops accepting connections — without a timeout, the ping hangs forever. Five seconds is enough because the ping tool has no I/O. If a no-I/O tool takes more than 5 seconds to respond, the server has a real problem.\nThe Prerequisite I Almost Missed A heartbeat only works if the server can respond to a ping while another tool is in progress. FastMCP runs on uvicorn — a single async event loop. If any tool function makes a synchronous I/O call, it blocks the entire loop. My document generation server originally used the synchronous Anthropic client for LLM calls. A 15-second LLM call meant the heartbeat ping waited 15 seconds and timed out, falsely evicting a healthy session.\nSwitching to the async Anthropic client fixed it. 
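The failure mode is easy to reproduce without FastMCP or any LLM client at all. This minimal asyncio sketch (simulated I/O, hypothetical timings, not my server code) shows why a single synchronous call starves a concurrent ping:

```python
# Minimal repro of a blocked event loop: a sync call inside an async tool
# delays every other coroutine, including the heartbeat ping.
import asyncio
import time

async def blocking_tool() -> str:
    time.sleep(0.5)           # synchronous I/O stand-in: blocks the whole loop
    return "done"

async def async_tool() -> str:
    await asyncio.sleep(0.5)  # async I/O stand-in: yields control while waiting
    return "done"

async def ping() -> str:
    await asyncio.sleep(0)    # one trip through the event loop, no real I/O
    return "pong"

async def ping_latency(tool) -> float:
    """Start a long-running tool, then measure how long a concurrent ping takes."""
    task = asyncio.create_task(tool())
    start = time.monotonic()
    await ping()              # with a blocked loop, this waits for the tool to finish
    elapsed = time.monotonic() - start
    await task
    return elapsed

blocked = asyncio.run(ping_latency(blocking_tool))  # ping stalls ~0.5 s
healthy = asyncio.run(ping_latency(async_tool))     # ping returns almost immediately
print(f"ping behind sync tool: {blocked:.2f}s, behind async tool: {healthy:.4f}s")
```

With the synchronous tool, the ping cannot even be scheduled until the blocking call returns; with the async tool, the loop interleaves both and the ping answers in microseconds. Same server logic, completely different heartbeat behavior.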
This isn\u0026rsquo;t a performance optimization — it\u0026rsquo;s a correctness requirement for any async server making I/O calls.\nSummary Each decision came from something that broke:\nPer-call handshake overhead → Persistent session pool\nSingle session = single point of failure → Pool with round-robin selection\nDead sessions block callers → Per-session eviction with CancelledError catch\nIn-flight hang on server stop → Per-call timeout\nThundering herd on restart → Exponential backoff with jitter\nDead sessions hit before detected → Background heartbeat\nHeartbeat tests wrong code path → Dedicated ping tool on real call stack\nBlocking I/O breaks heartbeat → Async I/O in all server tool functions\nPart 2 covers the system-level concerns: protecting PII-carrying tool calls in transit, making 8–10 calls across 4 servers traceable to a single case, and the operational decisions that make this deployable on a client system.\nPart 2: Authentication, Observability, and Operational Design →\n","permalink":"https://aitechy226.github.io/posts/mcp-production-part-1/","summary":"\u003cdiv class=\"series-banner\"\u003e\n  \u003cspan class=\"series-label\"\u003eMCP in Production \u0026middot; Part 1 of 2\u003c/span\u003e\n  \u003cspan class=\"series-links\"\u003e\u003ca href=\"/posts/mcp-production-part-2/\"\u003ePart 2: Authentication, Observability, and Operational Design →\u003c/a\u003e\u003c/span\u003e\n\u003c/div\u003e\n\u003cp\u003eMost MCP client examples open a session, call a tool, and close the session. That pattern is fine for demos. It breaks in production in ways that aren\u0026rsquo;t obvious until you\u0026rsquo;re staring at a hung process or a spike in latency.\u003c/p\u003e\n\u003cp\u003eThis is Part 1 of a two-part series on what it takes to run an MCP client reliably. I\u0026rsquo;ll cover the transport layer: sessions, pooling, dead connection recovery, timeouts, and the heartbeat. 
\u003ca href=\"/posts/mcp-production-part-2/\"\u003ePart 2\u003c/a\u003e covers the system layer: authentication, observability, and operational design.\u003c/p\u003e","title":"MCP in Production, Part 1: Persistent Sessions, Pooling, and Fault Tolerance"},{"content":" MCP in Production \u0026middot; Part 2 of 2 ← Part 1: Persistent Sessions, Pooling, and Fault Tolerance Part 1 covered the transport layer — keeping sessions alive, recovering from failures, and a few edge cases that only surface when you\u0026rsquo;re running a real pool under real failure conditions. This part covers what I\u0026rsquo;d call system readiness: the things that separate a working prototype from something I could hand to a client and say \u0026ldquo;deploy this.\u0026rdquo;\nAuthentication: I Almost Shipped PII Over Unauthenticated HTTP Tool call payloads in my KYC system carry legal_name, beneficial_owners, sanctions_clear, and pep_role. In a Docker Compose setup, every container in the network can call every MCP server directly — no authentication, no gate. I caught this before shipping, but it was closer than I\u0026rsquo;d like to admit.\nThe fix I landed on was bearer token authentication injected at the transport layer. When a session is opened, the token is passed as an HTTP header — so every session in the pool carries auth automatically, with no per-call overhead. On each MCP server, the token is validated before any request reaches the MCP layer.\nOne implementation detail worth sharing: I used pure ASGI middleware on the server side rather than Starlette\u0026rsquo;s BaseHTTPMiddleware. The reason is FastMCP uses Server-Sent Events for streaming. BaseHTTPMiddleware buffers responses, which breaks SSE. The pure ASGI approach intercepts at the connection level and never touches the response body. I learned that the hard way on my first attempt.\nOne MCP_BEARER_TOKEN env var is shared across all five containers. Empty means auth is off for local development. 
Set means the gate is live on every server.\nWhat This Doesn\u0026rsquo;t Cover I want to be honest about the limits. Bearer tokens over plain HTTP protect against unauthorized callers within the Docker network. They don\u0026rsquo;t encrypt payloads in transit — anything that can observe the Docker bridge can read the data. And they don\u0026rsquo;t verify caller identity: any container with the token is trusted equally.\nFor a real compliance deployment, mTLS is the right answer. It encrypts inter-service traffic and lets each server verify the caller\u0026rsquo;s identity specifically — not just that they have the shared secret. That\u0026rsquo;s the next step; bearer tokens are the floor, not the ceiling.\nCorrelation IDs: Debugging Across Four Servers Was a Nightmare Early on, debugging a failed case meant opening four log files and trying to correlate entries by timestamp. With 8–10 tool calls across 4 servers, that was genuinely painful.\nThe fix was straightforward once I decided to do it: thread a case_id through call_tool as a keyword-only parameter and inject it into every tool call\u0026rsquo;s arguments before it hits the server. Each server logs case_id at tool entry. Every log line for a given case, across all four servers, now shares one identifier. One grep gives you the complete trace.\nTwo design details I\u0026rsquo;m glad I got right: making case_id keyword-only means existing callers don\u0026rsquo;t break when you add it. And creating a new dict rather than mutating the caller\u0026rsquo;s arguments avoids subtle bugs when the same arguments dict gets reused. Small things, but worth getting right upfront.\nThree Operational Decisions That Made Deployment Easier Open sessions lazily, not at startup In Docker Compose, containers start in parallel. If I opened sessions at startup, the orchestrator\u0026rsquo;s readiness would depend on all four MCP servers being up first. 
Lazy initialization inverts that: the orchestrator starts immediately, the first tool call triggers pool filling, and the retry logic handles a server that isn\u0026rsquo;t ready yet. This also means adding a new server requires no changes to startup order or health-check configuration.\nAll thresholds in config, not code Every timeout, retry parameter, pool size, and heartbeat interval is an environment variable. A client deploying this system can tune for their network without touching code. I can also disable the heartbeat entirely with MCP_HEARTBEAT_INTERVAL_SECONDS=0, which simplifies test setups considerably.\nShut down cleanly I wire the heartbeat start and stop into the FastAPI lifespan context manager. On shutdown, close_all_sessions() cancels the heartbeat, waits for it to finish, then evicts every session in every pool. Without this, the server logs are full of errors from sessions closed mid-request. It\u0026rsquo;s a small thing that makes production logs much easier to read.\nTradeoffs Worth Knowing False positive evictions. If a healthy server is momentarily overloaded and the ping tool responds slowly, the session gets evicted unnecessarily. It recovers on the next call, but at the cost of a fresh session open. This is why the ping tool has no I/O — a slow response from a no-I/O tool is an unambiguous signal that something is wrong with the server, not normal variance.\nPool size vs. idle connections. Each session holds an open SSE connection for the lifetime of the process. A pool size of two means two idle connections per server, always open. Setting it too high multiplies that across all servers. I stayed at two — enough to tolerate one dead session without blocking callers.\nHeartbeat doesn\u0026rsquo;t protect in-flight calls. If a server dies while a tool call is in flight, that call fails and reactive eviction handles it. The heartbeat only helps the next call that would have hit a stale session. 
The in-flight failure won\u0026rsquo;t hang forever — it\u0026rsquo;s bounded by the per-call timeout — but it will still fail.\nConclusion Looking back across both posts, every decision traces back to something that broke during testing. I didn\u0026rsquo;t design any of this speculatively.\nPer-call overhead → session pool\nSingle session fragility → pool with round-robin\nIn-flight hangs → per-call timeouts\nThundering herd → exponential backoff with jitter\nLate failure detection → heartbeat on the real code path\nUnauthenticated PII → bearer tokens at the transport layer\nFour-server log chaos → correlation IDs threaded through every call\nThe systems I trust most are the ones built by breaking things deliberately and fixing the actual root cause — not the ones built by reading a resilience checklist upfront. Build it, break it, understand why, fix it. That\u0026rsquo;s the loop.\n← Part 1: Persistent Sessions, Pooling, and Fault Tolerance\n","permalink":"https://aitechy226.github.io/posts/mcp-production-part-2/","summary":"\u003cdiv class=\"series-banner\"\u003e\n  \u003cspan class=\"series-label\"\u003eMCP in Production \u0026middot; Part 2 of 2\u003c/span\u003e\n  \u003cspan class=\"series-links\"\u003e\u003ca href=\"/posts/mcp-production-part-1/\"\u003e← Part 1: Persistent Sessions, Pooling, and Fault Tolerance\u003c/a\u003e\u003c/span\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003ca href=\"/posts/mcp-production-part-1/\"\u003ePart 1\u003c/a\u003e covered the transport layer — keeping sessions alive, recovering from failures, and a few edge cases that only surface when you\u0026rsquo;re running a real pool under real failure conditions. 
This part covers what I\u0026rsquo;d call system readiness: the things that separate a working prototype from something I could hand to a client and say \u0026ldquo;deploy this.\u0026rdquo;\u003c/p\u003e","title":"MCP in Production, Part 2: Authentication, Observability, and Operational Design"},{"content":"Over the last year, I\u0026rsquo;ve been building production-grade agentic AI systems — LangGraph state machines, multi-agent orchestration, deterministic validation pipelines designed for regulated environments. And somewhere in that work, I noticed something: the architecture I was using to build reliable AI agents was a pretty accurate model of how I actually operate professionally.\nSo I mapped it out. Not as a second brain or a structured resume. As an agent specification — a design exercise in making professional expertise explicit, structured, and transferable.\nThe full specification — personas, tools, skills, rules, and memory — is published alongside this post: The Sri System →\nIt's living documentation, updated when a new engagement changes how I think about something.\nWhat Most \u0026ldquo;Digital Twin\u0026rdquo; Attempts Get Wrong The standard approach to capturing professional expertise is knowledge capture: write down what you know, organize it into categories, maybe build a RAG corpus over it.\nThat produces something that can answer what does this person know about X?\nIt does not produce something that can answer how would this person approach X, and why?\nThe difference is behavioral modeling versus knowledge storage. One is a reference library. The other is an operating system.\nThe Architecture The system is organized into five layers, each with a specific role.\nPersonas — Context-Activated Operating Modes I show up differently depending on context. With a G-SIB financial institution, I\u0026rsquo;m a resilience architect — zero-failure and 100% auditability are design requirements, not aspirations. 
In a presales conversation, I\u0026rsquo;m a technical translator — the job is to make complex architecture legible across three stakeholder levels simultaneously without losing accuracy at any of them.\nThese aren\u0026rsquo;t roles I switch between. They\u0026rsquo;re operating modes that determine which workflows activate, which communication style applies, and what success looks like in that context.\nFour personas: FORWARD_SYNTHESIS (production AI architecture), FSI_RESILIENCE (regulated enterprise), INFRA_ORCHESTRATOR (infrastructure at scale), TECHNICAL_PRESALES (commercial execution).\nTools — Repeatable Workflows Each persona calls tools — not software tools, but repeatable professional workflows I\u0026rsquo;ve developed and refined over decades.\nENTERPRISE_DEAL_ARCHITECTURE is a process: technical architecture as the foundation, parallel stakeholder engagement across commercial and executive tracks, RFI/RFP shaped as a competitive instrument, ecosystem orchestration across product teams and partners. It\u0026rsquo;s not ad hoc. It\u0026rsquo;s a procedure I\u0026rsquo;ve run enough times to codify.\nDETERMINISTIC_AI_VALIDATION is another: typed data contracts, parallel claim verification, groundedness scoring, source attribution. Every factual assertion carries provenance. I built this because hallucination in a regulated environment isn\u0026rsquo;t a model limitation to accept — it\u0026rsquo;s a systems failure to design out.\nFive tools in total. Each documented as a methodology, not a description.\nSkills — Domain Knowledge Modules Tools orchestrate skills. 
Skills are the deep domain knowledge that gets loaded depending on what the workflow needs.\nFSI_ARCHITECTURE covers regulatory frameworks (SOX, DORA, Basel III), AI deployment in regulated environments, cloud and infrastructure strategy for G-SIBs, and the procurement mechanics of large financial institutions — risk committees, legal review, compliance sign-off, architecture review boards.\nGENAI_SOLUTION_DEVELOPMENT covers LangGraph state machines, hybrid retrieval, LLM backends, inference serving, governance and validation patterns.\nThe skills don\u0026rsquo;t activate independently. They get called by tools, which get called by personas. The hierarchy matters.\nRules — Invariant Constraints This is the layer most people miss when they think about professional identity.\nI operate under two rules that don\u0026rsquo;t change regardless of persona, context, or time pressure.\nBELL_LABS_STANDARD: Everything ships modular, documented, and auditable. No exceptions. This isn\u0026rsquo;t a preference — it\u0026rsquo;s a constraint derived from 30 years of watching what happens to systems when it\u0026rsquo;s violated. At Bell Labs, I was managing carrier-scale network infrastructure where a single undocumented dependency could cascade into service affecting millions of connections. The discipline was earned, not chosen.\nAUTOMATION_IS_THE_MISSION: Manual processes are defects. Any task performed identically more than twice belongs in code. This compounds — automating repeatable work at scale frees engineer capacity for problems that actually require human judgment. The leverage is the point, not the efficiency.\nRules propagate into every deliverable regardless of which persona is active or which tool is running. An agent without invariant constraints has knowledge but not judgment.\nMemory — Convictions With Provenance The memory layer isn\u0026rsquo;t a knowledge base. 
It\u0026rsquo;s a record of which environments changed how I think, and why.\nLehman Brothers collapse (2008): I was managing global core services during the firm\u0026rsquo;s failure. The lesson wasn\u0026rsquo;t about financial risk — it was about building systems that remain operable by whoever inherits them, under any circumstances. Technology that depends on institutional continuity for its own correctness is failed architecture.\nJP Morgan Chase (2010–2014): Automating lifecycle management for 200,000+ VMs across global data centers showed me that automation doesn\u0026rsquo;t just make things faster — it compounds. The capacity freed by eliminating manual processes gets redirected toward irreducible problems.\nDell BFSI (2015–2023): Eight years, $50M–$300M TCV deals, 110%+ quota for three consecutive years. The core conviction from that period: the deal is not the outcome. The customer\u0026rsquo;s production deployment is the outcome. Commercial relationships that emerge from authentic technical problem-solving are durable. Relationships built the other way aren\u0026rsquo;t.\nThese aren\u0026rsquo;t lessons — they\u0026rsquo;re load-bearing convictions. They\u0026rsquo;re why the rules are what they are.\nWhy This Is Worth Doing I structured this as a Claude Code project because it has two distinct uses.\nThe first is AI agent onboarding. I can give an AI agent this system as context and it operates with my communication style, my workflows, my constraints, and my reasoning — not a generic assistant mode. The persona → tool → skill hierarchy means the agent knows not just what I know, but when different knowledge gets activated and what principles govern the output.\nThe second is human onboarding. A new colleague, a client, a collaborator — they can read this and understand how I operate, what to expect, and why I make the decisions I make. Most professionals keep this in their heads. 
Making it explicit, structured, and legible is the difference between institutional knowledge and institutional knowledge that actually transfers.\nMost people have this system. Very few have made it explicit enough to be useful to anyone else.\nI design production-grade AI systems for regulated industries. If any of this connects with work you\u0026rsquo;re doing, the best place to reach me is LinkedIn.\n","permalink":"https://aitechy226.github.io/posts/professional-digital-twin/","summary":"\u003cp\u003eOver the last year, I\u0026rsquo;ve been building production-grade agentic AI systems — LangGraph state machines, multi-agent orchestration, deterministic validation pipelines designed for regulated environments. And somewhere in that work, I noticed something: the architecture I was using to build reliable AI agents was a pretty accurate model of how I actually operate professionally.\u003c/p\u003e\n\u003cp\u003eSo I mapped it out. Not as a second brain or a structured resume. As an agent specification — a design exercise in making professional expertise explicit, structured, and transferable.\u003c/p\u003e","title":"Designing a Professional Digital Twin: The Architecture"},{"content":"When I designed the architecture for my KYC onboarding orchestrator, I made a deliberate choice: use MCP not as an LLM-to-tool protocol — the way it was originally designed — but as a service-to-service protocol between a LangGraph orchestrator and a set of independently deployable integration servers.\nIt worked. But it came with real tradeoffs I want to document, because I don\u0026rsquo;t think this pattern is well understood yet.\nBackground: What I Built The system onboards corporate clients through a fixed sequence of checks — entity profile retrieval, credit rating, sanctions screening, PEP check, CRM update, document generation. Each of those integrations runs as a separate MCP server. 
A LangGraph graph orchestrates the sequence by calling MCP tools directly from its nodes.\nFour MCP servers. One orchestrator. No LLM in the routing path.\nThe Two Patterns Before getting into tradeoffs, it\u0026rsquo;s worth being precise about what distinguishes these patterns.\nThe original MCP pattern — LLM-to-tool: The LLM reads available tool schemas at runtime via list_tools(), decides which tool to call, executes it, interprets the result, and decides the next step. The LLM is the orchestrator. This is how Claude Desktop works with MCP servers.\nMy pattern — service-to-service: A LangGraph node calls specific tools by hardcoded name in a fixed sequence. The LLM never sees the tool schemas. MCP is purely the transport layer between services.\nTradeoff 1: Predictability vs Flexibility In my architecture, the call sequence is defined in the graph and cannot deviate. run_compliance_checks always follows fetch_company_data. A BLOCKED case always skips CRM. This is testable, traceable, and reproducible — the same entity run twice produces the same sequence of tool calls.\nThe cost is rigidity. If a new entity type requires a different check sequence, I have to change the graph. In the LLM-to-tool pattern, the LLM could theoretically adapt its tool use based on what it finds at runtime.\nIn practice I view that flexibility as a risk, not a feature. LLMs do not reliably honor sequencing rules given in system prompts. I made the deliberate choice to move routing logic into the graph precisely because the LLM cannot be trusted to enforce it consistently under all conditions. I traded flexibility for guarantees, and I\u0026rsquo;d make the same call again.\nTradeoff 2: Auditability vs Discoverability My pattern produces a complete, deterministic audit trail. Every tool call maps to a specific node, with specific inputs from state, producing specific outputs written back to state. When something fails, I know exactly where.\nWhat I gave up is discoverability. 
The list_tools() mechanism — which lets a client dynamically learn what tools a server exposes — is never called in my system. Nodes call tools by hardcoded name directly.\nI\u0026rsquo;m carrying the full weight of the MCP discovery mechanism — handshake, capability negotiation, schema generation — and using none of it. Discoverability matters when an LLM is the client and needs to learn what\u0026rsquo;s available at runtime. When the caller is a graph node written by a developer, that mechanism adds overhead without value.\nTradeoff 3: Integration Contracts vs Protocol Overhead This is where my choice is most defensible.\nEach MCP server defines an explicit, schema-enforced contract for its tools — generated automatically from Python type hints. The mock servers in the current build will eventually be replaced with real production integrations — real Moody\u0026rsquo;s API, real Refinitiv World-Check, real Salesforce. When that happens, the tool names, argument shapes, and response structures must stay identical. My orchestration layer changes nothing. MCP enforces that contract at the boundary.\nThe cost is protocol overhead I wouldn\u0026rsquo;t otherwise pay. Each tool call carries a tools/call JSON-RPC envelope over HTTP. That overhead is real but manageable with persistent pooled sessions — the connection and MCP handshake are established once and reused across all calls to that server, so a case with 8–10 tool calls pays the handshake cost twice (once per pool slot), not once per call. Even so, plain HTTP REST between containers would carry less overhead for the same job.\nREST with a well-defined API contract would have been the simpler choice. 
I chose MCP because I wanted a single standardized protocol across all four servers rather than four bespoke REST APIs — and because the tool schema generation from Python type hints gave me the contract enforcement for free.\nMCP earns its keep here not because of transport efficiency, but because of what it enables at integration time: a stable, versioned contract that makes the mock-to-production swap possible without touching the orchestration layer.\nWhen This Pattern Makes Sense Use MCP as LLM-to-tool when:\nThe LLM needs to dynamically discover and sequence tools at runtime The tool set is not fully known at build time Flexibility in tool selection matters more than sequencing guarantees Use MCP as service-to-service (my approach) when:\nYou need stable contracts across independently replaceable integrations Routing and sequencing must be deterministic and auditable Services are deployed independently and you want a standardized protocol rather than bespoke REST APIs per integration Use neither — just REST or direct function calls when:\nThe integrations are internal and will never be swapped You control both sides of every call The protocol overhead isn\u0026rsquo;t justified by any real integration boundary The Honest Summary Using MCP as a service-to-service protocol is not the natural fit for the protocol. You pay for discovery you don\u0026rsquo;t use. But the contract enforcement at integration boundaries is real, and for a system designed to swap mock integrations for production ones, that contract is exactly what you need.\nThe pattern works. Just go in clear-eyed about what you\u0026rsquo;re getting and what you\u0026rsquo;re giving up.\nBuilding production-grade agentic AI systems for regulated industries. 
If any of this connects with work you\u0026rsquo;re doing, reach me on LinkedIn.\n","permalink":"https://aitechy226.github.io/posts/mcp-service-to-service/","summary":"\u003cp\u003eWhen I designed the architecture for my KYC onboarding orchestrator, I made a deliberate choice: use MCP not as an LLM-to-tool protocol — the way it was originally designed — but as a service-to-service protocol between a LangGraph orchestrator and a set of independently deployable integration servers.\u003c/p\u003e\n\u003cp\u003eIt worked. But it came with real tradeoffs I want to document, because I don\u0026rsquo;t think this pattern is well understood yet.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"background-what-i-built\"\u003eBackground: What I Built\u003c/h2\u003e\n\u003cp\u003eThe system onboards corporate clients through a fixed sequence of checks — entity profile retrieval, credit rating, sanctions screening, PEP check, CRM update, document generation. Each of those integrations runs as a separate MCP server. A LangGraph graph orchestrates the sequence by calling MCP tools directly from its nodes.\u003c/p\u003e","title":"I Used MCP as a Service-to-Service Protocol. Here's What I Learned."},{"content":"I\u0026rsquo;ve spent the last several months building agentic AI systems — not demoing them, building them. And I want to share something that took me longer than I\u0026rsquo;d like to admit to fully internalize.\nThe hype is real. The gap is also real. And the gap is closing — but not in the way most people think.\nThis reflects where I am in March 2026, building on roughly 18 months of hands-on agentic work. The field is moving fast and I expect some of this to age.\nThe Demo Always Works. Here\u0026rsquo;s Why. Every impressive agent demo you\u0026rsquo;ve seen — the one where the AI autonomously calls five tools, synthesizes the results, and delivers a perfect output — was run until it worked, then recorded. The failures happened before the camera started. 
That\u0026rsquo;s not dishonest. That\u0026rsquo;s just how demos work. But it creates a perception of reliability that doesn\u0026rsquo;t survive contact with production.\nIn my own work, I built a multi-agent system using a ReAct-style architecture. The agent would reason about which tool to call next, execute it, observe the result, and decide the next step. In demos it looked extraordinary. In testing — running llama3.1:8b locally — it didn\u0026rsquo;t reliably honor system prompt directives for retry logic. It would skip steps, call tools out of order, hallucinate arguments. The same workflow that worked perfectly at 2pm would fail at 10am for reasons I couldn\u0026rsquo;t reproduce.\nTo be precise about this: with smaller local models, this failure is consistent and reproducible. With frontier models like Claude Sonnet the reliability picture is meaningfully better — tool calling is one of the areas where the capability gap between an 8B local model and a frontier API model is largest, not smallest. If I had been running Sonnet, the workflow would very likely have executed correctly.\nBut here\u0026rsquo;s the distinction that matters: reliability and auditability are different problems. Only one of them improves with a better model. More on that below.\nThat system got rebuilt regardless. The routing logic moved out of the LLM and into a deterministic orchestration layer. The LLM kept the one job it\u0026rsquo;s actually reliable at: generating human-readable narrative from structured inputs. The system became predictable, testable, and deployable.\nThe Field Has Already Validated This Pattern — Quietly Here\u0026rsquo;s something worth noting: I\u0026rsquo;m not describing a novel insight. The fact that frameworks like LangGraph, Temporal, and Prefect have become the default infrastructure for serious agentic deployments is the field converging on exactly this architecture. 
Deterministic orchestration with LLM-powered steps is increasingly how production systems get built — not because engineers are being conservative, but because the teams that tried the alternative learned the same lessons and moved on.\nThe tooling ecosystem didn\u0026rsquo;t build these frameworks for fun. They exist because LLM-controlled flow at production scale has a known failure profile and practitioners needed infrastructure to route around it.\nThe Four Buckets Most \u0026ldquo;Crushing It\u0026rdquo; Claims Fall Into When I hear that someone is \u0026ldquo;crushing it\u0026rdquo; with AI agents, I\u0026rsquo;ve started mentally categorizing the claim:\n1. The task is forgiving. Blog posts, email drafts, document summaries. If the agent skips a step the output is still usable. The agent looks impressive because low-stakes failures are invisible.\n2. The demo is cherry-picked. See above.\n3. The definition of \u0026ldquo;agentic\u0026rdquo; is genuinely contested. The line between \u0026ldquo;function with an LLM inside\u0026rdquo; and \u0026ldquo;agent\u0026rdquo; is blurry and the field hasn\u0026rsquo;t settled on a definition — and that\u0026rsquo;s partly a legitimate conceptual ambiguity, not just a rhetorical sleight of hand. But the looseness of the term does create expectations that a lot of deployed systems can\u0026rsquo;t meet. Worth being honest about that gap even when the ambiguity is real.\n4. The guardrails are invisible. The production systems that are genuinely generating ROI — coding assistants, triage workflows, document extraction pipelines — almost all have a human in the loop, a constrained action space, or deterministic orchestration controlling the sequence. The \u0026ldquo;agent\u0026rdquo; is operating inside a box. The box is doing the heavy lifting.\nThe Mental Model I\u0026rsquo;ve Found Most Useful LLMs are reliable at generation. 
They are unreliable at decision-making under constraint.\nAsk an LLM to synthesize a narrative from structured evidence — it does this beautifully. Ask it to decide whether to escalate a KYC flag, route a transaction for review, or determine which of five tools to call next — and you\u0026rsquo;ve moved into territory where non-determinism is a liability, not a feature.\nI want to be precise here because this is where the nuance matters: the reliability gap is narrowing. Frontier models today are meaningfully better at tool-calling and sequencing than models from 18 months ago, and that improvement is real and continuing. The argument for deterministic orchestration is not that LLMs will never be reliable enough — it\u0026rsquo;s that reliability alone doesn\u0026rsquo;t solve the problem in regulated environments.\nThe Argument That Survives Model Improvements Even if models become perfectly reliable at tool-calling, the auditability argument doesn\u0026rsquo;t go away.\nA deterministic orchestration graph is inspectable. You can write a unit test that asserts step 3 always follows step 2. You can show a regulator exactly what sequence ran on a specific transaction and why. You cannot do that with an LLM decision, regardless of how reliable the model gets.\nIn regulated workflows — KYC, lending decisions, compliance screening — auditability is not an engineering preference. It\u0026rsquo;s a legal requirement. If an agent skips a sanctions check because the LLM decided the entity name was ambiguous, that decision is buried in a reasoning trace nobody reviewed, and you cannot guarantee it won\u0026rsquo;t happen again. That failure mode is unacceptable at any model tier.\nSo there are actually two separate arguments here worth keeping distinct:\nReliability argument — model-dependent, shrinking, probably less severe in two years than today. Using a frontier model materially reduces this risk today. 
Auditability argument — compliance-driven, durable, does not improve as models improve. This one doesn\u0026rsquo;t care what model you\u0026rsquo;re running. The first argument is a reason to be cautious about model choice. The second is a reason to build this way permanently, at least in regulated domains.\nWhere the Hype Is Genuinely Justified I don\u0026rsquo;t want this to read as pure skepticism — that would be equally dishonest.\nAgents are genuinely transformative for tasks that are high volume, have well-defined success criteria, and are either reversible or human-reviewable before action. Code generation, first-pass document triage, data extraction, customer-facing Q\u0026amp;A with retrieval — these are real productivity wins, and the productivity numbers cited for these categories are plausible. I\u0026rsquo;ve seen them.\nThe gap isn\u0026rsquo;t that agents don\u0026rsquo;t work. The gap is between what works in a demo and what works reliably in a regulated, auditable, production workflow where a skipped step has consequences.\nA Distinction Worth Making: Bounded vs. Unbounded Agents A useful distinction the field hasn\u0026rsquo;t settled on yet: bounded vs. unbounded agentic systems.\nAn unbounded agent has an open-ended action space — it decides what to do next at every step, from an unrestricted set of options. That\u0026rsquo;s what ReAct loops implement. It\u0026rsquo;s also what makes them hard to test, audit, and trust in production.\nA bounded agentic system takes a high-level goal and autonomously executes a constrained, deterministic workflow to achieve it — coordinating tools and systems the user never touches directly. The sequence is fixed. The decisions within the sequence are rule-based. The LLM synthesizes the output.\nThe second definition still qualifies as agency. 
It just qualifies as the kind that ships.\nWhere I\u0026rsquo;ve Landed — For Now The architecture I keep returning to: deterministic orchestration layer owns the flow, LLM owns the prose. Graph controls decisions, model generates interpretation. That\u0026rsquo;s not less agentic — it\u0026rsquo;s production-grade agentic. And in my experience, it\u0026rsquo;s a far better conversation to have with a risk or compliance team than \u0026ldquo;trust the model.\u0026rdquo;\nI\u0026rsquo;ll also say directly: this is a point-in-time view. The field is moving fast, the models are improving, and I\u0026rsquo;m genuinely open to being wrong about where the equilibrium lands. But the auditability argument feels durable to me regardless of where model capability goes.\nThis has been my experience. I\u0026rsquo;m curious whether it matches yours — or whether you\u0026rsquo;re seeing something in production that pushes back on this. What are you building, and what\u0026rsquo;s actually working?\nBuilding AI infrastructure tooling and agentic systems. Always interested in the gap between what\u0026rsquo;s demoed and what ships.\n","permalink":"https://aitechy226.github.io/posts/agentic-hype-vs-reality/","summary":"\u003cp\u003eI\u0026rsquo;ve spent the last several months building agentic AI systems — not demoing them, building them. And I want to share something that took me longer than I\u0026rsquo;d like to admit to fully internalize.\u003c/p\u003e\n\u003cp\u003eThe hype is real. The gap is also real. And the gap is closing — but not in the way most people think.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eThis reflects where I am in March 2026, building on roughly 18 months of hands-on agentic work. 
The field is moving fast and I expect some of this to age.\u003c/em\u003e\u003c/p\u003e","title":"Why Your AI Agent Demo Looks Great and Your Production System Doesn't"},{"content":"I\u0026rsquo;m Srikanth Samudrla — Field CTO and Solutions Architect specializing in AI infrastructure and enterprise agentic systems.\nThis blog is where I write about what I\u0026rsquo;m actually building — not what I\u0026rsquo;m demoing. The gap between those two things is larger than most people admit, and I think that gap is worth writing about honestly.\nWhat you\u0026rsquo;ll find here:\nAgentic system architecture — what works in production, what doesn\u0026rsquo;t, and why\nAI infrastructure — GPU clusters, data center validation, the physical layer that makes LLMs run\nPractitioner observations — field notes from sitting across the table from the teams responsible for deploying this stuff\nConnect:\nLinkedIn ","permalink":"https://aitechy226.github.io/about/","summary":"About The Practical AI Builder","title":"About"},{"content":"This site is about building AI systems that actually work in production — not demos. Here\u0026rsquo;s a map of the writing and how the pieces connect.\nOn GPU Infrastructure GPU infrastructure decisions look like price comparisons. They\u0026rsquo;re not — they\u0026rsquo;re configuration problems, data placement problems, and interconnect problems that $/hr doesn\u0026rsquo;t capture.\nGPU Infrastructure: The Five Calculations That Actually Matter (new) — A framework developed while building a GPU recommendation engine, with a full worked example on a 70B fine-tuning scenario. 
On Enterprise RAG What it takes to build a RAG system a compliance officer or clinical analyst can actually rely on — deterministic retrieval, evidence gating, and the gap between working and trustworthy.\nThe Trust Layer: What Separates Good RAG from Enterprise RAG (new) — Four bugs found while stress-testing a RAG system for regulated industries, and the architectural properties they reveal.\nOn AI PC Evaluation Why enterprise AI PC procurement is harder than vendor benchmarks suggest — and what it takes to measure the right things.\nThe AI PC Buying Problem Every Enterprise Needs to Solve (new) — Three user groups, one procurement decision, and why vendor metrics don\u0026rsquo;t answer the question IT leaders actually need answered.\nOn Agentic Architecture The foundation. What agentic systems actually look like when you move past the demo and into something deployable.\nWhy Your AI Agent Demo Looks Great and Your Production System Doesn\u0026rsquo;t — The gap between demos and production, the four buckets most \u0026ldquo;crushing it\u0026rdquo; claims fall into, and the architecture pattern I keep returning to.\nDesigning a Professional Digital Twin: The Architecture — What it looks like when you model professional expertise as an agent specification — personas, tools, skills, rules, and memory. Includes the full Sri System.\nOn MCP in Production A ground-up account of using the Model Context Protocol as a service-to-service layer in a regulated KYC system — the tradeoffs of the pattern and what it takes to run it reliably.\nI Used MCP as a Service-to-Service Protocol. Here\u0026rsquo;s What I Learned. 
— Why I used MCP as a transport layer between a LangGraph orchestrator and four integration servers, and the tradeoffs that come with it.\nMCP in Production, Part 1: Persistent Sessions, Pooling, and Fault Tolerance — Five transport-layer decisions, each driven by a real failure: session pooling, dead connection eviction, cancel scope isolation, timeouts, and heartbeat design.\nMCP in Production, Part 2: Authentication, Observability, and Operational Design — Bearer token auth at the transport layer, correlation IDs across four servers, lazy session init, and clean shutdown.\nNew posts go to /posts/. Organized by topic as the archive grows.\nSubscribe to get new posts by email.\n","permalink":"https://aitechy226.github.io/start/","summary":"A map of the writing on this site — where to start and how the pieces connect.","title":"Start Here"}]