
Agentic AI at Scale: What Building an MCP Server Taught Me About Cost

A PM's perspective on why agentic architecture decisions are really infrastructure economics decisions in disguise.

After two decades in technology, I’m still amazed at how fast I need to keep learning, especially now, as artificial intelligence isn’t just reshaping our products but transforming the underlying economics. Staying on top of these shifts isn’t optional; it’s what keeps me energized and effective. As a PM working at the intersection of AI and cloud platforms, I spend a lot of time thinking about both.

Earlier this month I published the IIS Migration MCP Server, an agentic system that guides teams through migrating ASP.NET workloads to Azure App Service. It's early-stage work, built to explore what AI-orchestrated migration looks like in practice, and building it taught me more than I expected about infrastructure economics.

At pilot scale, cost doesn't register as a real problem. But the architecture decisions you make at this stage are exactly the ones that become expensive later. As agentic systems like this move toward production, cost is going to be one of the first questions that surfaces, and I'd rather think it through now than retrofit it after the fact.

This post is my attempt to share that thought. Less a set of answers, more a framing of the questions worth asking early.

Something worth thinking about in multiphase agentic systems

One thing that became clearer to me as I built the IIS migration pilot is that a single user interaction in a multiphase agentic system isn't one model call; it's several. In the migration system, a full end-to-end run touches five phases, each handled by a specialist subagent coordinated by a root orchestrator. Even a modest interaction involves multiple inference passes.

Each of those passes carries context: system prompts, tool schemas for all 13 MCP tools, conversation history, the JSON output from the previous phase. By the time you're deep into Phase 3 generating an ARM template based on AppCat analysis, that context window has grown considerably.

I don't raise this as a problem with architecture. There are good reasons why each phase needs context from the previous one. But it's a pattern worth being deliberate about as these systems grow.

Here's a concrete illustration of why it compounds quickly. Say your system prompt and tool schemas together total 5,000 tokens, a reasonable figure for a well-scoped agentic system. If the agent takes 5 turns to complete a task, you haven't paid for 5,000 tokens; you've paid for 25,000, because that full context is resent on every single inference pass before the model has generated a single token of useful output. Layer in retrieved documents or a growing conversation history on top of that, and the number climbs further. At pilot scale this is invisible. At 10,000 task completions a month, a meaningful share of your inference bill is simply the cost of retransmitting the same system prompt, repeatedly.
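The arithmetic is simple enough to sketch. This is a back-of-envelope model only; the token figures are the illustrative ones from above, not measurements from any real system.

```python
def total_overhead_tokens(fixed_context: int, turns: int) -> int:
    """Fixed context (system prompt + tool schemas) is resent on every turn."""
    return fixed_context * turns

def total_input_tokens(fixed_context: int, per_turn_history: int, turns: int) -> int:
    """Same, plus a conversation history that grows linearly with each turn."""
    history = sum(per_turn_history * t for t in range(turns))  # 0, 1x, 2x, ...
    return fixed_context * turns + history

print(total_overhead_tokens(5_000, 5))        # 25000 tokens of resent overhead
print(total_input_tokens(5_000, 1_000, 5))    # overhead plus a growing history
```

Even before output tokens or retrieved documents enter the picture, the fixed overhead alone scales linearly with turn count.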

This isn't a reason to avoid agentic architecture. It's a reason to be deliberate about context discipline early: keeping schemas lean, scoping each subagent's prompt to only what that phase needs, and thinking carefully about what travels through the full loop versus what can stay local to a single step.

Why this is a hardware constraint, not just a billing one

I'll be honest: I didn't fully appreciate the infrastructure side of this until I started digging into how LLM inference works on the hardware underneath it.

The short version: GPU memory (VRAM) is the binding constraint, not raw processing power. When a model processes a long context, it stores mathematical representations of those tokens in fast GPU memory, in what's called the KV cache. Agentic workloads are context-hungry by design. Once that memory fills up, you can't batch additional requests efficiently, utilization drops, and cost per output token rises.

For teams running models on provisioned cloud GPU infrastructure, monitoring memory utilization alongside raw GPU utilization gives a more complete picture of efficiency, and the two don't always tell the same story.
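To make the VRAM pressure concrete, here's a rough KV cache size estimate using the standard formula (keys and values stored per token, per layer). The architecture numbers are assumed for illustration of a 7B-class model, not taken from any specific deployment.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one request at fp16/bf16 precision.
    The leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-class architecture with grouped-query attention:
size = kv_cache_bytes(seq_len=32_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")   # roughly 4.2 GB for a single 32k-token request
```

A handful of concurrent long-context agentic requests can consume a large fraction of a GPU's VRAM on cache alone, which is exactly why batching efficiency degrades as contexts grow.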

Understanding this changed how I thought about the design of the IIS migration system specifically. It wasn't just about making fewer API calls. It was about making the right kind of calls, matched to the right compute.

Cost-aware design choices in the IIS migration pilot

Some of the decisions I made while building the pilot weren't primarily about cost, but looking back, they happen to be good cost practice too. They're worth naming explicitly.

Separating orchestration from execution

The orchestrator agent (@iismigrate) doesn't do the heavy analytical work itself. It delegates to specialist subagents — iis-discover, iis-assess, iis-recommend, iis-deploy-plan, iis-execute. Each subagent has a narrow, well-defined job with its own context scope.

This matters because each subagent call carries only the context relevant to its phase. The assessment agent doesn't need the deployment planning prompt. The execution agent doesn't need the full AppCat analysis. Narrower context per call means less VRAM pressure per inference pass, and more opportunity to right-size the model to the task.

Human gates as a cost control mechanism

Every phase transition in the system requires explicit human confirmation. Phase 1 to 2: do you want to assess these sites? Phase 3: do you agree with the recommendation? Phase 5: this will create billable Azure resources, type 'yes' to confirm.

I designed these gates primarily for safety reasons. But they also happen to be good cost hygiene. The agent doesn't speculatively run the next phase. It stops, presents what it found, and waits. If the user decides to adjust course, you haven't burned inference budget on downstream work that's about to be discarded.
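In code, such a gate is almost trivially simple, which is part of its appeal. A minimal sketch, with an injectable `ask` function for testability; the prompt wording is illustrative, not the pilot's actual text:

```python
def confirm_gate(summary, ask=input):
    """Present what the phase found, then block until the user decides.
    On anything but an explicit 'yes', nothing downstream runs -- so no
    inference budget is spent on work that's about to be discarded."""
    print(summary)
    return ask("Proceed? (yes/no): ").strip().lower() == "yes"

# Hypothetical usage at a phase boundary:
# if confirm_gate("Found 4 migratable sites. Assess them?"):
#     run_assessment_phase()
```

The gate turns what would be speculative downstream inference into an explicit, user-approved spend.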

Structured outputs for routine steps

A lot of the work in an agentic migration loop is structured output generation — the ARM template, the install.ps1 PowerShell script, the MigrationSettings.json. These are deterministic, schema-bound outputs. They don't require deep reasoning from a frontier model; they require reliable structured generation from a model that knows the schema well.

In practice, smaller, faster models finetuned for structured output can handle these steps with high reliability at a fraction of the cost of routing them through a general purpose frontier model. If you're building similar systems, this is worth exploring. The routing logic is simple: if the step is schema bound and the output space is well defined, it's a candidate for a smaller execution model.
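That routing logic really is a one-line predicate. A hedged sketch, where the step descriptor shape and the model names are placeholders, not a real API:

```python
def pick_model(step):
    """step is a hypothetical descriptor, e.g.
    {"schema_bound": True, "output_schema": "arm-template.schema.json"}."""
    if step.get("schema_bound") and step.get("output_schema"):
        return "small-structured-model"   # placeholder model name
    return "frontier-model"               # placeholder model name
```

Anything with a tight schema and a well-defined output space falls through to the cheap path by default; only genuinely open-ended steps reach the expensive model.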

Matching compute to workload — the tiered approach

Hardware matters here because not all steps in an agentic loop require the same GPU. It's helpful to view inference workloads in three tiers, each with its own model class and compute fit. Using a frontier model for every step is costly, like having your top engineer book meetings.

| Workload type | Model class | Compute fit |
| --- | --- | --- |
| Complex reasoning & synthesis | Frontier models | High-memory instances (e.g., Azure ND series); worth the cost for genuine reasoning tasks |
| Structured output & tool execution | Small / finetuned models | Dense, cost-efficient GPU instances, well matched to short, schema-bound contexts |
| Intent routing & classification | Micromodels or heuristics | Lightweight instances or CPU-class compute |

Mapping the IIS migration pilot against these tiers is instructive. The orchestrator deciding whether to route a site to Managed Instance vs standard App Service — weighing AppCat findings, readiness check results, confidence levels — is a genuine reasoning task. That belongs to tier one. Generating the ARM template for registry adapters is schema bound and deterministic; the output space is well defined. That's a tier two problem. Deciding which specialist subagent handles the next phase is closer to classification than reasoning — tier three territory.

In the pilot, all of this runs through a single model. That's fine for exploration. But if this were heading toward production at enterprise scale (hundreds of sites, repeated runs, concurrent users), collapsing all three tiers into one is where the cost pressure would accumulate.

The practical implication: before you commit to a model tier for your agentic system, it's worth mapping each step in your loop to one of these categories. The steps that genuinely require reasoning are probably fewer than you'd initially assume. The steps that are just structured generation or routing are candidates for smaller, faster, cheaper models, with no meaningful quality tradeoff if the schema is tight.
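That mapping exercise can be captured directly as configuration. The step names below mirror the pilot's phases as described in this post, and the tier assignments follow the discussion above; none of this is a shipped configuration, just an illustration of the technique.

```python
# Tier 1: genuine reasoning; tier 2: schema-bound generation; tier 3: routing.
TIER_OF_STEP = {
    "route-site-recommendation": 1,  # weigh AppCat findings, readiness, confidence
    "generate-arm-template": 2,      # deterministic, schema-bound output
    "generate-install-script": 2,
    "select-next-subagent": 3,       # closer to classification than reasoning
}

MODEL_FOR_TIER = {1: "frontier", 2: "small-finetuned", 3: "micro-or-heuristic"}

def model_for(step):
    """Resolve a loop step to its model class via the tier map."""
    return MODEL_FOR_TIER[TIER_OF_STEP[step]]
```

Writing the map out forces the question the text raises: how many steps genuinely land in tier one? In this sketch it's one of four.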

What this means for PMs building agentic products

My main takeaway from this work is that product decisions and infrastructure economics are tightly coupled in agentic AI in a way that isn't true for conventional software. Adding a new tool to an agent, expanding the context window, increasing phase granularity: each of these has a direct cost implication that's worth understanding before the decision is made, not after the bill arrives.

A few things I'd suggest thinking about early:

•        Track cost per completed task from the beginning, not just latency and quality. It's the metric that honestly reflects the unit economics of what you're building.

•        Design human checkpoints into multiphase workflows — not just for safety, but because they're natural cost control gates that prevent speculative inference on discarded work.

•        Be deliberate about which steps actually need a frontier model. Structured output generation and intent routing are strong candidates for smaller, cheaper models.

•        If you're on provisioned infrastructure, talk to your infra team about workload profiling before you're at scale. Memory utilization tells a different story than GPU utilization.
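The first of those suggestions, tracking cost per completed task, needs very little machinery. A minimal sketch; the prices and field names are illustrative placeholders, not any provider's actual rates:

```python
from dataclasses import dataclass, field

@dataclass
class TaskCostTracker:
    """Accumulates spend across every inference pass in one completed task."""
    input_price_per_1k: float    # assumed dollars per 1k input tokens
    output_price_per_1k: float   # assumed dollars per 1k output tokens
    calls: list = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.calls.append((input_tokens, output_tokens))

    def cost_per_task(self) -> float:
        return sum(i * self.input_price_per_1k / 1000 +
                   o * self.output_price_per_1k / 1000
                   for i, o in self.calls)

tracker = TaskCostTracker(input_price_per_1k=0.003, output_price_per_1k=0.015)
tracker.record(25_000, 2_000)   # e.g. the compounded context from earlier
tracker.record(8_000, 1_000)
```

Logging at task granularity rather than call granularity is what makes the unit economics visible: a task that looks cheap per call can still be expensive per completion.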

A broader note on the infrastructure landscape

Cloud providers are actively building infrastructure optimized for agentic workloads: better batching strategies, KV cache offloading, instance types purpose-built for inference at different context lengths. This space is moving fast, and the right infrastructure answer for your workload today may look different in a year.

What won't change is the underlying constraint: memory is the binding resource, not compute. Agentic systems designed with that in mind — narrow context per step, right-sized models per task, deliberate human gates — will have structurally better economics than ones that treat every inference call as equivalent.
