Pass-by-Reference: A Software Lesson the AI Frameworks Haven’t Picked Up

Most AI agent stacks carry a hidden cost that few teams account for. When a host model delegates work to a cheaper specialized one (summarization, extraction, classification, that sort of thing), it doesn’t just write a short delegation message; it has to compose and emit the entire content it wants the delegate to operate on. The host model, the one you’re paying expensive-tier output rates for, ends up acting as a literal courier, picking up the package at one end of the chain and dropping it off at the other, billed by the token for the privilege.

The Host as Courier

On the early FlashQuery site I used a more colloquial framing for this: the knowledge worker, I said back then, is constantly DoorDashing for the LLM, ferrying documents to it and ferrying its responses back. Once the LLM starts delegating to other LLMs, the host model ends up DoorDashing for the delegate too, on the knowledge worker’s tab. The metaphor was meant to be funny; the bill turns out to be real.

The waste is not subtle once you look at it. A typical delegation flow runs through four steps: the host reads the document into its own context, composes a delegation message with that content embedded inline, the delegate processes it, and the host receives a result (sometimes with a chunk of the source echoed back, depending on the workflow). The second step is the one I find most troubling. The host model is paying expensive-tier output costs to compose and emit thousands of tokens of content it is not reasoning about.

The math becomes embarrassing fast. A 50 KB document is roughly 15,000 tokens. At output rates of $15 to $30 per million tokens (Claude Sonnet 4.6 at $15, Opus 4.6 at $25, GPT-class models between $15 and $30, per the published rate cards for Anthropic and OpenAI), that’s roughly twenty-three to forty-five cents per delegation, paid for nothing but message composition. Multiplied across the thousands of delegations a busy agentic workflow runs in a month, the courier tax adds up to numbers most teams would prefer not to see on an invoice. (And cache-the-courier just got more expensive too; Anthropic shortened its default prompt-cache TTL from sixty minutes to five in early April 2026 and made the one-hour TTL a paid premium.)
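The arithmetic is easy to check. The sketch below is illustrative only: the function name and the volume of 5,000 delegations per month are assumptions made for the example, while the token count and the per-million rates are the ones quoted above.

```python
def courier_cost_usd(tokens: int, output_rate_per_million: float) -> float:
    """Cost for the host to re-emit `tokens` tokens at its own output rate."""
    return tokens * output_rate_per_million / 1_000_000


DOC_TOKENS = 15_000           # the estimate above for a ~50 KB document
MONTHLY_DELEGATIONS = 5_000   # hypothetical volume, chosen only for illustration

for rate in (15.0, 25.0, 30.0):  # USD per million output tokens
    per_call = courier_cost_usd(DOC_TOKENS, rate)
    monthly = per_call * MONTHLY_DELEGATIONS
    print(f"${rate:.0f}/M: ${per_call:.3f} per delegation, "
          f"${monthly:,.2f} per {MONTHLY_DELEGATIONS:,} delegations")
```

Swapping in your own token counts and monthly volume gives a quick estimate of what the courier step alone costs, before the delegate has done any work.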

The Fix, In One Picture

The mechanism is simple. The host writes a placeholder in the message; the orchestration layer scans for the placeholder syntax, resolves the reference against the vault, and replaces it with the resolved content before dispatching the call to the delegated model. The host language model only ever handles the placeholder and the response.
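A minimal sketch of that scan-resolve-replace step, under loud assumptions: the vault is modeled as an in-memory dictionary, the regular expression is one plausible reading of the placeholder syntax, and the names hydrate and hydrate_messages are hypothetical. None of this is FlashQuery's actual implementation.

```python
import re

# One plausible pattern for {{ref:some/path.md}}; an assumption, not the real grammar.
REF_PATTERN = re.compile(r"\{\{ref:([^}]+)\}\}")


def hydrate(text: str, vault: dict[str, str]) -> str:
    """Replace each {{ref:...}} placeholder with the vault content it names."""
    def resolve(match: re.Match) -> str:
        key = match.group(1).strip()
        if key not in vault:
            # Fail loudly rather than send an unresolved placeholder downstream.
            raise KeyError(f"unresolved reference: {key}")
        return vault[key]

    return REF_PATTERN.sub(resolve, text)


def hydrate_messages(messages: list[dict[str, str]],
                     vault: dict[str, str]) -> list[dict[str, str]]:
    """Return hydrated copies of the messages; the host's originals are untouched."""
    return [{**m, "content": hydrate(m["content"], vault)} for m in messages]


# Hypothetical vault content and host message, for illustration only.
vault = {"SomeFolder/SomeBigFile.md": "# Decisions\n\n(full document text)"}
host_messages = [{
    "role": "user",
    "content": "Summarize the key decisions in this document:\n\n"
               "{{ref:SomeFolder/SomeBigFile.md}}",
}]
delegate_messages = hydrate_messages(host_messages, vault)
```

The point of returning copies is that the host's view of the conversation keeps the short placeholder, while only the outgoing call to the delegate carries the full content.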

A concrete example helps. Suppose the host wants a summarization model to summarize a document that lives in the vault. The document I’ll use is real, and it’s recursive in a way I rather like: it’s the original research document for this very feature, sitting at SomeFolder/SomeBigFile.md in the FlashQuery product repository, roughly 179 kilobytes and around 45,000 tokens.

Before — pass-by-value

{
  "model": "summarizer",
  "messages": [
    {
      "role": "user",
      "content": "Summarize the key decisions in this document:\n\n[~45,000 tokens of document content inlined here]"
    }
  ]
}

The host LLM composes and emits all 45,000 tokens of the document at its own output rate, plus the framing prose. At Sonnet 4.6’s $15 per million, that’s about sixty-seven cents paid by the host model to do nothing but courier the document across to the delegate. At Opus 4.6’s $25 per million, the same single delegation runs about a dollar and thirteen cents.

After — pass-by-reference

{
  "resolver": "purpose",
  "name": "summarization",
  "messages": [
    {
      "role": "user",
      "content": "Summarize the key decisions in this document:\n\n{{ref:SomeFolder/SomeBigFile.md}}"
    }
  ]
}

The host composes and emits roughly 25 tokens. FlashQuery hydrates the reference into the delegated model’s call only; the host never sees the content. That’s an approximately 1,800x reduction in what the host has to compose and emit for a single delegation. The delegated model’s own cost is unchanged, because that work was always going to happen at the delegate; what changed is whether the host paid expensive-tier output rates to courier the content there in the first place.

Selective context is also supported, which matters when the delegate doesn’t need the whole document either: a reference like {{ref:SomeFolder/SomeBigFile.md#Open Questions}} injects just the named section, identified by heading. A third variant, the pointer dereference ({{ref:SomeFolder/SomeBigFile.md->projections.summary}}), follows a named pointer declared in the source document’s frontmatter and resolves to whatever target it names. The most useful pattern under that primitive is what we’ve taken to calling a projection: a precomputed, token-light derivation of a source document (a summary, an extracted list, a structured table for a specific downstream skill) that the skill reads instead of the full source, with the savings compounding across every invocation. As best I can tell from a search of the field, nobody else has named this exact pattern in the LLM/agent context yet.
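To make the three reference forms concrete, here is a rough sketch of how a resolver might tell them apart and pull out a named section. Everything in it is an assumption for illustration: the Ref type, parse_ref, and extract_section are hypothetical names, headings are assumed to be Markdown-style lines starting with "#", and the frontmatter lookup needed for pointer dereferences is left out entirely.

```python
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    path: str
    section: Optional[str] = None   # text after '#', a heading name
    pointer: Optional[str] = None   # text after '->', a frontmatter pointer name


def parse_ref(body: str) -> Ref:
    """Split the text inside {{ref:...}} into a path plus an optional selector."""
    if "->" in body:
        path, pointer = body.split("->", 1)
        return Ref(path.strip(), pointer=pointer.strip())
    if "#" in body:
        path, section = body.split("#", 1)
        return Ref(path.strip(), section=section.strip())
    return Ref(body.strip())


def extract_section(markdown: str, heading: str) -> str:
    """Return the named heading and its body, up to the next heading of the
    same or a higher level. Naive: it does not skip '#' lines in code fences."""
    collected: list[str] = []
    level: Optional[int] = None
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            hashes = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[hashes:].strip()
            if level is None and title == heading:
                level = hashes          # found the start of the section
                collected.append(line)
                continue
            if level is not None and hashes <= level:
                break                   # reached a sibling or parent heading
        if level is not None:
            collected.append(line)
    if level is None:
        raise KeyError(f"heading not found: {heading}")
    return "\n".join(collected).strip()


# Hypothetical document, for illustration only.
doc = "# Title\n\nintro\n\n## Open Questions\n\n- q1\n\n## Next\n\nother"
ref = parse_ref("SomeFolder/SomeBigFile.md#Open Questions")
section_text = extract_section(doc, ref.section)
```

A pointer reference such as SomeFolder/SomeBigFile.md->projections.summary would parse into a path and a pointer name in the same way; resolving it would then mean reading the source document's frontmatter and following whatever target the pointer declares.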

Pass-by-Reference, in Plain Terms

If you’ve spent time in software, this is just pass-by-value vs. pass-by-reference, a question the field settled decades ago in favor of references for anything of meaningful size. The mutation-safety concern that drove a lot of the original debate doesn’t really apply at the LLM orchestration layer either, because the delegated model is reading the content, not changing it. So the remaining question is just cost, and the cost question gets answered the same way it always has: send the pointer, not the bytes.

The underlying language models can handle references just fine; the frameworks above them mostly haven’t been built to provide any. The lesson exists; it just hasn’t made the jump.

Why Nobody Has Built This End-to-End

The fix is really two parts, and each part has been pursued at length by various corners of the AI tooling ecosystem; what has been missing is the combination, integrated end-to-end.

The first part is matching the model to the task: use cheaper specialized models for the work they’re good at and reserve expensive reasoning models for actual reasoning. This is settled wisdom, and tooling exists in many shapes (LiteLLM, Portkey, and Bifrost as provider gateways; LangChain and Semantic Kernel as templating layers; the various RAG frameworks).

The second part is the one most of the ecosystem has been ignoring: deliver the work to that cheaper model without paying the host’s expensive-tier delivery cost. Most of the savings from the first part get wiped out by the cost of the second if you don’t address both.


Even MCP, which has both reference and inline-content primitives in its specification, only flows them server-to-host rather than host-to-delegated-model, which is the opposite direction from what this problem requires. The pieces are there; nobody has stitched them together.

Counterarguments Worth Addressing

If you’ve been reading and nodding while also drafting objections in your head, that’s the right response. Two are worth pre-empting because a careful reader will reach for one of them first.

MCP ResourceLink and EmbeddedResource. This is the most informed counter, and the one that took me longest to think through. The MCP specification already defines both a pointer type and an inline-content type for the envelope returned by tool calls. Both, however, flow from the MCP server back to the host as part of a tool result. The host then decides whether to put any of that content into its own context. The pattern this article is about flows the opposite direction: the host writes a placeholder, and a gateway resolves it into the call going to the delegated model, with the host’s context untouched. Same primitive shapes, opposite directionality, opposite economic effect.

Anthropic’s Code Execution with MCP. Anthropic’s November 2025 post on code execution with MCP reports a startling 98.7% token reduction by handing filesystem paths between an agent and its sandboxed tools instead of serializing data through the model. This isn’t really a competitor to by-reference hydration; it’s the closest external validation I’ve found that the host-as-courier tax is a measurable, real problem worth solving. Their pattern handles agent-to-tools indirection through a filesystem. The pattern here handles host-to-delegated-model indirection through reference strings in a message. Same economic insight, different layer; I’d think of them as a complementary pair.

Where This Leaves Us

Andrej Karpathy summed up where AI engineering had ended up after two years of agentic-tooling experimentation when he observed that “deciding how you organize your context layer is one of the single most important things you can do in 2026.” That’s the right diagnosis, in my view, and the writers who have built on top of it (LangChain’s context-engineering taxonomy, Dex Horthy’s 12-Factor Agents, Jeremy Daly’s work on context engineering as a commercial discipline) have done valuable work explaining the discipline.

What I find missing, still, is a primitive that implements the isolate strategy in LangChain’s taxonomy at the message-composition layer: a way to write a short reference in a host prompt and have the routing infrastructure materialize the content into the delegated call only. FlashQuery is, among other things, what that primitive looks like as actual infrastructure rather than as a discipline you have to remember to apply: the host writes a reference, FlashQuery hydrates it, the delegated model receives the content, and the discipline becomes the default.

FlashQuery is open source, and the by-reference design lives in the repository alongside the rest of the specification. If any of this resonates with how you’d like to be orchestrating your own document-heavy workflows, the code is there to read, fork, or extend, and contributions are welcome.


Matt Genovese

Founder at FlashQuery, working to give enterprises sovereign control over their AI control plane.
