Expensively Quadratic: the LLM Agent Cost Curve


Pop quiz: at what point in a coding agent’s context length are cache reads
costing you half of the next API call? By 50,000 tokens, your conversation’s
costs are probably being dominated by cache reads.

Let’s take a step back. We’ve previously written about how coding agents work:
they post the conversation thus far to the LLM, and continue doing that in
a loop as long as the LLM is requesting tool calls. When there are no
more tools to run, the loop waits for user input, and the whole cycle starts
over. Visually:

Or, in code form:

def loop(llm):
    msg = user_input()                 # start with the user's request
    while True:
        output, tool_calls = llm(msg)  # post the conversation so far
        print("Agent: ", output)
        if tool_calls:
            # run each requested tool; the results become the next message
            msg = [handle_tool_call(tc)
                   for tc in tool_calls]
        else:
            msg = user_input()         # nothing left to run: wait for the user

The LLM providers charge for input tokens, cache writes, output tokens, and cache reads.
It’s a little tricky: you mark a point in your prompt (usually the end) up to
which the provider should cache, and those tokens are billed as a “cache write”
rather than as input. The previous turn’s output becomes the next turn’s cache
write. Visually:

Here, the colors and numbers indicate the costs making up the nth call to the
LLM. Every subsequent call reads the story so far from the cache, writes the
previous call’s output to the cache (as well as any new input), and gets an
output. The area represents the cost, though in this diagram, it’s not quite
drawn to scale. Add up all the rectangles, and that’s the total cost.
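
In code, the cost of a single call under this scheme looks roughly like the
sketch below. The rate names are mine and the actual numbers show up later in
the post; treat it as a sketch of the billing model, not any provider’s exact
API.

# Rates are in dollars per million tokens.
def call_cost(cached, new_input, prev_output, output, rates):
    # Everything older is read back from the cache, the previous call's
    # output plus any new input is written to the cache, and fresh output
    # is billed at the full output rate.
    return (cached * rates["cache_read"]
            + (prev_output + new_input) * rates["cache_write"]
            + output * rates["output"]) / 1e6

Summing call_cost over every call gives the total, and the
cached * rates["cache_read"] term is the one that grows with every call.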

That triangle emerging for cache reads? That’s the scary quadratic!

How scary is the quadratic? Pretty squarey! I took a rather ho-hum feature
implementation conversation, and visualized it like the diagram above. The area
corresponds to cost: the width of every rectangle is the number of tokens and
the height is the cost per token. As the conversation evolves, more and more
of the cost is the long thin lines across the bottom that correspond to cache
reads.

The whole conversation cost about $12.93 in total. You can see that as the
conversation continues, the cache reads dominate. At the end of the
conversation, cache reads are 87% of the total cost. They were half the cost at
27,500 tokens!

This conversation is just one example. Does this happen generally? exe.dev’s
LLM gateway keeps track of the costs we’re incurring. We don’t store the
messages themselves as they pass through, but we do keep track of token counts.
The following graph shows the “cumulative cost” visualization for many Shelley
conversations, not just my own; I randomly sampled 250 conversations from the
data.

The x-axis is the context length, and the y-axis is the cumulative cost up to
that point. The left graph is all the costs and the right graph is just the
cache reads. You can mouse over to find a given conversation on both graphs.
The box plots below show the distribution of input tokens and output tokens.

The cost curves are all different because every conversation is different. Some
conversations write a lot of code, so spend more money on expensive output
tokens. Some conversations read lots of the code base, so spend money on tool
call outputs, which look like cache writes. Some conversations waited for the
user long enough that the cache expired, and so had to re-write data to the
cache. In our
data, the median input was about 285 tokens and the median output was about
100, but the distribution is pretty wide.

Let’s look at how some conversations got to 100,000 tokens. We sampled from the
same data set, excluding conversations with fewer than 20 LLM calls and
conversations that never reached 100,000 tokens.
The number of LLM calls in the conversation matters quite a bit. The cache read
cost isn’t really the number of tokens squared; it’s the number of tokens times
the number of calls, and different conversations have very different numbers of
LLM calls!
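
To put rough, purely illustrative numbers on that: if a conversation reaches a
final context of C tokens over n roughly evenly spaced calls, the cumulative
cached tokens it reads back form a triangular sum of about n * C / 2. At the
cache-read rate quoted below, reaching the same length in more calls costs
proportionally more:

# Illustrative arithmetic, not measured data. read_rate is dollars per
# million cached tokens (the x/10 = $0.50 figure from the next paragraph).
def approx_cache_read_cost(final_context, n_calls, read_rate=0.50):
    tokens_read = n_calls * final_context / 2    # triangular sum, roughly
    return tokens_read / 1e6 * read_rate

approx_cache_read_cost(100_000, 50)     # ~2.5M tokens read back, ~$1.25
approx_cache_read_cost(100_000, 200)    # ~10M tokens read back, ~$5.00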

To go back to our original question, we can build a little simulator.
Anthropic’s rates are x for input, 1.25x for cache write, 5x for output,
and x/10 for cache read, where x = $5 per million tokens for Opus 4.5. In
the default settings of the simulator, it only takes 20,000 tokens to get to
the point where cache reads dominate.
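
Here is a rough stand-in for that simulator, assuming the billing model
sketched earlier, with the per-call input and output sizes defaulting to the
medians from our data. The crossover rule is cumulative cache reads passing
half of the cumulative cost; the function and variable names are mine.

X = 5.00  # dollars per million input tokens, Opus 4.5
RATES = {"cache_write": 1.25 * X, "output": 5 * X, "cache_read": X / 10}

def crossover(input_per_call=285, output_per_call=100, max_context=200_000):
    cached = 0           # tokens already sitting in the cache
    prev_output = 0      # previous output, written to the cache next call
    read_cost = total_cost = 0.0
    while cached < max_context:
        write_tokens = prev_output + input_per_call
        read = cached * RATES["cache_read"] / 1e6
        cost = (read
                + write_tokens * RATES["cache_write"] / 1e6
                + output_per_call * RATES["output"] / 1e6)
        read_cost += read
        total_cost += cost
        cached += write_tokens
        prev_output = output_per_call
        if read_cost >= total_cost / 2:
            return cached    # context length where cache reads hit 50%
    return None

print(crossover())

With these defaults the crossover lands right around the 20,000-token figure
above; bigger per-call inputs or outputs push it later, and chattier
conversations with tiny turns hit it sooner.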

As a coding agent developer and as an agent loop user, this cost structure gives
me a lot to think about!

One metaphor for this is “dead reckoning.” If we let the agent navigate a long
task without feedback (in the form of tool calls and lots of back and forth),
it will be cheaper, but, on the other hand, we know that it’s the feedback that
lets the agent find the right destination. All things being equal, fewer LLM
calls are cheaper, but is the agent’s internal compass off, sending it in the
wrong direction?

Some coding agents (Shelley included!) refuse to return a large tool output
back to the agent once it exceeds some size threshold. This is a mistake: the
agent is going to read the whole file anyway, and it may as well do it in one
call rather than five.

Subagents and tools that themselves call out to LLMs are a way of doing iteration
outside of the main context window. Shelley uses a “keyword search” tool, for example,
as an LLM-assisted grep.
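
The function and parameter names below are made up for illustration, not
Shelley’s actual implementation; the point is just the shape. The sub-call
pays for its own short-lived tokens, and the main loop only ever caches the
few lines it returns.

from pathlib import Path

def keyword_search(query, repo, small_llm):
    # small_llm is any callable that takes a prompt string and returns text.
    hits = [p for p in Path(repo).rglob("*.py")
            if query.lower() in p.read_text(errors="ignore").lower()]
    prompt = ("Which of these files are most relevant to "
              f"'{query}', and why?\n" + "\n".join(str(p) for p in hits[:20]))
    return small_llm(prompt)   # a few sentences, not twenty files of context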

Starting conversations over might feel like it loses too much context, but
the tokens spent on re-establishing context are very likely cheaper than
the tokens spent on continuing the conversation, and often the effect
will be the same. It always feels wasteful to start a new conversation,
but then I remember that I start new conversations from my git repo
all the time when I start on a new task; why is continuing an existing
task any different?

Are cost management, context management, and agent orchestration all really
the same problem? Is work like Recursive Language Models the right approach?

These issues are very much on our minds as we work on exe.dev and
Shelley. Let us know what you think!


