1 million token context: The good, the bad and the ugly


The following table looks at the inference performance of:
  • Meta’s Llama 4 Maverick model
    • 400 billion parameters
    • FP8 weights
    • Maximum context window of 1,000,000 tokens

  • Running on an NVIDIA DGX B200 server
    • 8x NVIDIA B200 GPUs
    • 1.4TB of HBM3E

  • Single user
  • 1,000 token output

We’re using a single user to keep things simple, but multiple concurrent users would increase the load and KV cache size considerably.

We see that small contexts are handled easily, with prefill times under a second up to 10,000 tokens and beyond. At the maximum context length, however, the prefill time (the time to first token) exceeds 2 minutes before the LLM can start generating any output.

If we multiply this out to 10 users who each have a context of 250,000 tokens, the prefill time is over 30 seconds and the total KV cache size is 39 GB.
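To see how quickly these numbers grow, here is a rough back-of-the-envelope sketch in Python. It simply scales the per-token KV footprint implied by the figures above (about 39 GB for 2.5 million cached tokens); it is not derived from the actual Llama 4 Maverick layer and head configuration, so treat the results as estimates.

```python
# Back-of-the-envelope KV cache sizing. The per-token footprint is inferred
# from the figures above (about 39 GB for 10 users x 250,000 tokens); it is
# not derived from the actual Llama 4 Maverick layer/head configuration.
GB = 1e9

per_token_kv_bytes = 39 * GB / (10 * 250_000)   # ~15.6 KB of KV data per token (FP8)

def kv_cache_gb(num_users: int, context_tokens: int) -> float:
    """Approximate total KV cache size in GB across concurrent users."""
    return num_users * context_tokens * per_token_kv_bytes / GB

print(f"1 user,   1M-token context: {kv_cache_gb(1, 1_000_000):.1f} GB")   # ~15.6 GB
print(f"10 users, 250K tokens each: {kv_cache_gb(10, 250_000):.1f} GB")    # ~39.0 GB
print(f"50 users, 100K tokens each: {kv_cache_gb(50, 100_000):.1f} GB")    # ~78.0 GB
```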

This is the ugly side of large contexts: computing the KV values for a long context dominates the time to first token and drags down the user experience. Unless we can improve prefill time, long contexts will not be appropriate for interactive use cases.

Why should we reuse KV caches?

If we go back to my original example of a long-context use case, AI coding assistants, we expect significant reuse of context across successive queries. As the coding assistant generates methods on a class, the base class is part of the context for every query.

Instead of generating the KV cache for every query, we should generate it once and then reuse it across successive queries, as sketched below. The problem with this approach is the size of the KV cache: even with optimizations in the most recent LLMs, KV caches can quickly consume all of the available memory in the AI system.
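Conceptually, that reuse looks something like the minimal sketch below: cached KV state is keyed by a hash of the shared token prefix, so a follow-up query that starts with the same prefix (the same base class and file header, say) only pays prefill cost for the new suffix. This is an illustration of the idea, not how any particular inference engine implements it; compute_kv and extend_kv are hypothetical stand-ins for the real prefill kernels.

```python
import hashlib
from typing import Dict, List, Tuple

# Placeholder for real KV tensors; in an engine this would be blocks of GPU memory.
KVState = List[int]

def compute_kv(tokens: List[int]) -> KVState:
    """Hypothetical stand-in for the prefill kernel that builds KV state."""
    return list(tokens)

def extend_kv(prefix_kv: KVState, suffix: List[int]) -> KVState:
    """Hypothetical stand-in for extending existing KV state with a new suffix."""
    return prefix_kv + list(suffix)

kv_cache: Dict[str, KVState] = {}

def prefix_key(tokens: List[int]) -> str:
    """Stable key for a token prefix (e.g., the shared base-class context)."""
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def prefill_with_reuse(prompt: List[int], shared_prefix_len: int) -> Tuple[KVState, int]:
    """Reuse cached KV for the shared prefix; only prefill the new suffix."""
    prefix, suffix = prompt[:shared_prefix_len], prompt[shared_prefix_len:]
    key = prefix_key(prefix)
    if key not in kv_cache:
        kv_cache[key] = compute_kv(prefix)   # pay the full prefix cost once
    # Later queries with the same prefix only prefill the (much shorter) suffix.
    return extend_kv(kv_cache[key], suffix), len(suffix)
```

Real engines typically manage the cache in fixed-size blocks (pages) rather than whole prefixes, which also lets partially overlapping contexts share KV data.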

Offloading to the rescue!

The NVIDIA Dynamo library has a KV cache manager that migrates KV caches from GPU memory to other available devices. It also implements KV reuse techniques that turn the KV cache into a KV pool shared across multiple sessions and users. The 1,000,000-token KV cache (~15 GB) we have been discussing, once moved out of HBM, is just a small fraction of a much larger KV pool.
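The sketch below shows the general shape of such a tiered KV pool: entries stay in GPU memory while they are hot, and the coldest ones are demoted to CPU memory and then to NVMe as space runs out. This is a simplified illustration of the offload idea, not the Dynamo API; the class and method names are invented for the example, and a real implementation would track KV blocks, bytes and transfer bandwidth far more carefully.

```python
from collections import OrderedDict

class TieredKVPool:
    """Toy tiered KV pool: GPU -> CPU -> NVMe, with LRU demotion (illustrative only)."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.tiers = {
            "gpu": OrderedDict(),   # fastest, smallest
            "cpu": OrderedDict(),   # ~1 TB/s aggregate to the GPUs in this DGX example
            "nvme": OrderedDict(),  # ~112 GB/s aggregate across 8 Gen5 drives
        }
        # Capacities counted in entries for simplicity; NVMe treated as unbounded here.
        self.capacity = {"gpu": gpu_capacity, "cpu": cpu_capacity}

    def put(self, key: str, kv_blocks) -> None:
        """Insert (or re-insert) a context's KV blocks at the hottest tier."""
        self.tiers["gpu"][key] = kv_blocks
        self._demote("gpu", "cpu")
        self._demote("cpu", "nvme")

    def get(self, key: str):
        """Promote a reused context back to GPU memory instead of recomputing it."""
        for tier in ("gpu", "cpu", "nvme"):
            if key in self.tiers[tier]:
                kv_blocks = self.tiers[tier].pop(key)
                self.put(key, kv_blocks)
                return kv_blocks
        return None  # not cached anywhere: fall back to a full prefill

    def _demote(self, src: str, dst: str) -> None:
        """Move least-recently-used entries down one tier when capacity is exceeded."""
        while len(self.tiers[src]) > self.capacity.get(src, float("inf")):
            cold_key, cold_blocks = self.tiers[src].popitem(last=False)
            self.tiers[dst][cold_key] = cold_blocks

# Toy usage: a reused session context is promoted back to the GPU, not recomputed.
pool = TieredKVPool(gpu_capacity=2, cpu_capacity=4)
pool.put("session-A:prefix", kv_blocks=["..."])
reused = pool.get("session-A:prefix")
```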

The first stop is system memory. The DGX example supports 4TB of system memory with an aggregate bandwidth of 1 TB/s to the GPUs. With this much bandwidth, loading the KV cache for 1,000,000 tokens from CPU memory to GPU memory would, in the ideal case, take only 15 ms, compared to the more than 2 minutes needed to recompute it, as we saw earlier.

This reduction in time is great for the user! After initial computation of the context, reloading from CPU memory is extremely fast and enables the interactive user experience.

CPU system memory still faces the same problem as GPU memory — limited capacity.

The next stop is the local NVMe drives in the DGX server. With eight PCIe Gen5 NVMe drives (such as the high-performance Micron 9550), this tier of KV cache offload can provide 30TB to 245TB of capacity, depending on drive capacity, with an aggregate bandwidth of 112 GB/s.

While the storage layer is significantly slower than system memory, that 1,000,000-token KV cache will still only take 140 ms to load.
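Putting the two offload tiers side by side, the arithmetic is straightforward: divide the roughly 15 GB KV cache of a 1,000,000-token context by each tier's aggregate bandwidth. The snippet below reproduces the ideal-case figures quoted above; real transfers will add some overhead.

```python
# Ideal-case reload time for the ~15 GB KV cache of a 1,000,000-token context.
kv_cache_gb = 15.0

tiers_gb_per_s = {
    "CPU system memory (1 TB/s to the GPUs)": 1000.0,
    "8x PCIe Gen5 NVMe (112 GB/s aggregate)": 112.0,
}

for tier, bandwidth in tiers_gb_per_s.items():
    print(f"{tier}: ~{kv_cache_gb / bandwidth * 1000:.0f} ms to reload")

# Both are far below the 2+ minutes it takes to recompute the prefill from scratch.
```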

Using these offload and migration techniques improves the user experience by reducing the time to generate the first token. And it’s also great for total cost of ownership (TCO) since the GPUs spend more of their time generating output instead of redoing work they’ve done before.

High-performance storage enables long context inference

The generative AI landscape has shifted considerably in the past couple of years. Where storage was previously an afterthought, providing high-performance storage for an AI factory will be critical to enabling intelligent AI with a positive user experience.

At FMS last year and at GTC this spring, Micron showcased the performance of our upcoming PCIe Gen6 NVMe drive. These developments around long contexts and KV cache management show that having the fastest NVMe flash connected to the latest GPUs will be key to successfully deploying AI.

 


