You can speed up long-context LLMs by caching and reusing key-value (KV) pairs in attention layers to avoid redundant computation over previous tokens.
The sketch below shows one way this can be implemented in PyTorch. The CachedSelfAttention class, single-head layout, and tensor shapes are illustrative assumptions, not any particular library's API:

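```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedSelfAttention(nn.Module):
    """Single-head self-attention with a simple KV cache (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # kv_cache holds (keys, values) from previous steps,
        # each of shape (batch, cached_tokens, d_model).
        self.kv_cache = None

    def forward(self, x: torch.Tensor, use_cache: bool = True) -> torch.Tensor:
        # x: (batch, new_tokens, d_model) -- only the tokens not yet processed.
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        if use_cache and self.kv_cache is not None:
            past_k, past_v = self.kv_cache
            # Concatenate cached and new KV pairs along the sequence dimension,
            # so attention covers the full context without re-projecting old tokens.
            k = torch.cat([past_k, k], dim=1)
            v = torch.cat([past_v, v], dim=1)

        if use_cache:
            # detach() keeps the cache out of the autograd graph, so backward
            # never traverses tensors stored from earlier forward passes.
            self.kv_cache = (k.detach(), v.detach())

        # Scaled dot-product attention over the concatenated context.
        # Causal masking is omitted for brevity; it is unnecessary when
        # decoding one token per step.
        scores = q @ k.transpose(-2, -1) / (self.d_model ** 0.5)
        attn = F.softmax(scores, dim=-1)
        return self.out_proj(attn @ v)
```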
The sketch relies on the following key points:

- kv_cache stores past key and value tensors so they are not recomputed at every decoding step.
- Attention is computed over the concatenation of cached and newly projected KV pairs, so each step only projects the new tokens.
- The cache is updated with detach(), so backpropagation never traverses tensors stored from earlier forward passes.
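
For illustration, a hypothetical incremental-decoding loop using the CachedSelfAttention sketch above might look like this; the shapes and random embeddings are placeholders for real model inputs:

```python
torch.manual_seed(0)
attn = CachedSelfAttention(d_model=64)

# Prefill: process the prompt once to populate the cache.
prompt = torch.randn(1, 16, 64)        # (batch, prompt_tokens, d_model)
_ = attn(prompt)

# Decode: each step feeds only the newest token; prior KV pairs come from the cache.
for _ in range(4):
    new_token = torch.randn(1, 1, 64)  # placeholder embedding for one new token
    out = attn(new_token)
    print(out.shape)                   # torch.Size([1, 1, 64])
```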
Hence, KV caching improves inference efficiency in long-context LLMs by eliminating repeated key and value computation for prior tokens at every decoding step.