SRAM Tiling and KV Cache Optimization for Local LLMs on Next-Gen NPUs

By Rizowan Ahmed (@riz1raj)
Senior Technology Analyst | Covering Enterprise IT, Hardware & Emerging Trends

Let’s stop pretending that running local Large Language Models (LLMs) on edge hardware is a FLOPs problem. It is not. Modern systems-on-chip (SoCs) ship NPUs rated at tens of TOPS. Instead, we are hitting a physical, uncompromising wall: memory bandwidth. While your NPU’s compute engines sit idle waiting for data, your LPDDR5X memory bus runs saturated, draining the battery and capping throughput.

The primary culprit is the Key-Value (KV) cache. Its size scales linearly with context length, so as context windows expand it quickly balloons into gigabytes. Re-reading this cache from system DRAM for every generated token saturates the memory bus long before the compute units are busy. To achieve acceptable tokens-per-second on consumer hardware, you must keep the active portion of the KV cache on-chip. But NPU Static Random-Access Memory (SRAM) is strictly limited on modern silicon, so you cannot fit the entire cache at once. You must tile it.

This guide provides an architectural blueprint for optimizing KV cache tiling in NPU SRAM for local LLMs, focusing on the intersection of hardware-level scratchpad management, asynchronous Direct Memory Access (DMA) transfers, and eviction policies for speculative decoding.

The Physics of the Memory Wall: SRAM vs. DRAM

To understand why tiling is non-negotiable, we must look at the hardware realities of modern edge NPUs (such as Apple’s M-series Neural Engine, Qualcomm’s Hexagon NPU, or Intel’s Lunar Lake NPU architectures).

  • SRAM Bandwidth: Extremely high, with access latencies of a few clock cycles, directly adjacent to the tensor execution cores.
  • DRAM Bandwidth: Significantly lower than SRAM, with access latencies on the order of 100 nanoseconds.
  • SRAM Capacity: Highly constrained due to silicon real estate and leakage current.
  • DRAM Capacity: Typically 8GB to 24GB on standard edge devices.

Every time your NPU has to fetch a KV tensor from DRAM instead of SRAM during the autoregressive generation phase, the execution pipeline stalls. To prevent these stalls, the goal is simple: maximize SRAM residency time for active KV blocks and overlap DRAM-to-SRAM transfers with active tensor computation.
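
A rough way to see the bandwidth ceiling: during decode, each generated token has to stream the model weights and the live KV cache across the memory bus once, so tokens per second cannot exceed bandwidth divided by bytes moved per token. The sketch below encodes that bound. The function name and the example figures (64 GB/s of DRAM bandwidth, 4 GB of weights, up to 4 GB of KV cache) are illustrative assumptions, not measurements of any specific device.

```python
def decode_tokens_per_second_bound(dram_gb_per_s: float,
                                   weights_gb: float,
                                   kv_cache_gb: float) -> float:
    """Upper bound on decode speed when every token re-reads weights and KV from DRAM.

    This ignores compute time entirely, so real throughput will be lower.
    """
    gb_moved_per_token = weights_gb + kv_cache_gb
    if gb_moved_per_token <= 0:
        raise ValueError("at least some data must be read per token")
    return dram_gb_per_s / gb_moved_per_token


# Hypothetical device: 64 GB/s bus, 4 GB of quantized weights.
short_context = decode_tokens_per_second_bound(64.0, 4.0, 0.0)
long_context = decode_tokens_per_second_bound(64.0, 4.0, 4.0)
print(f"bound with empty KV cache: {short_context:.1f} tokens/s")
print(f"bound with 4 GB KV cache:  {long_context:.1f} tokens/s")
```

Under these assumed numbers, a KV cache as large as the weights halves the ceiling, which is why any KV traffic served from SRAM instead of DRAM translates directly into headroom.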

Anatomy of NPU SRAM Tiling

Standard software-level paging, such as vLLM’s PagedAttention, works well for server-grade GPUs with virtual memory systems. However, edge NPUs often lack complex Memory Management Units (MMUs) for dynamic virtual-to-physical address translation. Instead, they rely on software-managed scratchpads where the compiler or runtime must explicitly schedule data movement.

1. Block-Structured KV Tiling

To optimize for the NPU's vector execution units, partition the sequence dimension of the KV cache into fixed-size tiles (typically 64 or 128 tokens). A single tile holds the Key (or Value) projections for that token block across all attention heads of one layer; every layer has its own Key tile and Value tile per block.

Mathematically, for a tile size $B$ (block size), number of heads $H$, and head dimension $D$, the size of a single Key or Value tile in bytes (using FP16 precision) is:

$$\text{Tile Size} = B \times H \times D \times 2 \text{ bytes}$$

For a model with 32 heads and a head dimension of 128, a block size of 64 yields 64 × 32 × 128 × 2 = 524,288 bytes (512 KB) for a single Key or Value tile in one layer. Keys and Values together therefore take 1 MB per layer, and across 32 layers the KV cache for just those 64 tokens is 32 MB. This immediately highlights the problem: the entire KV cache for even a modest sequence length cannot reside in SRAM simultaneously.
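
The arithmetic above is easy to script when you want to try other block sizes or precisions. This is a minimal sketch; the function and variable names are made up for illustration, and KB/MB follow the article's usage of 1024-based units.

```python
def kv_tile_bytes(block_tokens: int, num_heads: int, head_dim: int,
                  bytes_per_elem: float = 2.0) -> int:
    """Bytes for ONE Key tile (or one Value tile) of a single layer."""
    return int(block_tokens * num_heads * head_dim * bytes_per_elem)


B, H, D, LAYERS = 64, 32, 128, 32

key_tile = kv_tile_bytes(B, H, D)    # FP16 Key tile, one layer
kv_per_layer = 2 * key_tile          # Key tile + Value tile
kv_all_layers = kv_per_layer * LAYERS

print(f"Key or Value tile: {key_tile // 1024} KB")
print(f"K+V per layer:     {kv_per_layer // 1024} KB")
print(f"K+V, all layers:   {kv_all_layers // (1024 * 1024)} MB")
```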

2. Double-Buffering and Asynchronous DMA

To solve the capacity constraint, you must implement a double-buffering scheme within the SRAM scratchpad. While the NPU's tensor cores are computing the attention scores for Tile $N$, the NPU’s dedicated DMA engine must asynchronously fetch Tile $N+1$ from DRAM into a secondary SRAM buffer.

This requires dividing your allocated SRAM KV cache pool into two distinct partitions:

  • Active Buffer (Ping): Holds the KV tiles currently being read by the attention execution kernels.
  • Shadow Buffer (Pong): Receives the next sequential KV tiles from DRAM via non-blocking DMA signals.

When the attention kernel finishes processing the Ping buffer, the pointers swap. If the DMA transfer time for a tile is less than or equal to its computation time, every fetch after the first is hidden behind compute and the DRAM latency disappears from the critical path.
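
The ping-pong schedule can be sketched in plain Python. Here `fetch_tile` and `compute_tile` are hypothetical callbacks standing in for a DMA transfer and an attention kernel; on real hardware the fetch would be issued asynchronously, whereas this sketch runs it inline and only models the buffer swapping. The two timing helpers are a simple analytical model under the assumption that every tile has the same compute and DMA cost.

```python
from typing import Any, Callable, List


def run_double_buffered(num_tiles: int,
                        fetch_tile: Callable[[int], Any],
                        compute_tile: Callable[[Any], Any]) -> List[Any]:
    """Process tiles in order using two buffers that swap roles each step."""
    if num_tiles <= 0:
        return []
    buffers: List[Any] = [None, None]
    active = 0
    buffers[active] = fetch_tile(0)  # prologue: the first fetch cannot be hidden
    outputs = []
    for n in range(num_tiles):
        shadow = 1 - active
        if n + 1 < num_tiles:
            # Stand-in for issuing a non-blocking DMA into the shadow buffer.
            buffers[shadow] = fetch_tile(n + 1)
        outputs.append(compute_tile(buffers[active]))
        active = shadow  # pointer swap: Pong becomes Ping
    return outputs


def pipelined_time(num_tiles: int, compute_time: float, dma_time: float) -> float:
    """Total time when each fetch overlaps the previous tile's compute."""
    if num_tiles <= 0:
        return 0.0
    return dma_time + (num_tiles - 1) * max(compute_time, dma_time) + compute_time


def serial_time(num_tiles: int, compute_time: float, dma_time: float) -> float:
    """Total time when every fetch blocks before its compute."""
    return num_tiles * (compute_time + dma_time)


tiles = [[1, 2], [3, 4], [5, 6]]
sums = run_double_buffered(len(tiles), lambda n: tiles[n], sum)
print(sums)
print(serial_time(4, 3.0, 2.0), pipelined_time(4, 3.0, 2.0))
```

When `dma_time` exceeds `compute_time`, the model shows the pipeline becoming DMA-bound: each step then costs one transfer rather than one compute, which is the case quantization and larger tiles are meant to avoid.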

Step-by-Step: How to Optimize KV Cache Tiling in NPU SRAM for Local LLMs

Implementing this optimization pipeline requires modifying both the model's execution graph and the runtime memory allocator.

Step 1: Implement Quantized Block-KV Storage (FP4/INT4)

Do not store your KV cache in FP16. Quantizing it to 4-bit (with a per-block scale and bias) cuts the memory footprint and bandwidth requirements by roughly 4x, and the resulting precision loss can be kept small with appropriate calibration.

With INT4/FP4 quantization, our 512 KB tile shrinks to 128 KB, plus a small overhead for the per-block scale and bias. This dramatically increases the number of tiles we can hold in SRAM and reduces the pressure on the DMA engine to complete its transfers before the next attention compute cycle.
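
To make "per-block scale and bias" concrete, here is a minimal pure-Python sketch of asymmetric 4-bit quantization with two codes packed per byte. It is one simple scheme chosen for illustration (min/max range, round-to-nearest), not a calibrated production quantizer, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_block_int4(values: Sequence[float]) -> Tuple[bytes, float, float]:
    """Quantize one block to 4-bit codes; returns (packed bytes, scale, bias)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 16 levels; fall back to 1.0 for a constant block
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up into whole bytes
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_block_int4(packed: bytes, scale: float, bias: float,
                          count: int) -> List[float]:
    """Unpack 4-bit codes and map them back to approximate float values."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble was stored first
        codes.append(byte >> 4)
    return [c * scale + bias for c in codes[:count]]


block = [i / 10 for i in range(-8, 8)]  # 16 sample values
packed, scale, bias = quantize_block_int4(block)
restored = dequantize_block_int4(packed, scale, bias, len(block))
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(len(packed), "bytes for", len(block), "values; max error", max_error)
```

With round-to-nearest, the reconstruction error of any element is bounded by half the scale, so smaller blocks (tighter min/max ranges) trade extra metadata for lower error.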

Step 2: Align Tile Sizes to NPU Vector Engine Widths

NPUs utilize highly parallel vector and matrix engines. Your tile block size must be a multiple of the NPU’s vector width. For example, if the NPU processes 512-bit vector registers, your block size should be aligned to 16-element boundaries for 32-bit datatypes, 64-element boundaries for 8-bit datatypes, or 128-element boundaries for 4-bit datatypes to prevent partial-register masking overhead.
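
The alignment rule reduces to rounding the block size up to a multiple of the lane count. A small sketch, with hypothetical helper names and the 512-bit register width taken from the example above:

```python
def lanes_per_register(register_bits: int, element_bits: int) -> int:
    """Number of elements that fill one vector register exactly."""
    if element_bits <= 0 or register_bits % element_bits:
        raise ValueError("register width must be a multiple of the element width")
    return register_bits // element_bits


def align_block_size(block_size: int, register_bits: int, element_bits: int) -> int:
    """Round block_size up to the next multiple of the register's lane count."""
    lanes = lanes_per_register(register_bits, element_bits)
    return -(-block_size // lanes) * lanes  # ceiling division, then scale back up


for bits in (32, 16, 8, 4):
    print(f"{bits}-bit elements: {lanes_per_register(512, bits)} lanes, "
          f"block of 100 aligns to {align_block_size(100, 512, bits)}")
```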

Step 3: Map the Attention Kernel to Scratchpad Memory

Instead of relying on generic L2 cache allocation, use the NPU SDK's low-level APIs (such as Qualcomm’s QNN APIs or Apple’s Metal Performance Shaders Graph), where they expose this level of control, to explicitly bind the KV tiles to the SRAM scratchpad address space. The goal is to bypass the hardware's automatic cache eviction policies, which know nothing about attention access patterns and often evict critical KV blocks prematurely.
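
Vendor binding calls differ per SDK and are not reproduced here. What stays the same is the planning step that precedes them: deciding fixed, aligned offsets for each region of the scratchpad. The sketch below is a hypothetical static planner; the region names, the 1 MB capacity, and the 64-byte alignment are assumptions for illustration, and the resulting offsets are what you would hand to whatever binding API your platform provides.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    name: str
    offset: int
    size: int


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


def plan_scratchpad(capacity: int, requests: Sequence[Tuple[str, int]],
                    alignment: int = 64) -> List[Region]:
    """Assign each (name, size) request a fixed, aligned offset, in order."""
    regions: List[Region] = []
    cursor = 0
    for name, size in requests:
        offset = align_up(cursor, alignment)
        if offset + size > capacity:
            raise MemoryError(f"region '{name}' does not fit in the scratchpad")
        regions.append(Region(name, offset, size))
        cursor = offset + size
    return regions


TILE = 128 * 1024  # one INT4 tile from Step 1
layout = plan_scratchpad(
    capacity=1024 * 1024,  # assumed scratchpad budget for the KV pool
    requests=[("ping", TILE), ("pong", TILE), ("pinned", 2 * TILE)],
)
for region in layout:
    print(f"{region.name:>6}: offset {region.offset:>7}, size {region.size}")
```

Because the layout is computed once and never changes at runtime, there is nothing for a generic eviction policy to second-guess.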

Speculative Decoding and Eviction Policies

On-device speculative decoding is a key method for accelerating local LLMs. It uses a small, fast draft model to generate a sequence of candidate tokens, which are then validated in parallel by a larger target model in a single forward pass.

This paradigm complicates memory management considerably. You must now manage two distinct KV caches in SRAM simultaneously, one for the draft model and one for the target model. If the target model rejects draft tokens, the corresponding speculative KV entries must be evicted immediately to avoid polluting the cache. Handling this rollback cleanly is the central problem when combining SRAM tiling and KV cache eviction with on-device speculative decoding.
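
One straightforward way to handle rejection is to treat speculative entries as an uncommitted tail that can be truncated in one step. The class below is a minimal sketch of that idea; the class and method names are hypothetical, and the entries are opaque placeholders rather than real tensors.

```python
from typing import Any, List


class SpeculativeKVCache:
    """KV entries split into a committed prefix and a speculative tail."""

    def __init__(self) -> None:
        self._entries: List[Any] = []
        self._committed = 0

    def __len__(self) -> int:
        return len(self._entries)

    @property
    def committed(self) -> int:
        return self._committed

    def append_draft(self, kv_entry: Any) -> None:
        """Store the KV entry for one draft token; it stays speculative."""
        self._entries.append(kv_entry)

    def resolve(self, num_accepted: int) -> int:
        """Commit the first num_accepted draft entries, drop the rest.

        Returns the number of speculative entries that were evicted.
        """
        pending = len(self._entries) - self._committed
        if not 0 <= num_accepted <= pending:
            raise ValueError("accepted count exceeds pending draft tokens")
        self._committed += num_accepted
        del self._entries[self._committed:]  # rollback: truncate the rejected tail
        return pending - num_accepted


cache = SpeculativeKVCache()
for token in ("t0", "t1", "t2", "t3"):
    cache.append_draft(token)
evicted = cache.resolve(num_accepted=2)  # target model accepted only t0 and t1
print(len(cache), cache.committed, evicted)
```

Since rejected entries always sit at the end of the sequence, rollback never leaves holes in the middle of a tile; at worst the last tile becomes partially filled again.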

The Heavy Hitter Oracle (H2O) Eviction Strategy

When the context length grows too large, even tiled SRAM double-buffering cannot keep up with the sheer volume of data. You must evict tokens from the cache entirely. However, naive FIFO (First-In, First-Out) or LFU (Least Frequently Used) eviction algorithms can degrade model perplexity.

Instead, implement a hardware-friendly approximation of the Heavy Hitter Oracle (H2O) algorithm:

  • Identify Attention Anchors: Certain tokens (like the first token, punctuation, or highly attended-to nouns) accumulate the vast majority of attention weights. These are "Heavy Hitters."
  • Dynamic SRAM Pinning: Keep these Heavy Hitter KV tiles pinned in a dedicated section of the SRAM for as long as they retain that status. While pinned, they are never evicted to DRAM.
  • Transient Eviction: Evict low-attention tiles to DRAM or discard them entirely. Because attention is sparse, discarding low-attention tiles results in minimal increase in model perplexity, while freeing up critical SRAM tile slots.
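
The three bullets above can be sketched as a tile-level selection routine. This is an approximation in the spirit of H2O, not the published algorithm: it scores whole tiles rather than individual tokens, and the function names, the budget, and the recent-window parameter are assumptions made for illustration.

```python
from typing import List, Sequence


def accumulate_tile_scores(running: Sequence[float],
                           step_weights: Sequence[float]) -> List[float]:
    """Add one decode step's per-tile attention mass to the running totals."""
    if len(running) != len(step_weights):
        raise ValueError("score vectors must cover the same tiles")
    return [r + w for r, w in zip(running, step_weights)]


def select_tiles_to_keep(scores: Sequence[float], budget: int,
                         recent_window: int) -> List[int]:
    """Keep the newest tiles plus the highest-scoring older tiles.

    scores[i] is the accumulated attention mass of tile i (oldest first).
    Returns the indices of tiles that stay resident, in ascending order.
    """
    if not 0 <= recent_window <= budget:
        raise ValueError("recent_window must be between 0 and budget")
    n = len(scores)
    if n <= budget:
        return list(range(n))
    recent_start = n - recent_window
    # Heavy hitters: the best-scoring tiles outside the recent window.
    heavy = sorted(range(recent_start), key=lambda i: scores[i],
                   reverse=True)[:budget - recent_window]
    return sorted(heavy) + list(range(recent_start, n))


scores = [9.0, 0.1, 0.2, 5.0, 0.3, 0.4, 0.1, 0.2]
keep = select_tiles_to_keep(scores, budget=4, recent_window=2)
evict = [i for i in range(len(scores)) if i not in keep]
print("pinned/resident tiles:", keep)
print("evicted tiles:", evict)
```

Keeping a small recent window alongside the heavy hitters matters because newly written tiles have had no time to accumulate attention mass and would otherwise be evicted immediately.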

Draft-Target Co-Allocation

To minimize SRAM fragmentation during speculative decoding, allocate a shared contiguous block of SRAM for both the draft and target models. Because the draft model runs sequentially and the target model runs in validation bursts, you can reuse the same physical SRAM tiles for their intermediate activation tensors, swapping only the active KV pointers between the draft and target validation steps.
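
A shared pool of fixed-size tile slots is one way to realize this co-allocation: because every slot has the same size, a slot released by the draft model can be handed to the target model without fragmentation. The sketch below is hypothetical (class name, owner labels, and slot count are all assumptions) and tracks only slot ownership, not the data in the slots.

```python
from typing import Dict, List


class SharedTilePool:
    """Fixed-size SRAM tile slots shared by the draft and target models."""

    def __init__(self, num_slots: int) -> None:
        # Reverse order so pop() hands out the lowest-numbered slot first.
        self._free: List[int] = list(range(num_slots - 1, -1, -1))
        self._owned: Dict[str, List[int]] = {"draft": [], "target": []}

    def acquire(self, owner: str) -> int:
        """Give one free slot to 'draft' or 'target' and return its index."""
        if not self._free:
            raise MemoryError("no free SRAM tile slots")
        slot = self._free.pop()
        self._owned[owner].append(slot)
        return slot

    def release_all(self, owner: str) -> None:
        """Return every slot held by owner to the shared free list."""
        slots = self._owned[owner]
        self._free.extend(sorted(slots, reverse=True))
        slots.clear()

    def owned(self, owner: str) -> List[int]:
        return list(self._owned[owner])

    def free_count(self) -> int:
        return len(self._free)


pool = SharedTilePool(num_slots=4)
pool.acquire("draft")
pool.acquire("draft")
pool.acquire("target")
pool.release_all("draft")         # draft phase finished; its slots are reusable
reused = pool.acquire("target")   # target validation burst picks up a freed slot
print(pool.owned("target"), pool.free_count(), reused)
```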

Comparing Modern NPU Architectures

Optimizing your tiling code requires understanding the subtle differences between the major edge silicon architectures currently in use:

| NPU Platform | Memory Type | Best Tiling Strategy |
| --- | --- | --- |
| Apple M-Series | Unified System Level Cache (SLC) | Explicit Metal Device-to-Device memory copy with threadgroup memory tiling. |
| Qualcomm Hexagon | Dedicated Scratchpad Memory | Asynchronous DMA via QNN HTP (Hexagon Tensor Processor) APIs with block quantization. |
| Intel Lunar Lake (NPU 4) | On-die SRAM / Cache | OpenVINO dynamic block-allocation with sub-tensor tiling and activation memory reuse. |

Future Outlook: Hardware-Enforced Paging

As we look toward the next generation of edge silicon, relying solely on software-level scratchpad management is becoming unsustainable for developers. Silicon vendors are expected to introduce hardware-enforced, LLM-aware paging units directly into NPU memory controllers. These units will automatically handle KV cache tiling, quantization, and H2O-style eviction at the hardware level, presenting a unified, virtualized flat memory space to the developer.

Until those architectures land, however, the burden of optimization falls squarely on the software stack. By implementing structured tiling, asynchronous double-buffering, and intelligent speculative eviction, you can hide most of the memory wall, turning what would be a sluggish, DRAM-bound local model into a highly efficient, SRAM-native edge model.