This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.
The cache statistics feature allows users to dump detailed information about the model's cache state after each generated token. This is particularly useful for debugging and verifying recurrent cache behavior during generation.
Add the --dump-cache flag to any llama.cpp command to enable cache statistics printing:
./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
A convenient test script is provided:
./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
When enabled, the cache statistics are printed after each token generation:
=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true
Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed
Layer 0:
Conv State: [sum=N/A, mean=N/A] (shape=N/A)
Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
Value Cache: [sum=N/A, mean=N/A] (shape=N/A)
...
To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
The following pieces were added:

- dump_cache parameter on the common_params struct
- --dump-cache command-line argument parsing
- print_cache_statistics(): main function that prints cache information

The current implementation provides a framework for cache statistics but has limitations due to public API constraints.
To fully implement cache statistics with actual tensor data, the internal API access listed above would be required.
The Python reference implementation in reference/tests/cache_stats_qwen3_next.py has full access to the model's cache tensors and computes the per-layer statistics directly. The C++ implementation aims to provide the same functionality once the necessary internal APIs are available.
If cache statistics don't appear, first verify that the --dump-cache flag is actually passed on the command line.

If the printed memory address is null, no memory is allocated for the cache; the model may not use the expected cache type, or the context may not have been initialized correctly.
For developers wanting to extend this functionality:
- reference/tests/cache_stats_qwen3_next.py: Python reference implementation
- src/llama-memory-hybrid.h: hybrid memory structure definitions
- src/llama-memory-recurrent.h: recurrent memory structure definitions
- src/llama-kv-cache.h: KV cache structure definitions