
Cache Statistics Feature for llama.cpp

This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.

Overview

The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:

  • Understanding how the recurrent cache evolves during inference
  • Debugging cache-related issues in hybrid models (attention + recurrent)
  • Analyzing memory usage patterns
  • Comparing cache behavior between different models

Usage

Command Line Option

Add the --dump-cache flag to llama-cli (the tool where the feature is wired up) to enable cache statistics printing:

./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache

Test Script

A convenient test script is provided:

./test_cache_stats.sh /path/to/model.gguf "Your prompt here"

Output Format

When enabled, the cache statistics are printed after each token generation:

=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true

Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed

Layer 0:
  Conv State: [sum=N/A, mean=N/A] (shape=N/A)
  Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
  Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
  Value Cache: [sum=N/A, mean=N/A] (shape=N/A)

...

To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
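For illustration, once a cache tensor is reachable, the sum/mean fields above could be filled in with plain ggml calls. The helper below is a sketch, not existing llama.cpp code: print_tensor_stats is a hypothetical name, it assumes an F32 tensor, and it reads the data back into host memory with ggml_backend_tensor_get(), since cache tensors typically live in backend buffers (possibly on the GPU).

#include <cinttypes>
#include <cstdio>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper -- not part of llama.cpp. Computes and prints the
// sum/mean of a cache tensor, matching the output format shown above.
static void print_tensor_stats(const char * name, const struct ggml_tensor * t) {
    if (t == NULL || t->type != GGML_TYPE_F32) {
        printf("  %s: [sum=N/A, mean=N/A] (shape=N/A)\n", name);
        return;
    }

    const int64_t n = ggml_nelements(t);

    // cache tensors normally live in a backend buffer, so copy to host first
    std::vector<float> data(n);
    ggml_backend_tensor_get(t, data.data(), 0, n*sizeof(float));

    double sum = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        sum += data[i];
    }

    printf("  %s: [sum=%.6f, mean=%.6f] (shape=%" PRId64 " x %" PRId64 " x %" PRId64 " x %" PRId64 ")\n",
           name, sum, sum/(double)n, t->ne[0], t->ne[1], t->ne[2], t->ne[3]);
}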

Implementation Details

Files Modified

  1. tools/main/main.cpp: Added cache statistics printing functionality
  2. common/common.h: Added dump_cache parameter to common_params struct
  3. common/arg.cpp: Added --dump-cache command line argument parsing (see the sketch after this list)
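As a sketch of items 2 and 3, the new field and flag follow the existing conventions in those files; the help string wording here is illustrative:

// common/common.h -- new field in common_params
bool dump_cache = false; // print cache statistics after each generated token

// common/arg.cpp -- registration, following the existing add_opt pattern
add_opt(common_arg(
    {"--dump-cache"},
    "print cache statistics after each generated token (default: false)",
    [](common_params & params) {
        params.dump_cache = true;
    }
));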

Key Functions

  • print_cache_statistics(): Main function that prints cache information
  • Uses public llama.cpp APIs where available
  • Provides a framework for accessing internal cache data (a sketch follows)
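A condensed sketch of the printer is shown below. It uses only public llama.cpp APIs (llama_get_memory() and the llama_memory_* accessors from llama.h); the exact body in tools/main/main.cpp may differ:

#include <cstdio>

#include "llama.h"

// Sketch of the statistics printer built on public APIs only.
static void print_cache_statistics(llama_context * ctx, const llama_model * model, int n_token) {
    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", n_token);
    printf("Model has %d layers\n", llama_model_n_layer(model));

    llama_memory_t mem = llama_get_memory(ctx);
    printf("Memory address: %p\n", (void *) mem);
    if (mem == NULL) {
        return; // see Troubleshooting below
    }

    // position range tracked for sequence 0
    const llama_pos p0 = llama_memory_seq_pos_min(mem, 0);
    const llama_pos p1 = llama_memory_seq_pos_max(mem, 0);
    printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n", p0, p1, p1 - p0 + 1);

    printf("Memory supports shifting: %s\n", llama_memory_can_shift(mem) ? "true" : "false");

    // per-layer conv/recurrent/KV statistics would be printed here
    // once the internal APIs listed under Limitations are available
}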

Limitations

The current implementation provides a framework for cache statistics, but public API constraints impose several limitations:

  1. Tensor Data Access: Cannot directly access tensor data (sum, mean) without internal APIs
  2. Layer Type Detection: Cannot distinguish between attention and recurrent layers
  3. Cache Type Identification: Limited ability to determine specific cache types

Future Enhancements

To fully implement cache statistics with actual tensor data, the following would be needed:

  1. Internal API Access: Friend class access or new public APIs for cache internals
  2. Tensor Data Access: Methods to access ggml_tensor data for calculations
  3. Layer Type Information: APIs to determine layer types (attention vs recurrent)
  4. Cache Statistics Methods: Built-in methods for cache statistics calculation (one possible shape is sketched after this list)
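One possible shape for item 4, shown purely as a proposal (none of these declarations exist in llama.h today):

// Hypothetical additions to llama.h -- not an existing API.
enum llama_cache_tensor_type {
    LLAMA_CACHE_TENSOR_CONV,      // convolution state (recurrent layers)
    LLAMA_CACHE_TENSOR_RECURRENT, // recurrent state   (recurrent layers)
    LLAMA_CACHE_TENSOR_KEY,       // key cache         (attention layers)
    LLAMA_CACHE_TENSOR_VALUE,     // value cache       (attention layers)
};

struct llama_cache_tensor_stats {
    double  sum;   // sum over all elements
    double  mean;  // mean over all elements
    int64_t ne[4]; // tensor shape
};

// returns false if layer il has no tensor of the requested type
bool llama_memory_get_stats(
        llama_memory_t                    mem,
        int32_t                           il,
        enum llama_cache_tensor_type      type,
        struct llama_cache_tensor_stats * stats);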

Comparison with Python Reference

The Python reference implementation in reference/tests/cache_stats_qwen3_next.py provides full access to:

  • Convolution state tensors (conv_states)
  • Recurrent state tensors (recurrent_states)
  • Key/value cache tensors
  • Actual sum and mean calculations

The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.

Troubleshooting

No Cache Statistics Visible

If cache statistics don't appear:

  1. Ensure the --dump-cache flag is passed
  2. Check that the model supports cache operations
  3. Verify the model is loaded correctly

Memory Address Shows as Null

This indicates that no memory is allocated for the cache, which could mean one of the following (a quick probe is sketched after this list):

  • Model doesn't support caching
  • Memory allocation failed
  • Incorrect model type
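To narrow down which case applies, a probe right after context creation can help; this is a sketch reusing the public accessor from the printer shown earlier:

// immediately after creating the context
llama_memory_t mem = llama_get_memory(ctx);
if (mem == NULL) {
    fprintf(stderr, "warning: context has no memory module; "
                    "cache statistics will be empty\n");
}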

Development Notes

For developers wanting to extend this functionality:

  1. Internal Access: The main limitation is accessing internal cache structures
  2. API Design: Consider adding public APIs for cache statistics
  3. Performance: Cache statistics printing should have minimal performance impact
  4. Thread Safety: Ensure thread safety when accessing cache data

Related Files

  • reference/tests/cache_stats_qwen3_next.py: Python reference implementation
  • src/llama-memory-hybrid.h: Hybrid memory structure definitions
  • src/llama-memory-recurrent.h: Recurrent memory structure definitions
  • src/llama-kv-cache.h: KV cache structure definitions