# Cache Statistics Feature for llama.cpp

This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.

## Overview

The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:

- Understanding how the recurrent cache evolves during inference
- Debugging cache-related issues in hybrid models (attention + recurrent)
- Analyzing memory usage patterns
- Comparing cache behavior between different models

## Usage

### Command Line Option

Add the `--dump-cache` flag to any llama.cpp command to enable cache statistics printing:

```bash
./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
```

### Test Script

A convenient test script is provided:

```bash
./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
```

## Output Format

When enabled, cache statistics are printed after each token generation:

```
=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true

Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed

Layer 0:
  Conv State: [sum=N/A, mean=N/A] (shape=N/A)
  Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
  Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
  Value Cache: [sum=N/A, mean=N/A] (shape=N/A)

...

To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
```

## Implementation Details

### Files Modified

1. **tools/main/main.cpp**: Added the cache statistics printing functionality
2. **common/common.h**: Added a `dump_cache` parameter to the `common_params` struct
3. **common/arg.cpp**: Added `--dump-cache` command-line argument parsing (see the sketch below)
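
A minimal sketch of how the flag could be wired up, assuming the `common_arg`/`add_opt` pattern used in `common/arg.cpp` (exact signatures vary between llama.cpp revisions, so treat this as illustrative rather than definitive):

```cpp
// common/common.h: new field in common_params, default off
// bool dump_cache = false;

// common/arg.cpp: register --dump-cache so the parser sets the field
add_opt(common_arg(
    {"--dump-cache"},
    "print cache statistics after each generated token",
    [](common_params & params) {
        params.dump_cache = true;
    }
));
```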

### Key Functions

- `print_cache_statistics()`: Main function that prints the cache information
- Uses public llama.cpp APIs where available
- Provides a framework for accessing internal cache data (a sketch follows this list)
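
Here is a minimal sketch of what `print_cache_statistics()` could look like using only the public `llama.h` memory API (`llama_get_memory`, `llama_memory_seq_pos_min`/`_max`, `llama_memory_can_shift`); these names reflect recent llama.cpp headers and may differ in older checkouts:

```cpp
#include "llama.h"

#include <cstdio>

// Prints the header portion of the cache statistics using public APIs only.
// Per-layer tensor sums/means still require internal access (see Limitations).
static void print_cache_statistics(llama_context * ctx, const llama_model * model, int n_token) {
    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", n_token);
    printf("Model has %d layers\n", llama_model_n_layer(model));

    llama_memory_t mem = llama_get_memory(ctx);
    printf("Memory address: %p\n", (void *) mem);
    if (mem == NULL) {
        return; // no cache memory allocated (see Troubleshooting)
    }

    const llama_pos pos_min = llama_memory_seq_pos_min(mem, 0);
    const llama_pos pos_max = llama_memory_seq_pos_max(mem, 0);
    printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n",
           pos_min, pos_max, pos_max - pos_min + 1);
    printf("Memory supports shifting: %s\n",
           llama_memory_can_shift(mem) ? "true" : "false");
}
```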

### Limitations

The current implementation provides a framework for cache statistics, but the public API constrains what it can report:

1. **Tensor Data Access**: Cannot directly read tensor data (sum, mean) without internal APIs
2. **Layer Type Detection**: Cannot distinguish between attention and recurrent layers
3. **Cache Type Identification**: Limited ability to determine specific cache types

### Future Enhancements

Fully implementing cache statistics with actual tensor data would require:

1. **Internal API Access**: Friend-class access or new public APIs for cache internals
2. **Tensor Data Access**: Methods to read `ggml_tensor` data for calculations (see the sketch after this list)
3. **Layer Type Information**: APIs to determine layer types (attention vs. recurrent)
4. **Cache Statistics Methods**: Built-in methods for cache statistics calculation
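
If a recurrent-state tensor were exposed (for example, via `llama_memory_recurrent::get_r_l()` returning a `ggml_tensor *`), a statistics helper could look like the following sketch; it assumes the tensor is F32 and reachable through `ggml_backend_tensor_get`, so quantized caches would need dequantization first:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

#include <cstdio>
#include <vector>

// Copies a tensor to host memory and prints its sum, mean, and shape,
// matching the "[sum=..., mean=...] (shape=...)" output format above.
static void print_tensor_stats(const char * name, const struct ggml_tensor * t) {
    if (t == NULL || t->type != GGML_TYPE_F32) {
        printf("  %s: [sum=N/A, mean=N/A] (shape=N/A)\n", name);
        return;
    }

    const int64_t n = ggml_nelements(t);
    std::vector<float> data(n);
    ggml_backend_tensor_get(t, data.data(), 0, n * sizeof(float));

    double sum = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        sum += data[i];
    }

    printf("  %s: [sum=%.6f, mean=%.6f] (shape=[%lld, %lld, %lld, %lld])\n",
           name, sum, sum / (double) n,
           (long long) t->ne[0], (long long) t->ne[1],
           (long long) t->ne[2], (long long) t->ne[3]);
}
```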

## Comparison with Python Reference

The Python reference implementation in `reference/tests/cache_stats_qwen3_next.py` provides full access to:

- Convolution state tensors (`conv_states`)
- Recurrent state tensors (`recurrent_states`)
- Key/value cache tensors
- Actual sum and mean calculations

The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.

## Troubleshooting

### No Cache Statistics Visible

If cache statistics don't appear:

1. Ensure the `--dump-cache` flag is used
2. Check that the model supports cache operations
3. Verify that the model loaded correctly

### Memory Address Shows as Null

A null memory address means no memory is allocated for the cache, which could mean:

- The model doesn't support caching
- Memory allocation failed
- The model type is incorrect
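
As a hedged illustration (assuming the public `llama_get_memory` API referenced above), the dump code can guard against this case before reading any cache state:

```cpp
#include "llama.h"

#include <cstdio>

// Returns false (and warns) when no cache memory is allocated for the context,
// which is the situation that produces a null memory address in the dump.
static bool cache_memory_available(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);
    if (mem == NULL) {
        fprintf(stderr, "cache statistics: no memory allocated for this model\n");
        return false;
    }
    return true;
}
```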

## Development Notes

For developers wanting to extend this functionality:

1. **Internal Access**: The main limitation is accessing internal cache structures
2. **API Design**: Consider adding public APIs for cache statistics
3. **Performance**: Cache statistics printing should have minimal performance impact
4. **Thread Safety**: Ensure thread safety when accessing cache data

## Related Files

- `reference/tests/cache_stats_qwen3_next.py`: Python reference implementation
- `src/llama-memory-hybrid.h`: Hybrid memory structure definitions
- `src/llama-memory-recurrent.h`: Recurrent memory structure definitions
- `src/llama-kv-cache.h`: KV cache structure definitions