# Cache Statistics Feature for llama.cpp

This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.

## Overview

The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:

- Understanding how the recurrent cache evolves during inference
- Debugging cache-related issues in hybrid models (attention + recurrent)
- Analyzing memory usage patterns
- Comparing cache behavior between different models

## Usage

### Command Line Option

Add the `--dump-cache` flag to any llama.cpp command to enable cache statistics printing:

```bash
./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
```

### Test Script

A convenient test script is provided:

```bash
./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
```

## Output Format

When enabled, cache statistics are printed after each token generation:

```
=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true

Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed

Layer 0:
  Conv State: [sum=N/A, mean=N/A] (shape=N/A)
  Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
  Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
  Value Cache: [sum=N/A, mean=N/A] (shape=N/A)

...

To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
```

## Implementation Details

### Files Modified

1. **tools/main/main.cpp**: Added the cache statistics printing functionality
2. **common/common.h**: Added a `dump_cache` parameter to the `common_params` struct
3. **common/arg.cpp**: Added `--dump-cache` command-line argument parsing (see the sketch below)
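
A minimal sketch of how the flag could be wired up, assuming the `common_arg`/`add_opt` pattern used in `common/arg.cpp` (exact signatures vary between llama.cpp revisions, so treat this as illustrative rather than definitive):

```cpp
// common/common.h: new field in common_params, default off
// bool dump_cache = false;

// common/arg.cpp: register --dump-cache so the parser sets the field
add_opt(common_arg(
    {"--dump-cache"},
    "print cache statistics after each generated token",
    [](common_params & params) {
        params.dump_cache = true;
    }
));
```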

### Key Functions

- `print_cache_statistics()`: Main function that prints the cache information
- Uses public llama.cpp APIs where available
- Provides a framework for accessing internal cache data (a sketch follows this list)
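
Here is a minimal sketch of what `print_cache_statistics()` could look like using only the public `llama.h` memory API (`llama_get_memory`, `llama_memory_seq_pos_min`/`_max`, `llama_memory_can_shift`); these names reflect recent llama.cpp headers and may differ in older checkouts:

```cpp
#include "llama.h"

#include <cstdio>

// Prints the header portion of the cache statistics using public APIs only.
// Per-layer tensor sums/means still require internal access (see Limitations).
static void print_cache_statistics(llama_context * ctx, const llama_model * model, int n_token) {
    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", n_token);
    printf("Model has %d layers\n", llama_model_n_layer(model));

    llama_memory_t mem = llama_get_memory(ctx);
    printf("Memory address: %p\n", (void *) mem);
    if (mem == NULL) {
        return; // no cache memory allocated (see Troubleshooting)
    }

    const llama_pos pos_min = llama_memory_seq_pos_min(mem, 0);
    const llama_pos pos_max = llama_memory_seq_pos_max(mem, 0);
    printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n",
           pos_min, pos_max, pos_max - pos_min + 1);
    printf("Memory supports shifting: %s\n",
           llama_memory_can_shift(mem) ? "true" : "false");
}
```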

### Limitations

The current implementation provides a framework for cache statistics, but the public API constrains what it can report:

1. **Tensor Data Access**: Cannot directly read tensor data (sum, mean) without internal APIs
2. **Layer Type Detection**: Cannot distinguish between attention and recurrent layers
3. **Cache Type Identification**: Limited ability to determine specific cache types

### Future Enhancements

Fully implementing cache statistics with actual tensor data would require:

1. **Internal API Access**: Friend-class access or new public APIs for cache internals
2. **Tensor Data Access**: Methods to read `ggml_tensor` data for calculations (see the sketch after this list)
3. **Layer Type Information**: APIs to determine layer types (attention vs. recurrent)
4. **Cache Statistics Methods**: Built-in methods for cache statistics calculation
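
If a recurrent-state tensor were exposed (for example, via `llama_memory_recurrent::get_r_l()` returning a `ggml_tensor *`), a statistics helper could look like the following sketch; it assumes the tensor is F32 and reachable through `ggml_backend_tensor_get`, so quantized caches would need dequantization first:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

#include <cstdio>
#include <vector>

// Copies a tensor to host memory and prints its sum, mean, and shape,
// matching the "[sum=..., mean=...] (shape=...)" output format above.
static void print_tensor_stats(const char * name, const struct ggml_tensor * t) {
    if (t == NULL || t->type != GGML_TYPE_F32) {
        printf("  %s: [sum=N/A, mean=N/A] (shape=N/A)\n", name);
        return;
    }

    const int64_t n = ggml_nelements(t);
    std::vector<float> data(n);
    ggml_backend_tensor_get(t, data.data(), 0, n * sizeof(float));

    double sum = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        sum += data[i];
    }

    printf("  %s: [sum=%.6f, mean=%.6f] (shape=[%lld, %lld, %lld, %lld])\n",
           name, sum, sum / (double) n,
           (long long) t->ne[0], (long long) t->ne[1],
           (long long) t->ne[2], (long long) t->ne[3]);
}
```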

## Comparison with Python Reference

The Python reference implementation in `reference/tests/cache_stats_qwen3_next.py` provides full access to:

- Convolution state tensors (`conv_states`)
- Recurrent state tensors (`recurrent_states`)
- Key/value cache tensors
- Actual sum and mean calculations

The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.

## Troubleshooting

### No Cache Statistics Visible

If cache statistics don't appear:

1. Ensure the `--dump-cache` flag is used
2. Check that the model supports cache operations
3. Verify that the model loaded correctly

### Memory Address Shows as Null

A null memory address means no memory is allocated for the cache, which could mean:

- The model doesn't support caching
- Memory allocation failed
- The model type is incorrect
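
As a hedged illustration (assuming the public `llama_get_memory` API referenced above), the dump code can guard against this case before reading any cache state:

```cpp
#include "llama.h"

#include <cstdio>

// Returns false (and warns) when no cache memory is allocated for the context,
// which is the situation that produces a null memory address in the dump.
static bool cache_memory_available(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);
    if (mem == NULL) {
        fprintf(stderr, "cache statistics: no memory allocated for this model\n");
        return false;
    }
    return true;
}
```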

## Development Notes

For developers wanting to extend this functionality:

1. **Internal Access**: The main limitation is accessing internal cache structures
2. **API Design**: Consider adding public APIs for cache statistics
3. **Performance**: Cache statistics printing should have minimal performance impact
4. **Thread Safety**: Ensure thread safety when accessing cache data

## Related Files

- `reference/tests/cache_stats_qwen3_next.py`: Python reference implementation
- `src/llama-memory-hybrid.h`: Hybrid memory structure definitions
- `src/llama-memory-recurrent.h`: Recurrent memory structure definitions
- `src/llama-kv-cache.h`: KV cache structure definitions