Piotr Wilkin 3 months ago
parent
commit aa8d6a21a3
2 changed files with 0 additions and 172 deletions
  1. CACHE_STATS_README.md: +0 −135
  2. test_cache_stats.sh: +0 −37

+ 0 - 135
CACHE_STATS_README.md

@@ -1,135 +0,0 @@
-# Cache Statistics Feature for llama.cpp
-
-This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.
-
-## Overview
-
-The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:
-
-- Understanding how the recurrent cache evolves during inference
-- Debugging cache-related issues in hybrid models (attention + recurrent)
-- Analyzing memory usage patterns
-- Comparing cache behavior between different models
-
-## Usage
-
-### Command Line Option
-
-Add the `--dump-cache` flag to any llama.cpp command to enable cache statistics printing:
-
-```bash
-./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
-```
-
-### Test Script
-
-A convenient test script is provided:
-
-```bash
-./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
-```
-
-## Output Format
-
-When enabled, the cache statistics are printed after each token generation:
-
-```
-=== CACHE STATISTICS FOR TOKEN 1 ===
-Model has 32 layers
-Memory address: 0x555555555555
-Sequence 0: pos_min=0, pos_max=5, length=6
-Memory supports shifting: true
-
-Layer-by-layer cache information:
-Note: Detailed tensor statistics require internal API access
-This framework shows where conv/state/recurrent cache data would be displayed
-
-Layer 0:
-  Conv State: [sum=N/A, mean=N/A] (shape=N/A)
-  Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
-  Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
-  Value Cache: [sum=N/A, mean=N/A] (shape=N/A)
-
-...
-
-To access actual cache statistics, the following would be needed:
-1. Internal API access to llama_memory_hybrid::get_mem_recr()
-2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
-3. Access to llama_kv_cache tensors for attention layers
-4. ggml_tensor data access for sum/mean calculations
-=============================================
-```
-
-## Implementation Details
-
-### Files Modified
-
-1. **tools/main/main.cpp**: Added cache statistics printing functionality
-2. **common/common.h**: Added `dump_cache` parameter to the `common_params` struct
-3. **common/arg.cpp**: Added `--dump-cache` command line argument parsing (both sketched below)
-
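-Items 2 and 3 amount to standard flag plumbing. The excerpt-style sketch below shows the general shape, using the `common_arg` registration helper that `common/arg.cpp` uses for its other options; the placement and help text are illustrative, not copied from the patch:
-
-```cpp
-// common/common.h -- sketch: a single boolean in common_params
-struct common_params {
-    // ... existing fields ...
-    bool dump_cache = false; // print cache statistics after each generated token
-};
-
-// common/arg.cpp -- sketch: register the flag with the usual add_opt helper
-add_opt(common_arg(
-    {"--dump-cache"},
-    "print cache statistics after each generated token (default: disabled)",
-    [](common_params & params) {
-        params.dump_cache = true;
-    }
-));
-```
-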
-### Key Functions
-
-- `print_cache_statistics()`: Main function that prints cache information (sketched below)
-- Uses public llama.cpp APIs where available
-- Provides a framework for accessing internal cache data
-
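-The sketch below shows roughly how such a function can be assembled from the public `llama_memory_*` accessors in `llama.h`. It reproduces the metadata portion of the output format above; the per-layer tensor statistics are out of reach this way, as discussed under Limitations. Treat it as a minimal illustration, not the actual code in `tools/main/main.cpp`:
-
-```cpp
-// Minimal sketch: cache metadata via public llama.cpp APIs only.
-#include "llama.h"
-#include <cstdio>
-
-static void print_cache_statistics(llama_context * ctx, const llama_model * model, int n_token) {
-    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", n_token);
-    printf("Model has %d layers\n", llama_model_n_layer(model));
-
-    llama_memory_t mem = llama_get_memory(ctx);
-    printf("Memory address: %p\n", (void *) mem);
-
-    // pos_min/pos_max return -1 when sequence 0 holds no cells
-    const llama_pos p0 = llama_memory_seq_pos_min(mem, 0);
-    const llama_pos p1 = llama_memory_seq_pos_max(mem, 0);
-    if (p0 >= 0) {
-        printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n", p0, p1, p1 - p0 + 1);
-    }
-
-    printf("Memory supports shifting: %s\n", llama_memory_can_shift(mem) ? "true" : "false");
-}
-```
-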
-### Limitations
-
-The current implementation provides a framework for cache statistics, but it has limitations imposed by the public API:
-
-1. **Tensor Data Access**: Cannot directly access tensor data (sum, mean) without internal APIs
-2. **Layer Type Detection**: Cannot distinguish between attention and recurrent layers
-3. **Cache Type Identification**: Limited ability to determine specific cache types
-
-### Future Enhancements
-
-To fully implement cache statistics with actual tensor data, the following would be needed:
-
-1. **Internal API Access**: Friend class access or new public APIs for cache internals
-2. **Tensor Data Access**: Methods to access ggml_tensor data for calculations (see the sketch after this list)
-3. **Layer Type Information**: APIs to determine layer types (attention vs recurrent)
-4. **Cache Statistics Methods**: Built-in methods for cache statistics calculation
-
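-Assuming point 2 were solved, the statistics themselves are simple reductions. The sketch below computes sum and mean for an F32 tensor with the existing `ggml_backend_tensor_get` copy-out API; the hypothetical part is obtaining the `ggml_tensor` handle at all, since no public llama.cpp API exposes cache tensors today:
-
-```cpp
-// Sketch: sum/mean over a cache tensor, assuming the handle came from
-// some future internal accessor. Only F32 is handled here; real K/V
-// caches are often F16 or quantized and would need dequantizing first.
-#include "ggml.h"
-#include "ggml-backend.h"
-#include <cstdio>
-#include <vector>
-
-static void print_tensor_stats(const char * name, const struct ggml_tensor * t) {
-    if (t == nullptr || t->type != GGML_TYPE_F32) {
-        printf("  %s: [sum=N/A, mean=N/A] (shape=N/A)\n", name);
-        return;
-    }
-    // Cache tensors live in backend buffers (possibly on a GPU), so copy to host
-    const int64_t n = ggml_nelements(t);
-    std::vector<float> host((size_t) n);
-    ggml_backend_tensor_get(t, host.data(), 0, (size_t) n * sizeof(float));
-
-    double sum = 0.0;
-    for (int64_t i = 0; i < n; ++i) {
-        sum += host[(size_t) i];
-    }
-    printf("  %s: [sum=%.4f, mean=%.6f] (shape=%lld x %lld)\n",
-           name, sum, sum / (double) n,
-           (long long) t->ne[0], (long long) t->ne[1]);
-}
-```
-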
-## Comparison with Python Reference
-
-The Python reference implementation in `reference/tests/cache_stats_qwen3_next.py` provides full access to:
-
-- Convolution state tensors (conv_states)
-- Recurrent state tensors (recurrent_states)  
-- Key/value cache tensors
-- Actual sum and mean calculations
-
-The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.
-
-## Troubleshooting
-
-### No Cache Statistics Visible
-
-If cache statistics don't appear:
-1. Ensure the `--dump-cache` flag is passed
-2. Check that the model supports cache operations
-3. Verify that the model loaded correctly
-
-### Memory Address Shows as Null
-
-This indicates that no memory is allocated for the cache, which can mean:
-- The model doesn't support caching
-- Memory allocation failed
-- The model type is incorrect
-
-## Development Notes
-
-For developers wanting to extend this functionality:
-
-1. **Internal Access**: The main limitation is accessing internal cache structures
-2. **API Design**: Consider adding public APIs for cache statistics (one hypothetical shape is sketched below)
-3. **Performance**: Cache statistics printing should have minimal performance impact
-4. **Thread Safety**: Ensure thread safety when accessing cache data
-
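-To make the API-design note concrete, one purely hypothetical shape such an API could take is sketched below. None of these names exist in `llama.h`; this is a discussion aid, not an implemented proposal:
-
-```cpp
-// HYPOTHETICAL: nothing below exists in llama.h today; it only
-// illustrates what a cache-statistics API might look like.
-#include <stdint.h>
-
-struct llama_cache_layer_stats {
-    int32_t il;           // layer index
-    bool    is_recurrent; // recurrent (conv/state) vs attention (K/V) layer
-    double  r_sum;        // sum over conv state, or key cache for attention layers
-    double  s_sum;        // sum over recurrent state, or value cache
-};
-
-// Would fill `stats` for up to `n_max` layers and return the count written:
-// int32_t llama_memory_layer_stats(llama_memory_t mem,
-//                                  struct llama_cache_layer_stats * stats,
-//                                  int32_t n_max);
-```
-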
-## Related Files
-
-- `reference/tests/cache_stats_qwen3_next.py`: Python reference implementation
-- `src/llama-memory-hybrid.h`: Hybrid memory structure definitions
-- `src/llama-memory-recurrent.h`: Recurrent memory structure definitions
-- `src/llama-kv-cache.h`: KV cache structure definitions

+ 0 - 37
test_cache_stats.sh

@@ -1,37 +0,0 @@
-#!/bin/bash
-
-# Test script for cache statistics functionality
-# This script demonstrates how to use the --dump-cache flag
-
-echo "Testing llama.cpp cache statistics functionality"
-echo "=============================================="
-
-# Check if a model path is provided
-if [ $# -eq 0 ]; then
-    echo "Usage: $0 <path_to_model.gguf> [prompt]"
-    echo "Example: $0 /path/to/qwen3-next.gguf \"Hello, my name is\""
-    exit 1
-fi
-
-MODEL_PATH="$1"
-PROMPT="${2:-Hello, my name is}"
-
-echo "Model: $MODEL_PATH"
-echo "Prompt: $PROMPT"
-echo ""
-
-# Run llama.cpp with cache statistics enabled
-BIN="./build/bin/llama-cli"
-
-if [ ! -x "$BIN" ]; then
-    echo "Error: $BIN not found or not executable; build llama.cpp first"
-    exit 1
-fi
-
-echo "Running: $BIN -m $MODEL_PATH -p \"$PROMPT\" -n 5 --dump-cache"
-echo ""
-
-# Invoke the binary directly with quoted arguments; building a command
-# string and eval'ing it breaks on paths or prompts containing spaces
-"$BIN" -m "$MODEL_PATH" -p "$PROMPT" -n 5 --dump-cache
-
-echo ""
-echo "Cache statistics test completed."
-echo "=============================================="