```gherkin
@llama.cpp
@slotsave
Feature: llama.cpp server slot management

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   prompt caching is enabled
    And   2 slots
    And   . as slot save path
    And   2048 KV cache size
    And   42 as server seed
    And   24 max tokens to predict
    Then  the server is starting
    Then  the server is healthy
```
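
Each Background step maps onto a server start-up option. As a rough sketch (not the test harness itself): the flag spellings below are assumptions based on llama.cpp's server `--help`, the binary name depends on the build, and the model path is wherever the HF download step placed `stories260K.gguf`.

```python
import subprocess

# Hedged sketch of the server launch the Background describes.
server = subprocess.Popen([
    "./llama-server",                     # binary name varies by build
    "-m", "models/stories260K.gguf",      # tinyllamas test model
    "--host", "localhost", "--port", "8080",
    "-np", "2",                           # 2 slots
    "--slot-save-path", ".",              # "." as slot save path
    "-c", "2048",                         # 2048 KV cache size
    "--seed", "42",                       # 42 as server seed
    "-n", "24",                           # 24 max tokens to predict
])
```

Prompt caching itself is requested per completion via the `cache_prompt` field in the request body rather than through a start-up flag.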
```gherkin
  Scenario: Save and Restore Slot
    # First prompt in slot 1 should be fully processed
    Given a user prompt "What is the capital of France?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
    When  the slot 1 is saved with filename "slot1.bin"
    Then  the server responds with status code 200
    # Since the prompt is cached, only the last few prompt tokens should be processed
    Given a user prompt "What is the capital of Germany?"
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Thank|special|Lily)
    And   7 prompt tokens are processed
    # Loading the original cache into slot 0,
    # we should only be processing 1 prompt token and get the same output
    When  the slot 0 is restored with filename "slot1.bin"
    Then  the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And   using slot id 0
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   1 prompt tokens are processed
    # Verify that slot 1 was not corrupted during the slot 0 load:
    # repeating its prompt should again hit the cache
    Given a user prompt "What is the capital of Germany?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Thank|special|Lily)
    And   1 prompt tokens are processed
```
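
The save and restore steps exercise the server's `/slots` endpoints, which are only available when `--slot-save-path` is set. Below is a minimal sketch of the same round trip using `requests`; the `id_slot` and `cache_prompt` request fields and the `timings.prompt_n` counter follow the server's HTTP API as documented in its README, and may differ on older builds (which used `slot_id`).

```python
import requests

BASE = "http://localhost:8080"

def complete(prompt: str, slot: int) -> dict:
    # cache_prompt keeps the evaluated prompt in the slot's KV cache,
    # which is what the prompt-token-count assertions rely on.
    r = requests.post(f"{BASE}/completion", json={
        "prompt": prompt,
        "id_slot": slot,
        "cache_prompt": True,
        "n_predict": 24,
    })
    r.raise_for_status()
    return r.json()

first = complete("What is the capital of France?", slot=1)

# Persist slot 1's KV cache to ./slot1.bin (relative to --slot-save-path).
r = requests.post(f"{BASE}/slots/1?action=save", json={"filename": "slot1.bin"})
assert r.status_code == 200

# Load that cache into slot 0: the repeated prompt should now need only
# a single prompt token and reproduce the original completion.
r = requests.post(f"{BASE}/slots/0?action=restore", json={"filename": "slot1.bin"})
assert r.status_code == 200

again = complete("What is the capital of France?", slot=0)
assert again["content"] == first["content"]
assert again["timings"]["prompt_n"] == 1
```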
```gherkin
  Scenario: Erase Slot
    Given a user prompt "What is the capital of France?"
    And   using slot id 1
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
    When  the slot 1 is erased
    Then  the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And   a completion request with no api error
    Then  24 tokens are predicted matching (Lily|cake)
    And   22 prompt tokens are processed
```
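
Erasing uses the same endpoint family with `action=erase`; afterwards the identical prompt is a cold start again, which is why the scenario expects all 22 prompt tokens to be reprocessed. A hedged sketch:

```python
import requests

# Wipe slot 1's KV cache; no request body is needed for erase.
r = requests.post("http://localhost:8080/slots/1?action=erase")
assert r.status_code == 200
```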