# slotsave.feature

@llama.cpp
@slotsave
Feature: llama.cpp server slot management

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And prompt caching is enabled
    And 2 slots
    And . as slot save path
    And 2048 KV cache size
    And 42 as server seed
    And 24 max tokens to predict
    Then the server is starting
    Then the server is healthy

  Scenario: Save and Restore Slot
    # The first prompt in slot 1 should be fully processed
    Given a user prompt "What is the capital of France?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
    When the slot 1 is saved with filename "slot1.bin"
    Then the server responds with status code 200
    # Since the prompt cache is warm, only the trailing tokens should be processed
    Given a user prompt "What is the capital of Germany?"
    And a completion request with no api error
    Then 24 tokens are predicted matching (Thank|special)
    And 7 prompt tokens are processed
    # After restoring the original cache into slot 0,
    # only 1 prompt token should be processed and the output should be identical
    When the slot 0 is restored with filename "slot1.bin"
    Then the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And using slot id 0
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 1 prompt tokens are processed
    # To verify that slot 1 was not corrupted by the slot 0 restore, repeat the check
    Given a user prompt "What is the capital of Germany?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Thank|special)
    And 1 prompt tokens are processed

  Scenario: Erase Slot
    Given a user prompt "What is the capital of France?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
    When the slot 1 is erased
    Then the server responds with status code 200
    # With the slot erased, the full prompt must be reprocessed
    Given a user prompt "What is the capital of France?"
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed