@llama.cpp
@ctx_shift
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   a model file test-model.gguf
    And   a model alias tinyllama-2
    And   BOS token is 1
    And   42 as server seed
    And   256 KV cache size
    And   32 as batch size
    And   2 slots

  # the prompt is 301 tokens
  # the slot context is 256/2 = 128 tokens
  # the prompt is truncated to keep the last 109 tokens
  # 64 tokens are generated thanks to shifting the context when it gets full
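  # rough sketch of that arithmetic, assuming the server's halving-block truncation with n_keep = 0:
  #   n_left = 128 - 0 = 128, block size = n_left / 2 = 64
  #   erased blocks = (301 - 0 - 64) / 64 = 3, kept prompt = 301 - 3 * 64 = 109 tokens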
  Scenario: Inference with context shift
    And   64 server max tokens to predict
    Then  the server is starting
    Then  the server is healthy
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And   a completion request with no api error
    Then  64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And   the completion is truncated
    And   109 prompt tokens are processed

  Scenario Outline: Inference without context shift
    And   <n_predict> server max tokens to predict
    And   disable context shifting
    Then  the server is starting
    Then  the server is healthy
    Given a prompt:
    """
    Hi how are you
    """
    And   a completion request with no api error
    Then  <n_token_output> tokens are predicted matching twind|Anna
    And   the completion is <truncated> truncated
    And   8 prompt tokens are processed

    Examples:
      | n_predict | n_token_output | truncated |
      | 64        | 64             | not       |
      | -1        | 120            |           |
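    # with context shifting disabled, generation stops once the 128-token slot context is full:
    # 8 prompt tokens + 120 generated tokens = 128, so n_predict = -1 ends up truncated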

  Scenario: Inference without context shift (expected error: prompt too long)
    And   disable context shifting
    Then  the server is starting
    Then  the server is healthy
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And   a completion request with 400 api error
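    # the 301-token prompt exceeds the 128-token slot context and, with context shifting
    # disabled, cannot be truncated, so the server is expected to reject the request with a 400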