# server.feature

@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a model file test-model.gguf
    And a model alias tinyllama-2
    And BOS token is 1
    And 42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And 256 KV cache size
    And 32 as batch size
    And 2 slots
    And 64 server max tokens to predict
    And prometheus compatible metrics exposed
    Then the server is starting
    Then the server is healthy
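
  # For orientation: the background above corresponds to how the test harness
  # launches the server. As a rough, version-dependent mapping (a hedged
  # sketch, not the exact command): --ctx-size 256 for the KV cache size,
  # --batch-size 32, --parallel 2 for the two slots, --alias tinyllama-2,
  # and --metrics to expose the Prometheus-compatible /metrics endpoint.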

  Scenario: Health
    Then the server is ready
    And all slots are idle
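
  # A minimal sketch of what the readiness check amounts to, assuming the
  # server's /health endpoint (illustrative, not the harness code):
  #   import requests
  #   health = requests.get("http://localhost:8080/health").json()
  #   assert health["status"] == "ok"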

  Scenario Outline: Completion
    Given a prompt <prompt>
    And <n_predict> max tokens to predict
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And the completion is <truncated> truncated
    And <n_prompt> prompt tokens are processed
    And prometheus metrics are exposed
    And metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt | n_predict | re_content | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is | 8 | (read\|going\|pretty)+ | 18 | 8 | not |
      | Write a joke about AI from a very long prompt which will not be truncated | 256 | (princesses\|everyone\|kids\|Anna\|forest)+ | 45 | 64 | not |
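
  # A hedged sketch of the request this outline exercises, assuming the
  # server's /completion and /metrics endpoints (field names illustrative):
  #   import re, requests
  #   resp = requests.post("http://localhost:8080/completion",
  #                        json={"prompt": "I believe the meaning of life is",
  #                              "n_predict": 8}).json()
  #   assert re.search(r"(read|going|pretty)+", resp["content"])
  #   assert resp["tokens_predicted"] == 8
  #   metrics = requests.get("http://localhost:8080/metrics").text  # Prometheus text format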

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And the completion is truncated
    And 109 prompt tokens are processed
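
  # Continuing the sketch above for the truncation case: the lorem-ipsum
  # prompt (represented here by a placeholder variable) does not fit in a
  # slot's share of the 256-token KV cache, and the response is assumed to
  # report this via "truncated" and "tokens_evaluated":
  #   resp = requests.post("http://localhost:8080/completion",
  #                        json={"prompt": long_lorem_ipsum_prompt}).json()
  #   assert resp["truncated"] and resp["tokens_evaluated"] == 109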

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And a system prompt <system_prompt>
    And a user prompt <user_prompt>
    And <max_tokens> max tokens to predict
    And streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And <n_prompt> prompt tokens are processed
    And the completion is <truncated> truncated

    Examples: Prompts
      | model | system_prompt | user_prompt | max_tokens | re_content | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2 | Book | What is the best book | 8 | (Here\|what)+ | 76 | 8 | disabled | not |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128 | (thanks\|happy\|bird\|fireplace)+ | -1 | 64 | enabled | |
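
  # A sketch of the OpenAI-compatible request used by this outline, assuming
  # the /v1/chat/completions endpoint (payload shape follows the OpenAI API):
  #   import requests
  #   resp = requests.post("http://localhost:8080/v1/chat/completions", json={
  #       "model": "llama-2",
  #       "messages": [{"role": "system", "content": "Book"},
  #                    {"role": "user", "content": "What is the best book"}],
  #       "max_tokens": 8,
  #       "stream": False,
  #   }).json()
  #   content = resp["choices"][0]["message"]["content"]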

  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And a system prompt test
    And a user prompt test
    And a response format <response_format>
    And 10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format | n_predicted | re_content |
      | {"type": "json_object", "schema": {"const": "42"}} | 5 | "42" |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10 | \[ -300 \] |
      | {"type": "json_object"} | 10 | \{ " Saragine. |

  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
    And tokens do not begin with BOS
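
  # A minimal sketch of the round trip, assuming the /tokenize and /detokenize
  # endpoints (illustrative):
  #   import requests
  #   url = "http://localhost:8080"
  #   toks = requests.post(url + "/tokenize",
  #                        json={"content": "What is the capital of France ?"}).json()["tokens"]
  #   text = requests.post(url + "/detokenize", json={"tokens": toks}).json()["content"]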

  Scenario: Tokenize w/ BOS
    Given adding special tokens
    When tokenizing:
    """
    What is the capital of Germany?
    """
    Then tokens begin with BOS
    Given first token is removed
    Then tokens can be detokenized
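
  # The BOS variant only asks the tokenizer to add special tokens first; a
  # sketch, assuming /tokenize accepts an "add_special" flag (an assumption):
  #   toks = requests.post(url + "/tokenize",
  #                        json={"content": "What is the capital of Germany?",
  #                              "add_special": True}).json()["tokens"]
  #   assert toks[0] == 1   # BOS token id from the background
  #   toks = toks[1:]       # "first token is removed" before detokenizing again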

  Scenario: Models available
    Given available models
    Then 1 models are supported
    Then model 0 is identified by tinyllama-2
    Then model 0 is trained on 128 tokens context
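
  # A sketch of the model listing check, assuming the OpenAI-style /v1/models
  # endpoint; the trained-context field name ("meta") is an assumption:
  #   models = requests.get("http://localhost:8080/v1/models").json()["data"]
  #   assert len(models) == 1 and models[0]["id"] == "tinyllama-2"
  #   assert models[0]["meta"]["n_ctx_train"] == 128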