# server.feature

@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a model file test-model.gguf
    And a model alias tinyllama-2
    And 42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And 256 KV cache size
    And 32 as batch size
    And 2 slots
    And 64 server max tokens to predict
    And prometheus compatible metrics exposed
    Then the server is starting
    Then the server is healthy
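    # Rough sketch of an equivalent manual launch, assuming the model file is
    # already downloaded locally; flag names are the usual llama.cpp server
    # options and may differ between versions:
    #   ./server -m stories260K.gguf --alias tinyllama-2 --host localhost --port 8080 \
    #            --ctx-size 256 --batch-size 32 --parallel 2 --n-predict 64 --metrics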

  Scenario: Health
    Then the server is ready
    And all slots are idle
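    # Illustrative check, assuming the server's standard health route (not a
    # step defined in this file):
    #   curl http://localhost:8080/health
    # A ready server answers with a JSON status such as {"status": "ok"}.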

  Scenario Outline: Completion
    Given a prompt <prompt>
    And <n_predict> max tokens to predict
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And the completion is <truncated> truncated
    And <n_prompt> prompt tokens are processed
    And prometheus metrics are exposed
    And metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt                                                                    | n_predict | re_content                                  | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is                                          | 8         | (read\|going)+                              | 18       | 8           | not       |
      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids\|Anna\|forest)+ | 46       | 64          | not       |
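    # Illustrative request for the first example row, assuming the server's
    # native completion endpoint (route and field names are the documented
    # llama.cpp server API, not steps from this file):
    #   curl http://localhost:8080/completion \
    #        -d '{"prompt": "I believe the meaning of life is", "n_predict": 8}'
    # The llamacpp:* counters checked above are scraped from
    # http://localhost:8080/metrics when metrics are enabled.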

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And the completion is truncated
    And 109 prompt tokens are processed
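    # Why truncation is expected (reasoning, not a step): the 256-token KV cache
    # is split across 2 slots, leaving 128 context tokens per slot, and this
    # prompt plus the 64 predicted tokens does not fit, so the server truncates
    # the prompt before decoding.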

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And a system prompt <system_prompt>
    And a user prompt <user_prompt>
    And <max_tokens> max tokens to predict
    And streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And <n_prompt> prompt tokens are processed
    And the completion is <truncated> truncated

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                        | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+                     | 77       | 8           | disabled         | not       |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird\|Annabyear)+ | -1       | 64          | enabled          |           |
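    # Illustrative request for the first example row, assuming the server's
    # OpenAI-compatible route (path and payload follow the documented llama.cpp
    # server API, not steps from this file):
    #   curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
    #        -d '{"model": "llama-2",
    #             "messages": [{"role": "system", "content": "Book"},
    #                          {"role": "user", "content": "What is the best book"}],
    #             "max_tokens": 8, "stream": false}'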

  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And a system prompt test
    And a user prompt test
    And a response format <response_format>
    And 10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format                                                      | n_predicted | re_content  |
      | {"type": "json_object", "schema": {"const": "42"}}                  | 5           | "42"        |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10          | \[ -300 \]  |
      | {"type": "json_object"}                                              | 10          | \{ " Jacky. |

  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenize
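    # Illustrative calls, assuming the server's tokenizer routes (documented
    # llama.cpp server API, not steps from this file):
    #   curl http://localhost:8080/tokenize   -d '{"content": "What is the capital of France ?"}'
    #   curl http://localhost:8080/detokenize -d '{"tokens": [...]}'   # tokens returned by /tokenize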

  Scenario: Models available
    Given available models
    Then 1 models are supported
    Then model 0 is identified by tinyllama-2
    Then model 0 is trained on 128 tokens context
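    # Illustrative check, assuming the OpenAI-compatible model listing route
    # (not a step from this file):
    #   curl http://localhost:8080/v1/models
    # The single entry is expected to be reported under the alias tinyllama-2.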