# server.feature

@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   a model alias tinyllama-2
    And   42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And   256 KV cache size
    And   32 as batch size
    And   2 slots
    And   64 server max tokens to predict
    And   prometheus compatible metrics exposed
    Then  the server is starting
    Then  the server is healthy
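
  # The background above is roughly equivalent to launching the server with a
  # command line along these lines (a sketch only: flag spellings are assumed
  # from the llama.cpp server options, and the test harness downloads the model
  # and builds the actual invocation itself):
  #
  #   ./server --host localhost --port 8080 \
  #            -m stories260K.gguf --alias tinyllama-2 \
  #            --ctx-size 256 --batch-size 32 --parallel 2 \
  #            --n-predict 64 --metrics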

  Scenario: Health
    Then  the server is ready
    And   all slots are idle
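
  # Readiness and slot state are checked over plain HTTP, approximately:
  #
  #   curl http://localhost:8080/health
  #
  # (endpoint name assumed from the llama.cpp server docs; the exact assertions
  # live in the harness's step definitions)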

  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   the completion is <truncated> truncated
    And   <n_prompt> prompt tokens are processed
    And   prometheus metrics are exposed
    And   metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt | n_predict | re_content | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is | 8 | (read\|going)+ | 18 | 8 | not |
      | Write a joke about AI from a very long prompt which will not be truncated | 256 | (princesses\|everyone\|kids)+ | 46 | 64 | not |
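
  # Each table row above drives, approximately, a raw completion request such as
  # the following (endpoint and field names assumed from the llama.cpp server
  # docs); the generated content is matched against re_content and the reported
  # token counts against n_prompt / n_predicted:
  #
  #   curl http://localhost:8080/completion \
  #     -H "Content-Type: application/json" \
  #     -d '{"prompt": "I believe the meaning of life is", "n_predict": 8}'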

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And   a completion request with no api error
    Then  64 tokens are predicted matching fun|Annaks|popcorns
    And   the completion is truncated
    And   109 prompt tokens are processed
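
  # With 2 slots sharing the 256-token KV cache configured in the background,
  # each slot only has about 128 tokens of context, so this prompt cannot be
  # kept in full: the server is expected to truncate it and to flag the
  # completion as truncated in its response.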

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   <n_prompt> prompt tokens are processed
    And   the completion is <truncated> truncated

    Examples: Prompts
      | model | system_prompt | user_prompt | max_tokens | re_content | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2 | Book | What is the best book | 8 | (Here\|what)+ | 77 | 8 | disabled | not |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128 | (thanks\|happy\|bird)+ | -1 | 64 | enabled | |
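
  # The first row above corresponds, roughly, to an OpenAI-style request such as
  # the following (a sketch: the harness may go through an OpenAI client rather
  # than raw curl, and enabling streaming flips the "stream" field):
  #
  #   curl http://localhost:8080/v1/chat/completions \
  #     -H "Content-Type: application/json" \
  #     -d '{"model": "llama-2",
  #          "messages": [{"role": "system", "content": "Book"},
  #                       {"role": "user",   "content": "What is the best book"}],
  #          "max_tokens": 8,
  #          "stream": false}'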

  Scenario: Tokenize / Detokenize
    When  tokenizing:
    """
    What is the capital of France ?
    """
    Then  tokens can be detokenized
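
  # Approximately the round trip below (endpoint and field names assumed from
  # the llama.cpp server docs); the tokens returned by /tokenize are fed back
  # into /detokenize and the resulting text is compared with the original:
  #
  #   curl http://localhost:8080/tokenize -d '{"content": "What is the capital of France ?"}'
  #   curl http://localhost:8080/detokenize -d '{"tokens": [...]}'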

  Scenario: Models available
    Given available models
    Then  1 models are supported
    Then  model 0 is identified by tinyllama-2
    Then  model 0 is trained on 128 tokens context
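
  # Roughly equivalent to listing models through the OpenAI-compatible endpoint
  # (a sketch; the response is expected to contain a single entry identified by
  # the tinyllama-2 alias, with the 128-token training context reported in the
  # model metadata):
  #
  #   curl http://localhost:8080/v1/models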