server.feature

@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   a model alias tinyllama-2
    And   42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And   32 KV cache size
    And   512 as batch size
    And   1 slots
    And   embeddings extraction
    And   32 server max tokens to predict
    And   prometheus compatible metrics exposed
    Then  the server is starting
    Then  the server is healthy
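
For orientation, here is a minimal Python sketch of what the Background amounts to: launch the server with the corresponding command-line options, then poll until it reports healthy. The binary name ./server, the exact flag spellings (--alias, --ctx-size, --parallel, --embedding, --metrics), and the step-to-flag mapping are assumptions based on the llama.cpp server's documented options, not part of this feature file; requests is a third-party HTTP client.

import subprocess
import time

import requests

# Assumed mapping of Background steps to server flags:
#   "32 KV cache size"       -> --ctx-size 32
#   "512 as batch size"      -> --batch-size 512
#   "1 slots"                -> --parallel 1
#   "embeddings extraction"  -> --embedding
#   "prometheus compatible metrics exposed" -> --metrics
server = subprocess.Popen([
    "./server",
    "--host", "localhost", "--port", "8080",
    "--model", "stories260K.gguf",   # fetched from HF repo ggml-org/models
    "--alias", "tinyllama-2",
    "--seed", "42",
    "--ctx-size", "32",
    "--batch-size", "512",
    "--parallel", "1",
    "--embedding",
    "--n-predict", "32",
    "--metrics",
])

# "the server is starting" / "the server is healthy":
# poll /health until the server answers 200 OK.
while True:
    try:
        if requests.get("http://localhost:8080/health").status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(0.1)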

  Scenario: Health
    Then the server is ready
    And  all slots are idle
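
The Health scenario reads the same /health endpoint; around the time of this feature file the endpoint also reported slot usage alongside the status. A hedged sketch (the slots_processing field name is an assumption):

import requests  # assumes the server from the Background is running

health = requests.get("http://localhost:8080/health").json()
assert health["status"] == "ok"                  # "the server is ready"
assert health.get("slots_processing", 0) == 0    # "all slots are idle"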

  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   prometheus metrics are exposed
    And   metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt                           | n_predict | re_content                       | n_predicted |
      | I believe the meaning of life is | 8         | (read\|going)+                   | 8           |
      | Write a joke about AI            | 64        | (park\|friends\|scared\|always)+ | 32          |
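
In plain HTTP terms, one row of the Examples table above exercises the server's native /completion endpoint. The sketch below assumes the documented llama.cpp server request/response field names (prompt, n_predict, content, tokens_predicted); the metrics check only asserts that the named counter appears in the Prometheus text output.

import re

import requests

# First Examples row: prompt + n_predict in, content + tokens_predicted out.
resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "I believe the meaning of life is",
    "n_predict": 8,
}).json()

# "<n_predicted> tokens are predicted matching <re_content>"
assert resp["tokens_predicted"] == 8
assert re.search(r"(read|going)+", resp["content"])

# "prometheus metrics are exposed" / "metric llamacpp:tokens_predicted ..."
metrics = requests.get("http://localhost:8080/metrics").text
assert "llamacpp:tokens_predicted" in metrics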

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content             | n_predicted | enable_streaming |
      | llama-2      | Book                        | What is the best book                | 8          | (Mom\|what)+           | 8           | disabled         |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks\|happy\|bird)+ | 32          | enabled          |
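
The same kind of request goes through the server's OpenAI-compatible endpoint. A sketch of the first Examples row, using the standard OpenAI chat completions schema that the server mirrors (the usage field is part of that schema; treating completion_tokens as the predicted count is an assumption):

import re

import requests

resp = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "llama-2",
    "messages": [
        {"role": "system", "content": "Book"},
        {"role": "user", "content": "What is the best book"},
    ],
    "max_tokens": 8,
    "stream": False,  # "streaming is disabled"; True would yield SSE chunks
}).json()

content = resp["choices"][0]["message"]["content"]
assert re.search(r"(Mom|what)+", content)
assert resp["usage"]["completion_tokens"] == 8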

  Scenario: Tokenize / Detokenize
    When tokenizing:
      """
      What is the capital of France ?
      """
    Then tokens can be detokenize
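
A round-trip sketch for this scenario against the server's native /tokenize and /detokenize endpoints, assuming the documented content/tokens field names; detokenized output may differ from the input in surrounding whitespace, hence the strip():

import requests

text = "What is the capital of France ?"
tokens = requests.post("http://localhost:8080/tokenize",
                       json={"content": text}).json()["tokens"]
detok = requests.post("http://localhost:8080/detokenize",
                      json={"tokens": tokens}).json()["content"]
assert detok.strip() == text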

  Scenario: Models available
    Given available models
    Then  1 models are supported
    Then  model 0 is identified by tinyllama-2
    Then  model 0 is trained on 128 tokens context
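
Finally, the model listing comes from the OpenAI-compatible /v1/models endpoint, where each entry's id is the alias set in the Background. A sketch; how the training context size (128 for stories260K) is surfaced in the listing is an assumption about the response's metadata, so it is left as a comment:

import requests

models = requests.get("http://localhost:8080/v1/models").json()["data"]
assert len(models) == 1                    # "1 models are supported"
assert models[0]["id"] == "tinyllama-2"    # the alias from the Background
# "model 0 is trained on 128 tokens context": stories260K reports a
# training context of 128 tokens, exposed via the entry's metadata.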