
@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a model alias tinyllama-2
    And 42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And 32 KV cache size
    And 512 as batch size
    And 1 slots
    And embeddings extraction
    And 32 server max tokens to predict
    And prometheus compatible metrics exposed
    Then the server is starting
    Then the server is healthy
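
  # The Background above runs before every scenario below, so each scenario
  # starts against a freshly launched server that already reports healthy
  # (in llama.cpp's server, the GET /health route).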

  Scenario: Health
    Then the server is ready
    And all slots are idle
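
  # Completion exercises the plain completion endpoint (/completion in
  # llama.cpp's server). n_predicted can come out lower than n_predict:
  # the Background caps the server at 32 max tokens to predict, which is
  # why the 64-token request in the second example stops at 32.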
  Scenario Outline: Completion
    Given a prompt <prompt>
    And <n_predict> max tokens to predict
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And prometheus metrics are exposed

    Examples: Prompts
      | prompt                           | n_predict | re_content                       | n_predicted |
      | I believe the meaning of life is | 8         | (read\|going)+                   | 8           |
      | Write a joke about AI            | 64        | (park\|friends\|scared\|always)+ | 32          |
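
  # OAI Compatibility drives the same server through the OpenAI-compatible
  # chat route (/v1/chat/completions), with streaming both enabled and
  # disabled. The same 32-token server cap explains max_tokens 64 yielding
  # n_predicted 32 below.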
  Scenario Outline: OAI Compatibility
    Given a model <model>
    And a system prompt <system_prompt>
    And a user prompt <user_prompt>
    And <max_tokens> max tokens to predict
    And streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content             | n_predicted | enable_streaming |
      | llama-2      | Book                        | What is the best book                | 8          | (Mom\|what)+           | 8           | disabled         |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks\|happy\|bird)+ | 32          | enabled          |
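
  # Tokenize / Detokenize checks the tokenizer round trip (the /tokenize
  # and /detokenize routes): detokenizing the tokens of the docstring
  # prompt is expected to reproduce the original text.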
  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
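
  # Models available queries the model listing (the OAI-style /v1/models
  # route): exactly one model should be reported, identified by the
  # tinyllama-2 alias set in the Background, with the 128-token training
  # context of stories260K.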
  Scenario: Models available
    Given available models
    Then 1 models are supported
    Then model 0 is identified by tinyllama-2
    Then model 0 is trained on 128 tokens context
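
  # These steps are implemented by python-behave step definitions; in
  # llama.cpp they live under examples/server/tests, and the suite is run
  # with `behave` from that directory.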