# server.feature

@llama.cpp
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file stories260K.gguf
    And   a model alias tinyllama-2
    And   42 as server seed
      # KV Cache corresponds to the total amount of tokens
      # that can be stored across all independent sequences: #4130
      # see --ctx-size and #5568
    And   32 KV cache size
    And   1 slots
    And   embeddings extraction
    And   32 server max tokens to predict
    And   prometheus compatible metrics exposed
    Then  the server is starting
    Then  the server is healthy
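
  # The background above configures the test server: the small stories260K.gguf
  # model, a 32-token KV cache shared by a single slot, embeddings enabled, a
  # 32-token prediction cap, and Prometheus metrics. A roughly equivalent manual
  # launch might look like the sketch below (flag names assumed from the server's
  # --help output; verify against your build):
  #   ./server -m stories260K.gguf -a tinyllama-2 --host localhost --port 8080 \
  #            -c 32 -np 1 -n 32 --embedding --metrics --seed 42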

  Scenario: Health
    Then the server is ready
    And  all slots are idle
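
  # Illustrative checks for the scenario above (endpoint paths as exposed by the
  # llama.cpp server; exact response fields may vary by version):
  #   GET http://localhost:8080/health   -> {"status": "ok", ...}
  #   GET http://localhost:8080/slots    -> per-slot state, all idle here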

  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   prometheus metrics are exposed

    Examples: Prompts
      | prompt                           | n_predict | re_content                   | n_predicted |
      | I believe the meaning of life is | 8         | read                         | 8           |
      | Write a joke about AI            | 64        | (park<or>friends<or>scared)+ | 32          |
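
  # Illustrative request behind the Completion outline (native /completion
  # endpoint; body sketched here, not taken from the test code). In re_content,
  # <or> presumably stands in for the regex alternation "|", which would
  # otherwise clash with the Gherkin table cell delimiter.
  #   POST http://localhost:8080/completion
  #   {"prompt": "I believe the meaning of life is", "n_predict": 8}
  #   GET  http://localhost:8080/metrics   -> Prometheus text format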

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                 | n_predicted | enable_streaming |
      | llama-2      | Book                        | What is the best book                | 8          | (Mom<or>what)+             | 8           | disabled         |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks<or>happy<or>bird)+ | 32          | enabled          |
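
  # Illustrative request behind the OAI Compatibility outline (OpenAI-style
  # endpoint served by llama.cpp; request body sketched, not taken from the
  # test code):
  #   POST http://localhost:8080/v1/chat/completions
  #   {"model": "llama-2",
  #    "messages": [{"role": "system", "content": "Book"},
  #                 {"role": "user", "content": "What is the best book"}],
  #    "max_tokens": 8, "stream": false}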

  Scenario: Embedding
    When embeddings are computed for:
      """
      What is the capital of Bulgaria ?
      """
    Then embeddings are generated
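
  # Illustrative request for the scenario above (native embedding endpoint;
  # field name assumed from the server API, check your version):
  #   POST http://localhost:8080/embedding
  #   {"content": "What is the capital of Bulgaria ?"}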

  Scenario: OAI Embeddings compatibility
    Given a model tinyllama-2
    When an OAI compatible embeddings computation request for:
      """
      What is the capital of Spain ?
      """
    Then embeddings are generated
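
  # Illustrative OpenAI-style request for the scenario above (body sketched):
  #   POST http://localhost:8080/v1/embeddings
  #   {"model": "tinyllama-2", "input": "What is the capital of Spain ?"}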

  Scenario: OAI Embeddings compatibility with multiple inputs
    Given a model tinyllama-2
    Given a prompt:
      """
      In which country Paris is located ?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain ?
      """
    When an OAI compatible embeddings computation request for multiple inputs
    Then embeddings are generated
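
  # Same endpoint as above, with the two prompts batched as an array (sketch):
  #   POST http://localhost:8080/v1/embeddings
  #   {"model": "tinyllama-2",
  #    "input": ["In which country Paris is located ?",
  #              "Is Madrid the capital of Spain ?"]}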

  Scenario: Tokenize / Detokenize
    When tokenizing:
      """
      What is the capital of France ?
      """
    Then tokens can be detokenize
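
  # Illustrative round trip for the scenario above (field names assumed from
  # the server's tokenize/detokenize API):
  #   POST http://localhost:8080/tokenize    {"content": "What is the capital of France ?"}
  #   POST http://localhost:8080/detokenize  {"tokens": [ ...tokens from the previous call... ]}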