1
0

parallel.feature 3.6 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146
  1. @llama.cpp
  2. @parallel
  3. Feature: Parallel
  4. Background: Server startup
  5. Given a server listening on localhost:8080
  6. And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
  7. And 42 as server seed
  8. And 512 as batch size
  9. And 64 KV cache size
  10. And 2 slots
  11. And embeddings extraction
  12. And continuous batching
  13. Then the server is starting
  14. Then the server is healthy
  15. Scenario Outline: Multi users completion
  16. Given a prompt:
  17. """
  18. Write a very long story about AI.
  19. """
  20. And a prompt:
  21. """
  22. Write another very long music lyrics.
  23. """
  24. And <n_predict> max tokens to predict
  25. Given concurrent completion requests
  26. Then the server is busy
  27. Then the server is idle
  28. And all slots are idle
  29. Then all prompts are predicted with <n_predict> tokens
  30. Examples:
  31. | n_predict |
  32. | 128 |
  33. Scenario Outline: Multi users OAI completions compatibility
  34. Given a system prompt You are a writer.
  35. And a model tinyllama-2
  36. Given a prompt:
  37. """
  38. Write a very long book.
  39. """
  40. And a prompt:
  41. """
  42. Write another a poem.
  43. """
  44. And <n_predict> max tokens to predict
  45. And streaming is <streaming>
  46. Given concurrent OAI completions requests
  47. Then the server is busy
  48. Then the server is idle
  49. Then all prompts are predicted with <n_predict> tokens
  50. Examples:
  51. | streaming | n_predict |
  52. | disabled | 128 |
  53. | enabled | 64 |
  54. Scenario Outline: Multi users OAI completions compatibility no v1
  55. Given a system prompt You are a writer.
  56. And a model tinyllama-2
  57. Given a prompt:
  58. """
  59. Write a very long book.
  60. """
  61. And a prompt:
  62. """
  63. Write another a poem.
  64. """
  65. And <n_predict> max tokens to predict
  66. And streaming is <streaming>
  67. Given concurrent OAI completions requests no v1
  68. Then the server is busy
  69. Then the server is idle
  70. Then all prompts are predicted with <n_predict> tokens
  71. Examples:
  72. | streaming | n_predict |
  73. | disabled | 128 |
  74. | enabled | 64 |
  75. Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size #3969
  76. Given a prompt:
  77. """
  78. Write a very long story about AI.
  79. """
  80. And a prompt:
  81. """
  82. Write another very long music lyrics.
  83. """
  84. And a prompt:
  85. """
  86. Write a very long poem.
  87. """
  88. And a prompt:
  89. """
  90. Write a very long joke.
  91. """
  92. And 128 max tokens to predict
  93. Given concurrent completion requests
  94. Then the server is busy
  95. Then the server is idle
  96. Then all prompts are predicted
  97. Scenario: Multi users embeddings
  98. Given a prompt:
  99. """
  100. Write a very long story about AI.
  101. """
  102. And a prompt:
  103. """
  104. Write another very long music lyrics.
  105. """
  106. And a prompt:
  107. """
  108. Write a very long poem.
  109. """
  110. And a prompt:
  111. """
  112. Write a very long joke.
  113. """
  114. Given concurrent embedding requests
  115. Then the server is busy
  116. Then the server is idle
  117. Then all embeddings are generated
  118. Scenario: Multi users OAI compatibility embeddings
  119. Given a prompt:
  120. """
  121. In which country Paris is located ?
  122. """
  123. And a prompt:
  124. """
  125. Is Madrid the capital of Spain ?
  126. """
  127. And a prompt:
  128. """
  129. What is the biggest US city ?
  130. """
  131. And a prompt:
  132. """
  133. What is the capital of Bulgaria ?
  134. """
  135. And a model tinyllama-2
  136. Given concurrent OAI embedding requests
  137. Then the server is busy
  138. Then the server is idle
  139. Then all embeddings are generated