@llama.cpp
Feature: Parallel

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file stories260K.gguf
    And a model alias tinyllama-2
    And 42 as server seed
    And 64 KV cache size
    And 2 slots
    And embeddings extraction
    And continuous batching
    Then the server is starting
    Then the server is healthy
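
  # For orientation: this background presumably corresponds to launching the
  # llama.cpp server example with flags along these lines (a sketch, not the
  # test harness's exact invocation):
  #   ./server -m stories260K.gguf -a tinyllama-2 --host localhost --port 8080 \
  #            -c 64 -np 2 -cb --embedding
  # With 64 KV cache cells shared across 2 slots, each concurrent request gets
  # roughly 32 cells of context.
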
  Scenario Outline: Multi-user completion
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another set of very long music lyrics.
      """
    And <n_predict> max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | n_predict |
      | 128       |
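
  # Each "concurrent completion requests" step presumably issues one POST per
  # queued prompt against the server's native endpoint; a minimal sketch of
  # one request body, using the documented "prompt" / "n_predict" fields:
  #   POST /completion
  #   {"prompt": "Write a very long story about AI.", "n_predict": 128}
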
  Scenario Outline: Multi-user OAI completions compatibility
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
      """
      Write a very long book.
      """
    And a prompt:
      """
      Write another poem.
      """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |
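
  # The OAI-compatible variant goes through the OpenAI-style route instead; a
  # sketch of one streamed request using standard OpenAI fields (the harness's
  # exact payload is an assumption):
  #   POST /v1/chat/completions
  #   {"model": "tinyllama-2", "max_tokens": 64, "stream": true,
  #    "messages": [{"role": "system", "content": "You are a writer."},
  #                 {"role": "user", "content": "Write a very long book."}]}
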
  Scenario Outline: Multi-user OAI completions compatibility, no /v1 prefix
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
      """
      Write a very long book.
      """
    And a prompt:
      """
      Write another poem.
      """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests no v1
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens

    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |
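
  # "no v1" presumably exercises the same OAI-style handler mounted without
  # the /v1 prefix, i.e. POST /chat/completions with a body identical to the
  # sketch above; the scenario pins both routes to the same behaviour.
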
  Scenario: Multi-user completion where the total number of tokens to predict exceeds the KV cache size #3969
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another set of very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    And 128 max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted
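
  # 4 prompts at up to 128 tokens each cannot fit in a 64-cell KV cache, the
  # situation behind issue #3969. Note the deliberately weaker assertion:
  # prompts only need to produce output, not reach the full 128 tokens, since
  # completions are not guaranteed their full token budget under cache
  # pressure.
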
  Scenario: Multi-user embeddings
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another set of very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    Given concurrent embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated
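
  # A sketch of the native embeddings call each prompt presumably maps to,
  # using the "content" field of llama.cpp's /embedding endpoint (treat the
  # exact payload as an assumption):
  #   POST /embedding
  #   {"content": "Write a very long story about AI."}
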
  Scenario: Multi-user OAI-compatible embeddings
    Given a prompt:
      """
      In which country is Paris located?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain?
      """
    And a prompt:
      """
      What is the biggest US city?
      """
    And a prompt:
      """
      What is the capital of Bulgaria?
      """
    And a model tinyllama-2
    Given concurrent OAI embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated
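
  # The OAI-compatible flavour, using the standard OpenAI embeddings fields
  # (the harness's exact payload is an assumption):
  #   POST /v1/embeddings
  #   {"model": "tinyllama-2", "input": "In which country is Paris located?"}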