# results.feature

@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching
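
  # The background steps above configure how the test harness launches the server; they
  # roughly correspond to the server's command-line options: batch size (-b), KV cache
  # size (-c), max tokens to predict (-n) and continuous batching (--cont-batching), with
  # each scenario's <n_slots> value mapping to the number of parallel slots (--parallel).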

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And   0.0 temperature
    Then  the server is starting
    Then  the server is healthy
    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      | 2       |
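
  # A rough sketch (not the behave step implementation) of what this scenario exercises at
  # the HTTP level, assuming the server's /completion endpoint and its prompt, seed,
  # temperature and n_predict fields; with a fixed seed and temperature 0.0, every request
  # should yield the same text:
  #
  #   import requests, concurrent.futures
  #   body = {"prompt": "Title: Little Red Riding Hood But In Space\n\nSummary:",
  #           "seed": 42, "temperature": 0.0, "n_predict": 128}
  #   def complete(_):
  #       return requests.post("http://localhost:8080/completion", json=body).json()["content"]
  #   with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
  #       results = list(pool.map(complete, range(4)))
  #   assert len(set(results)) == 1  # "all predictions are equal"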

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |
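
  # The mirror image of the sketch above: with temperature 1.0 and a distinct seed per
  # request (42-45), sampling is expected to diverge, so the final check asserts that the
  # four completions are pairwise different rather than equal.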

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy
    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy  # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle
    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy  # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 2          | 0.0  |
      | 4          | 0.0  |
      | 1          | 1.0  |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 1.0  |
      # | 4          | 1.0  |
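
  # The first request runs the prompt on its own and establishes a reference completion;
  # the <n_parallel> follow-up requests re-submit the same prompt and seed so that several
  # copies share a batch. "all predictions are equal" therefore checks that the output does
  # not depend on how many sequences happen to be batched together.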

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy
    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy  # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle
    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy  # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle
    Then  all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      | 4       | 1024 | 1         | 4          |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
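
  # One way the "all token probabilities are equal" step could be realized (a sketch, not
  # the actual step implementation): request per-token probabilities via the /completion
  # endpoint's n_probs field and compare the returned completion_probabilities across
  # identical runs; the n_probs value of 10 below is arbitrary, chosen for illustration:
  #
  #   import requests
  #   body = {"prompt": "The meaning of life is", "seed": 42, "temperature": 1.0,
  #           "n_predict": 1, "n_probs": 10}
  #   runs = [requests.post("http://localhost:8080/completion", json=body).json()
  #           for _ in range(4)]
  #   reference = runs[0]["completion_probabilities"]
  #   assert all(r["completion_probabilities"] == reference for r in runs[1:])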