results.feature

@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are equal

    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are different

    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then  all predictions are equal

    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 2          | 0.0  |
      | 4          | 0.0  |
      | 1          | 1.0  |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then  all token probabilities are equal

    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      | 4       | 1024 | 1         | 4          |
      # FIXME: These tests fail on master.
      # Problems: unified KV cache (except for CPU backend with LLAMA_NO_LLAMAFILE=1), SIMD nondeterminism.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 100       | 1         |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4         |
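
For context on how a step such as "Then all predictions are equal" gets executed: the feature file is driven by the Python behave runner used by llama.cpp's server test suite (e.g. behave --no-capture --tags results from the tests directory), and each Gherkin step maps to a Python step definition under examples/server/tests. The snippet below is only a minimal sketch of what such an equality check could look like; the attribute context.completions and the function name are assumptions for illustration, not the actual llama.cpp step code.

    # Hypothetical behave step definition; the real implementation differs in detail.
    from behave import step


    @step('all predictions are equal')
    def step_all_predictions_equal(context):
        # Assumption: earlier steps collected each request's completion text
        # into the list context.completions.
        completions = context.completions
        assert len(completions) > 1, "need at least two completions to compare"
        first = completions[0]
        for i, content in enumerate(completions[1:], start=1):
            assert content == first, (
                f"prediction {i} differs from prediction 0: {content!r} vs {first!r}"
            )

In the same spirit, "Then all predictions are different" would assert pairwise inequality instead of equality.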