results.feature

@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching
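
  # Illustrative sketch: with the settings above, each completion step below amounts
  # to a POST against the server under test, roughly
  #   curl http://localhost:8080/completion \
  #     -d '{"prompt": "Title: Little Red Riding Hood But In Space\n\nSummary:", "seed": 42, "temperature": 1.0, "n_predict": 128}'
  # (field names assume the llama.cpp server /completion API); the scenarios then
  # compare the returned completions across requests and seeds.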

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      # FIXME: unified KV cache nondeterminism
      # | 2       |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
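
  # Illustrative note: comparing token probabilities assumes the completion requests
  # also ask for per-token probabilities, e.g. by setting "n_probs" in the request
  # body so that the response carries "completion_probabilities" to compare across runs.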