# results.feature
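#
# Determinism tests for the llama.cpp server completion endpoint: same-seed
# requests must match, different-seed requests must differ, and results must
# not depend on how requests are batched across slots.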

@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 128 as batch size
    And 1024 KV cache size
    And 128 max tokens to predict
    And continuous batching
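
  # Identical prompts submitted concurrently with the same seed should yield
  # identical predictions, whether they are served by one slot or two.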
  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    Then the server is starting
    Then the server is healthy
    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are equal

    Examples:
      | n_slots |
      | 1       |
      | 2       |
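
  # Identical prompts submitted with different seeds should each yield a
  # different prediction.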
  Scenario Outline: different results with different seed
    Given <n_slots> slots
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are different

    Examples:
      | n_slots |
      | 1       |
      | 2       |
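
  # Batching should not change results: a prompt completed on its own and the
  # same prompt completed as one of <n_parallel> concurrent requests must
  # produce equal predictions for the same seed.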
  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And <temp> temperature
    # And 0 as draft
    Then the server is starting
    Then the server is healthy
    Given 1 prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle
    Then all predictions are equal

    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 2          | 0.0  |
      | 4          | 0.0  |
      | 1          | 1.0  |
      # FIXME: These tests fail on master. The problem seems to be the unified KV cache.
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574 .
      # | 2          | 1.0  |
      # | 4          | 1.0  |