@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are equal

    Examples:
      | n_slots |
      | 1       |
      # FIXME: unified KV cache nondeterminism
      # | 2       |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45
    Given concurrent completion requests
    Then  the server is busy
    Then  the server is idle
    And   all slots are idle
    Then  all predictions are different

    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And 0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then  all predictions are equal

    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Then  all token probabilities are equal

    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |