@llama.cpp
@infill
Feature: llama.cpp server

  # The current model is made by adding FIM tokens to the existing stories260K
  # We may want to use a better model in the future, maybe something like SmolLM 360M

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K-infill.gguf from HF repo ggml-org/models
    And   a model file test-model-infill.gguf
    And   a model alias tinyllama-infill
    And   42 as server seed
    And   1024 as batch size
    And   1024 as ubatch size
    And   2048 KV cache size
    And   64 max tokens to predict
    And   0.0 temperature
    Then  the server is starting
    Then  the server is healthy
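The Background corresponds to launching llama-server with the stories260K infill model and the context/batch settings above, then waiting until the server reports ready. A minimal sketch of that readiness check, assuming only the server's standard GET /health endpoint and the host/port from the Background (the helper name and timeout are illustrative, not part of the test harness):

```python
import time
import urllib.error
import urllib.request

def wait_until_healthy(base_url="http://localhost:8080", timeout_s=30):
    """Poll GET /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health") as resp:
                if resp.status == 200:
                    return True  # mirrors "Then the server is healthy"
        except (urllib.error.URLError, ConnectionError):
            pass  # server still starting up; retry
        time.sleep(0.5)
    raise TimeoutError(f"server at {base_url} did not become healthy in {timeout_s}s")
```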

  Scenario: Infill without input_extra
    Given a prompt "Complete this"
    And   an infill input extra none none
    And   an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_"
    And   an infill input suffix "}\n"
    And   an infill request with no api error
    Then  64 tokens are predicted matching One|day|she|saw|big|scary|bird
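This scenario exercises the server's /infill endpoint with only a prefix and a suffix. A sketch of the equivalent raw request is below; the mapping of the steps onto the JSON fields (input_prefix, input_suffix, n_predict, temperature, seed) and the content field of the response follow the llama.cpp server documentation, but treat the exact shapes as assumptions of this sketch rather than part of the test itself:

```python
import json
import re
import urllib.request

# Rough equivalent of "an infill request with no api error".
payload = {
    "prompt": "Complete this",
    "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_",
    "input_suffix": "}\n",
    "n_predict": 64,
    "temperature": 0.0,
    "seed": 42,
}
req = urllib.request.Request(
    "http://localhost:8080/infill",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

# "64 tokens are predicted matching ..." is a regex match over the generated text.
assert re.search(r"One|day|she|saw|big|scary|bird", body["content"])
```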

  Scenario: Infill with input_extra
    Given a prompt "Complete this"
    And   an infill input extra "llama.h" "LLAMA_API int32_t llama_n_threads();\n"
    And   an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_"
    And   an infill input suffix "}\n"
    And   an infill request with no api error
    Then  64 tokens are predicted matching cuts|Jimmy|mom|came|into|the|room
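The second scenario adds extra file context, which steers the seeded, temperature-0.0 completion toward a different continuation. Only the payload differs from the previous sketch; the shape of the input_extra entries (objects with "filename" and "text" keys) is taken from the llama.cpp server docs and should likewise be treated as an assumption here:

```python
# Same request as above, plus extra file context for the FIM prompt.
payload_with_extra = {
    "prompt": "Complete this",
    "input_extra": [
        {"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads();\n"}
    ],
    "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_",
    "input_suffix": "}\n",
    "n_predict": 64,
    "temperature": 0.0,
    "seed": 42,
}
# With the extra context, the scenario expects the output to match:
# cuts|Jimmy|mom|came|into|the|room
```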