# infill.feature

@llama.cpp
@infill
Feature: llama.cpp server

  # The current model is made by adding FIM tokens to the existing stories260K
  # We may want to use a better model in the future, maybe something like SmolLM 360M

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K-infill.gguf from HF repo ggml-org/models
    And   a model file test-model-infill.gguf
    And   a model alias tinyllama-infill
    And   42 as server seed
    And   1024 as batch size
    And   1024 as ubatch size
    And   2048 KV cache size
    And   64 max tokens to predict
    And   0.0 temperature
    Then  the server is starting
    Then  the server is healthy
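
  # The Background above amounts to downloading the test model and starting
  # llama-server with matching options. A minimal sketch of an equivalent launch,
  # assuming the harness spawns the binary via subprocess; flag names are taken
  # from llama-server's common options and may differ between versions:
  #
  #     import subprocess
  #
  #     server = subprocess.Popen([
  #         "llama-server",
  #         "--host", "localhost", "--port", "8080",
  #         "--hf-repo", "ggml-org/models",
  #         "--hf-file", "tinyllamas/stories260K-infill.gguf",
  #         "--model", "test-model-infill.gguf",   # local path for the downloaded file
  #         "--alias", "tinyllama-infill",
  #         "--seed", "42",
  #         "--batch-size", "1024",
  #         "--ubatch-size", "1024",
  #         "--ctx-size", "2048",
  #         "--n-predict", "64",
  #         "--temp", "0.0",
  #     ])
  #
  # The "server is starting" / "server is healthy" steps are then expected to
  # poll GET /health until the server reports it is ready.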

  Scenario: Infill without input_extra
    Given a prompt "Complete this"
    And   an infill input extra none none
    And   an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_"
    And   an infill input suffix "}\n"
    And   an infill request with no api error
    Then  64 tokens are predicted matching One|day|she|saw|big|scary|bird
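
  # The scenario above exercises POST /infill with no extra context chunks. A
  # minimal sketch of the request the steps are expected to produce; the field
  # names (prompt, input_prefix, input_suffix) follow the server's infill
  # endpoint, but the exact payload assembled by the test steps is an assumption:
  #
  #     import re
  #     import requests
  #
  #     payload = {
  #         "prompt": "Complete this",
  #         "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\n"
  #                         "int main() {\n    int n_threads = llama_",
  #         "input_suffix": "}\n",
  #         "n_predict": 64,
  #         "temperature": 0.0,
  #     }
  #     resp = requests.post("http://localhost:8080/infill", json=payload)
  #     resp.raise_for_status()
  #     assert re.search(r"One|day|she|saw|big|scary|bird", resp.json()["content"])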

  Scenario: Infill with input_extra
    Given a prompt "Complete this"
    And   an infill input extra "llama.h" "LLAMA_API int32_t llama_n_threads();\n"
    And   an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_"
    And   an infill input suffix "}\n"
    And   an infill request with no api error
    Then  64 tokens are predicted matching cuts|Jimmy|mom|came|into|the|room
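
  # With input_extra, the request differs only in the extra-context chunks.
  # Reusing the payload from the sketch after the previous scenario; the
  # "filename"/"text" field names are an assumption based on the step arguments:
  #
  #     payload["input_extra"] = [
  #         {"filename": "llama.h",
  #          "text": "LLAMA_API int32_t llama_n_threads();\n"},
  #     ]
  #
  # Since the underlying model is stories260K with FIM tokens added, the
  # completion is still story-like text, which is why the expected match is
  # words such as "Jimmy" or "mom" rather than C code.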