    @llama.cpp
    Feature: llama.cpp server

      Background: Server startup
        Given a server listening on localhost:8080
        And   a model file stories260K.gguf
        And   a model alias tinyllama-2
        And   42 as server seed
          # KV Cache corresponds to the total amount of tokens
          # that can be stored across all independent sequences: #4130
          # see --ctx-size and #5568
        And   32 KV cache size
        And   1 slots
        And   embeddings extraction
        And   32 server max tokens to predict
        And   prometheus compatible metrics exposed
        Then  the server is starting
        Then  the server is healthy
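
The Background steps map almost one-to-one onto llama.cpp server command-line options. As a rough sketch of how such a server could be started from Python (the flag spellings are assumptions based on the server `--help` of this era; verify against your build):

```python
import subprocess

# Sketch only: start ./server with options mirroring the Background steps.
# Flag names (--alias, --parallel, --embedding, --metrics, ...) are assumptions
# taken from the llama.cpp server help text of this era; check ./server --help.
server = subprocess.Popen([
    "./server",
    "--host", "localhost", "--port", "8080",  # a server listening on localhost:8080
    "--model", "stories260K.gguf",            # a model file stories260K.gguf
    "--alias", "tinyllama-2",                 # a model alias tinyllama-2
    "--seed", "42",                           # 42 as server seed
    "--ctx-size", "32",                       # 32 KV cache size
    "--parallel", "1",                        # 1 slots
    "--embedding",                            # embeddings extraction
    "--n-predict", "32",                      # 32 server max tokens to predict
    "--metrics",                              # prometheus compatible metrics exposed
])
```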

      Scenario: Health
        Then the server is ready
        And  all slots are idle
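
The Health scenario amounts to polling the server's health route until it reports ready. A minimal equivalent check, assuming the `requests` library and the `/health` payload of this era (the slot counter field in particular is an assumption):

```python
import requests

health = requests.get("http://localhost:8080/health")
body = health.json()

assert health.status_code == 200
assert body["status"] == "ok"                  # "the server is ready"
# slot counters are an assumption about this era's /health payload
assert body.get("slots_processing", 0) == 0    # "all slots are idle"
```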

      Scenario Outline: Completion
        Given a prompt <prompt>
        And   <n_predict> max tokens to predict
        And   a completion request with no api error
        Then  <n_predicted> tokens are predicted matching <re_content>
        And   prometheus metrics are exposed

        Examples: Prompts
          | prompt                           | n_predict | re_content                   | n_predicted |
          | I believe the meaning of life is | 8         | read                         | 8           |
          | Write a joke about AI            | 64        | (park<or>friends<or>scared)+ | 32          |
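
Each Examples row drives the native completion route and then scrapes the Prometheus endpoint; `re_content` is applied as a regular expression over the generated text, with `<or>` standing in for the `|` alternation that would otherwise terminate the table cell. A sketch of what the first row exercises, assuming `requests` and the response fields of this era's server:

```python
import re
import requests

BASE = "http://localhost:8080"

resp = requests.post(f"{BASE}/completion", json={
    "prompt": "I believe the meaning of life is",
    "n_predict": 8,
})
body = resp.json()
assert resp.status_code == 200
assert body["tokens_predicted"] == 8        # "<n_predicted> tokens are predicted ..."
assert re.search("read", body["content"])   # "... matching <re_content>"

# "prometheus metrics are exposed" -- metric names are prefixed llamacpp: in this era
metrics = requests.get(f"{BASE}/metrics")
assert metrics.status_code == 200 and "llamacpp:" in metrics.text
```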

      Scenario Outline: OAI Compatibility
        Given a model <model>
        And   a system prompt <system_prompt>
        And   a user prompt <user_prompt>
        And   <max_tokens> max tokens to predict
        And   streaming is <enable_streaming>
        Given an OAI compatible chat completions request with no api error
        Then  <n_predicted> tokens are predicted matching <re_content>

        Examples: Prompts
          | model        | system_prompt               | user_prompt                          | max_tokens | re_content                 | n_predicted | enable_streaming |
          | llama-2      | Book                        | What is the best book                | 8          | (Mom<or>what)+             | 8           | disabled         |
          | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks<or>happy<or>bird)+ | 32          | enabled          |
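
This outline goes through the OpenAI-compatible chat endpoint rather than the native one, so an OpenAI client could be pointed at the same base URL. A sketch of the first row with plain `requests` (the second row repeats the request with streaming enabled):

```python
import requests

resp = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "llama-2",
    "messages": [
        {"role": "system", "content": "Book"},
        {"role": "user", "content": "What is the best book"},
    ],
    "max_tokens": 8,
    "stream": False,   # the "enabled" row sends the same request with "stream": True
})
choice = resp.json()["choices"][0]
# expected to match the re_content pattern, i.e. one or more of "Mom" / "what"
print(choice["message"]["content"])
```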

      Scenario: Embedding
        When embeddings are computed for:
          """
          What is the capital of Bulgaria ?
          """
        Then embeddings are generated
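
This scenario uses the server's native embeddings route, available because the Background enables `embeddings extraction`. A minimal sketch, assuming the `/embedding` route and its `content`/`embedding` field names:

```python
import requests

resp = requests.post("http://localhost:8080/embedding", json={
    "content": "What is the capital of Bulgaria ?",
})
embedding = resp.json()["embedding"]
# "embeddings are generated": a non-empty, non-all-zero vector comes back
assert len(embedding) > 0 and any(v != 0.0 for v in embedding)
```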

      Scenario: OAI Embeddings compatibility
        Given a model tinyllama-2
        When an OAI compatible embeddings computation request for:
          """
          What is the capital of Spain ?
          """
        Then embeddings are generated

      Scenario: OAI Embeddings compatibility with multiple inputs
        Given a model tinyllama-2
        Given a prompt:
          """
          In which country Paris is located ?
          """
        And a prompt:
          """
          Is Madrid the capital of Spain ?
          """
        When an OAI compatible embeddings computation request for multiple inputs
        Then embeddings are generated
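
The two OAI embeddings scenarios target the OpenAI-style route, whose `input` field accepts either a single string or a list of strings. A sketch covering both cases, assuming the `/v1/embeddings` route and the OpenAI-shaped response it mirrors:

```python
import requests

BASE = "http://localhost:8080"

# Single input -- "OAI Embeddings compatibility"
single = requests.post(f"{BASE}/v1/embeddings", json={
    "model": "tinyllama-2",
    "input": "What is the capital of Spain ?",
}).json()
assert len(single["data"]) == 1

# Multiple inputs -- "OAI Embeddings compatibility with multiple inputs"
multi = requests.post(f"{BASE}/v1/embeddings", json={
    "model": "tinyllama-2",
    "input": [
        "In which country Paris is located ?",
        "Is Madrid the capital of Spain ?",
    ],
}).json()
assert len(multi["data"]) == 2
for item in multi["data"]:
    assert len(item["embedding"]) > 0   # "embeddings are generated"
```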

      Scenario: Tokenize / Detokenize
        When tokenizing:
          """
          What is the capital of France ?
          """
        Then tokens can be detokenize
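
The last scenario round-trips text through the tokenizer endpoints. A sketch, assuming the `/tokenize` and `/detokenize` routes and their `content`/`tokens` fields (detokenized output can differ from the input by leading whitespace, hence the strip):

```python
import requests

BASE = "http://localhost:8080"
text = "What is the capital of France ?"

tokens = requests.post(f"{BASE}/tokenize", json={"content": text}).json()["tokens"]
assert len(tokens) > 0

detok = requests.post(f"{BASE}/detokenize", json={"tokens": tokens}).json()["content"]
assert detok.strip() == text.strip()   # "tokens can be detokenize[d]"
```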