# llama.cpp/example/parallel

Simplified simulation of serving incoming requests in parallel

## Example

Generate 128 client requests (`-ns 128`), simulating 8 concurrent clients (`-np 8`). The system prompt is shared (`-pps`), meaning that it is computed once at the start. The client requests consist of 10 junk questions (`--junk 10`) followed by the actual question.

```bash
llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384
```
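
To see how the number of concurrent clients affects throughput, the same invocation can be swept over several `-np` values. This is a minimal sketch, assuming `model.gguf` is present in the working directory and `llama-parallel` is on the `PATH`; the flags are the same ones used above.

```bash
#!/usr/bin/env bash
# Sketch: run the simulation with an increasing number of concurrent
# clients, keeping the request count and sampling settings fixed.
for np in 1 2 4 8; do
    echo "=== ${np} concurrent clients ==="
    llama-parallel -m model.gguf -np "${np}" -ns 32 \
        --top-k 1 -pps --junk 10 -c 16384
done
```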

> [!NOTE]
> It's recommended to use base models with this example. Instruction tuned models might not be able to properly follow the custom chat template specified here, so the results might not be as expected.