The INI preset feature, introduced in PR#17859, allows users to create reusable and shareable parameter configurations for llama.cpp.
When running multiple models on the server (router mode), INI preset files can be used to configure model-specific parameters. Please refer to the server documentation for more details.
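As a rough sketch only (the repo and section names below are hypothetical, and the exact router-mode wiring is described in the server documentation), such a preset could give each served model its own section with model-specific settings:

```ini
[my-small-model]
hf = username/my-small-model-GGUF
ctx-size = 4096
temp = 0.2

[my-large-model]
hf = username/my-large-model-GGUF
ctx-size = 8192
temp = 0.8
```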
> [!NOTE]
> This feature is currently only supported via the `-hf` option.
For GGUF models hosted on Hugging Face, you can include a `preset.ini` file in the root directory of the repository to define specific configurations for that model.
Example:
```ini
hf-repo-draft = username/my-draft-model-GGUF
temp = 0.5
top-k = 20
top-p = 0.95
```
For security reasons, only certain options are allowed. Please refer to `preset.cpp` for the complete list of permitted options.
Example usage:

Assuming your repository `username/my-model-with-preset` contains a `preset.ini` with the configuration above:
```sh
llama-cli -hf username/my-model-with-preset

# This is equivalent to:
llama-cli -hf username/my-model-with-preset \
    --hf-repo-draft username/my-draft-model-GGUF \
    --temp 0.5 \
    --top-k 20 \
    --top-p 0.95
```
You can also override preset arguments by specifying them on the command line:
```sh
# Force temp = 0.1, overriding the preset value
llama-cli -hf username/my-model-with-preset --temp 0.1
```
If you want to define multiple preset configurations for one or more GGUF models, you can create a blank HF repo for each preset. Each HF repo should contain a `preset.ini` file that references the actual model(s):
```ini
hf-repo = user/my-model-main
hf-repo-draft = user/my-model-draft
temp = 0.8
ctx-size = 1024
; (and other configurations)
```
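Assuming the blank preset repo is named `user/my-preset-repo` (a hypothetical name), it can then be used like any regular model repo:

```sh
llama-cli -hf user/my-preset-repo
```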
Alternatively, you can create a single blank HF repo containing one `preset.ini` file with a section per preset, each referencing the actual model(s):
```ini
[*]
mmap = 1

[gpt-oss-20b-hf]
hf = ggml-org/gpt-oss-20b-GGUF
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}

[gpt-oss-120b-hf]
hf = ggml-org/gpt-oss-120b-GGUF
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
```
You can then use it via `llama-cli` or `llama-server`, for example:
```sh
llama-server -hf user/repo:gpt-oss-120b-hf
```
Please make sure to provide the correct `hf-repo` for each child preset. Otherwise, you may get the error `The specified tag is not a valid quantization scheme`.