| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Daniel Bevenius | d3dce4e0a5 | sampling : add support for backend sampling (#17004) | 3 weeks ago |
| Georgi Gerganov | a554a1ecc7 | context : fix reserve token padding to n_seqs (#18536) | 3 weeks ago |
| Xuan-Son Nguyen | cd78e57c3a | lora: count lora nodes in graph_max_nodes (#18469) | 4 weeks ago |
| Johannes Gäßler | 026d2ad472 | llama: fix magic number of 999 for GPU layers (#18266) | 1 month ago |
| Johannes Gäßler | 147a521636 | tool/ex/tests: consistently free ctx, then model (#18168) | 1 month ago |
| Johannes Gäßler | b1f3a6e5db | llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653) | 1 month ago |
| Georgi Gerganov | 609a2d0268 | models : fix YaRN regression + consolidate logic (#18006) | 1 month ago |
| Jeff Bolz | 5266379bca | llama_context: synchronize before reallocating output buffer (#17974) | 1 month ago |
| Georgi Gerganov | 4dff236a52 | ggml : remove GGML_KQ_MASK_PAD constant (#17910) | 1 month ago |
| Piotr Wilkin (ilintar) | e4e9c4329c | Make graph_max_nodes vary by ubatch size (#17794) | 1 month ago |
| Diego Devesa | e072b2052e | ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276) | 2 months ago |
| Piotr Wilkin (ilintar) | ff55414c42 | model : Qwen3 Next (#16095) | 2 months ago |
| Daniel Bevenius | 134e6940ca | llama : skip output reordering for single token batches (#17466) | 2 months ago |
| Sigbjørn Skjæret | 9008027aa3 | hparams : add n_embd_inp() to support extended embed (#16928) | 2 months ago |
| Georgi Gerganov | 16bcc1259d | kv-cache : pad the cache size to 256 for performance (#17046) | 2 months ago |
| Johannes Gäßler | aa374175c3 | CUDA: fix crash on uneven context without FA (#16988) | 2 months ago |
| Georgi Gerganov | cd5e3b5754 | server : support unified cache across slots (#16736) | 2 months ago |
| Diego Devesa | 5a4ff43e7d | llama : disable pipeline parallelism if compute buffer allocation fails (#16748) | 3 months ago |
| takuya kodama | 7062dd8460 | llama-context: only warn on pooling_type when user specified (#16674) | 3 months ago |
| Saba Fallah | e08db42595 | model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (#16367) | 3 months ago |
| Johannes Gäßler | e789095502 | llama: print memory breakdown on exit (#15860) | 4 months ago |
| Sigbjørn Skjæret | b8e09f08b9 | model : add grok-2 support (#15539) | 4 months ago |
| Haiyue Wang | f4e664f838 | context : remove redundant explicit casting to the same type (#15948) | 4 months ago |
| Daniel Bevenius | 86587da03b | llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893) | 4 months ago |
| Georgi Gerganov | 663027fd54 | context : fix n_outputs during reserve (#15858) | 4 months ago |
| Daniel Bevenius | d1e2adba65 | llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (#15791) | 4 months ago |
| Diego Devesa | 274966226f | llama : fix fattn reserve call n_seqs parameter (#15699) | 5 months ago |
| Diego Devesa | 9777032dcc | llama : separate compute buffer reserve from fattn check (#15696) | 5 months ago |
| Johannes Gäßler | e81b8e4b7f | llama: use FA + max. GPU layers by default (#15434) | 5 months ago |
| Georgi Gerganov | 8a4280ce43 | kv-cache : remove LLAMA_SET_ROWS checks (#15505) | 5 months ago |