| Author | Commit | Message | Date |
|---|---|---|---|
| Johannes Gäßler | e9fd8dcab4 | llama-fit-params: keep explicit --ctx-size 0 (#19070) | 5 days ago |
| ddh0 | 13f1e4a9ca | llama : add adaptive-p sampler (#17927) | 2 weeks ago |
| Georgi Gerganov | 39173bcacb | context : reserve new scheduler when graph topology changes (#18547) | 2 weeks ago |
| Xuan-Son Nguyen | a7e6ddb8bd | lora: make sure model keep track of associated adapters (#18490) | 2 weeks ago |
| Georgi Gerganov | f5f8812f7c | server : use different seeds for child completions (#18700) | 2 weeks ago |
| Johannes Gäßler | 64848deb18 | llama-fit-params: free memory target per device (#18679) | 3 weeks ago |
| Julius Tischbein | 2038101bd9 | llama : add `use_direct_io` flag for model loading (#18166) | 3 weeks ago |
| Tarek Dakhran | 73d284a250 | model : add LFM2-ColBert-350M (#18607) | 3 weeks ago |
| Daniel Bevenius | d3dce4e0a5 | sampling : add support for backend sampling (#17004) | 3 weeks ago |
| Xuan-Son Nguyen | cd78e57c3a | lora: count lora nodes in graph_max_nodes (#18469) | 1 month ago |
| Johannes Gäßler | 026d2ad472 | llama: fix magic number of 999 for GPU layers (#18266) | 1 month ago |
| Johannes Gäßler | a52dc60ba3 | llama_fit_params: return enum for fail vs. error (#18374) | 1 month ago |
| Johannes Gäßler | b1f3a6e5db | llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653) | 1 month ago |
| Aaron Teo | 877566d512 | llama: introduce support for model-embedded sampling parameters (#17120) | 2 months ago |
| Sigbjørn Skjæret | 9008027aa3 | hparams : add n_embd_inp() to support extended embed (#16928) | 2 months ago |
| Georgi Gerganov | 16bcc1259d | kv-cache : pad the cache size to 256 for performance (#17046) | 2 months ago |
| Georgi Gerganov | cd5e3b5754 | server : support unified cache across slots (#16736) | 2 months ago |
| Adrian Lundberg | 76af40aaaa | docs: remove llama_sampler_accept reference in sampling sample usage (#16920) | 2 months ago |
| JJJYmmm | d261223d24 | model: add support for qwen3vl series (#16780) | 3 months ago |
| Gadflyii | 3df2244df4 | llama : add --no-host to disable host buffers (#16310) | 3 months ago |
| ddh0 | f6dcda3900 | server : context checkpointing for hybrid and recurrent models (#16382) | 3 months ago |
| Johannes Gäßler | e789095502 | llama: print memory breakdown on exit (#15860) | 4 months ago |
| Gabe Goodhart | fd621880f3 | aLoRA Support (#15327) | 4 months ago |
| Georgi Gerganov | e92d53b29e | sampling : optimize samplers by reusing bucket sort (#15665) | 5 months ago |
| Johannes Gäßler | e81b8e4b7f | llama: use FA + max. GPU layers by default (#15434) | 5 months ago |
| Sigbjørn Skjæret | 84ab83cc0b | model : jina-embeddings-v3 support (#13693) | 5 months ago |
| Georgi Gerganov | 9ebebef62f | llama : remove KV cache defragmentation logic (#15473) | 5 months ago |
| Georgi Gerganov | cd36b5e5c7 | llama : remove deprecated llama_kv_self API (#15472) | 5 months ago |
| Georgi Gerganov | 715a6db02c | kv-cache : drop the "unified" prefix (#15467) | 5 months ago |
| Georgi Gerganov | d32e03f449 | server : add SWA checkpoints (#15293) | 5 months ago |