31ec3993f6 | slaren | ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (#8140) | 1 year ago
c7ab7b612c | slaren | make : fix missing -O3 (#8143) | 1 year ago
f2d48fffde | Georgi Gerganov | sync : ggml | 1 year ago
4713bf3093 | Georgi Gerganov | authors : regen | 1 year ago
0e814dfc42 | Georgi Gerganov | devops : remove clblast + LLAMA_CUDA -> GGML_CUDA (#8139) | 1 year ago
a95631ee97 | Georgi Gerganov | readme : update API notes | 1 year ago
f3f65429c4 | Georgi Gerganov | llama : reorganize source code + improve CMake (#8006) | 1 year ago
8854044561 | Isaac McFadyen | Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115) | 1 year ago
c8771ab5f8 | Johannes Gäßler | CUDA: fix misaligned shared memory read (#8123) | 1 year ago
494165f3b6 | Eddie-Wang | llama : extend llm_build_ffn() to support _scale tensors (#8103) | 1 year ago
9b2f16f805 | Olivier Chafik | `json`: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863) | 1 year ago
6777c544bd | Olivier Chafik | `json`: fix additionalProperties, allow space after enum/const (#7840) | 1 year ago
163d50adaf | jukofyork | fixes #7999 (adds control vectors to all `build_XXX()` functions in `llama.cpp` [needs testing]) (#8060) | 1 year ago
6fcbf68235 | fairydreaming | llama : implement Unigram tokenizer needed by T5 and FLAN-T5 model families (#5763) | 1 year ago
e6bf007744 | Daniel Bevenius | llama : return nullptr from llama_grammar_init (#8093) | 1 year ago
84631fe150 | Olivier Chafik | `json`: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797) | 1 year ago
dd047b476c | slaren | disable docker CI on pull requests (#8110) | 1 year ago
925c30956d | joecryptotoo | Add healthchecks to llama-server containers (#8081) | 1 year ago
c8ad35955a | Brian | Gguf dump start data offset via --data-offset and some extra refactor (#8054) | 1 year ago
49c03c79cd | Xuan Son Nguyen | cvector: better prompt handling, add "mean vector" method (#8069) | 1 year ago
48e6b92cc3 | Xuan Son Nguyen | Add chat template support for llama-cli (#8068) | 1 year ago
3791ad2193 | HanishKVC | SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950) | 1 year ago
f702a90e24 | HatsuneMikuUwU33 | Update control vector help (#8104) | 1 year ago
083bacce14 | Meng, Hengyu | [SYCL] Re-enabled mul_mat_batched_sycl (#8095) | 1 year ago
2df373ac40 | Johannes Gäßler | CUDA: fix matrix multiplication algorithm choice (#8102) | 1 year ago
3b099bcd9c | Johannes Gäßler | CUDA: fix MMQ writeback for int8 tensor cores (#8100) | 1 year ago
a818f3028d | Johannes Gäßler | CUDA: use MMQ instead of cuBLAS by default (#8075) | 1 year ago
d62e4aaa02 | fairydreaming | gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090) | 1 year ago
9a590c8226 | Johannes Gäßler | CUDA: optimize MMQ int8 tensor core performance (#8062) | 1 year ago
52fc8705a0 | Christian Zhou-Zheng | Option to split during conversion (#6942) | 1 year ago