| Author | Commit | Message | Date |
|---|---|---|---|
| Johannes Gäßler | d50f8897a7 | CUDA: stream-k decomposition for MMQ (#8018) | 1 year ago |
| Calvin Laurenson | 43b35e38ba | Add support for sqrt on CUDA (#7953) | 1 year ago |
| Johannes Gäßler | 76d66ee0be | CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921) | 1 year ago |
| slaren | f578b86b21 | move BLAS to a separate backend (#6210) | 1 year ago |
| Georgi Gerganov | a9cae48003 | tests : add non-cont unary tests (#7857) | 1 year ago |
| Johannes Gäßler | 42b53d192f | CUDA: revise q8_1 data layout for mul_mat_q (#7824) | 1 year ago |
| Johannes Gäßler | 7d1a378b8f | CUDA: refactor mmq, dmmv, mmvq (#7716) | 1 year ago |
| agray3 | b90dc566c1 | Allow number of nodes in CUDA graph to change (#7738) | 1 year ago |
| Johannes Gäßler | 9b596417af | CUDA: quantized KV support for FA vec (#7527) | 1 year ago |
| Georgi Gerganov | fb76ec31a9 | ggml : fix YARN + add tests + add asserts (#7617) | 1 year ago |
| Djip007 | 852aafb163 | update HIP_UMA #7399 (#7414) | 1 year ago |
| agray3 | 197c00681b | Allow multiple copy function pointers for CUDA graph kernel param updates (#7565) | 1 year ago |
| slaren | ab33f7a338 | cuda : clear error after buffer allocation failure (#7376) | 1 year ago |
| fraxy-v | f5bf761747 | Capture CUDA logging output (#7298) | 1 year ago |
| agray3 | dc020985b8 | Avoid unnecessarily disabling CUDA graphs (#7302) | 1 year ago |
| Johannes Gäßler | dc685be466 | CUDA: add FP32 FlashAttention vector kernel (#7188) | 1 year ago |
| Justina Cho | f5ef34e428 | feat: implemented sigmoid function (ggml/806) | 1 year ago |
| Georgi Gerganov | 9cb317f77e | ggml : full ALiBi support (#7192) | 1 year ago |
| agray3 | bc4bba364f | Introduction of CUDA Graphs to LLama.cpp (#6766) | 1 year ago |
| William Tambellini | 858f6b73f6 | Add an option to build without CUDA VMM (#7067) | 1 year ago |
| Georgi Gerganov | 9c67c2773d | ggml : add Flash Attention (#5021) | 1 year ago |
| slaren | 0d56246f4b | ggml : group all experts in a single ggml_mul_mat_id (#6505) | 1 year ago |
| Johannes Gäßler | b5e7285baf | CUDA: fix matrix multiplication logic for tests (#6667) | 1 year ago |
| Carolinabanana | 5dc9dd7152 | llama : add Command R Plus support (#6491) | 1 year ago |
| Slava Primenko | f77261a7c5 | ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020) | 1 year ago |
| slaren | 08a0c02060 | ggml : mul_mat_id use the same tensor for all the experts (#6387) | 1 year ago |
| compilade | 557410b8f0 | llama : greatly reduce output buffer memory usage (#6122) | 1 year ago |
| Kawrakow | 55c1b2a3bb | IQ1_M: 1.75 bpw quantization (#6302) | 1 year ago |
| slaren | ae1f211ce2 | cuda : refactor into multiple files (#6269) | 1 year ago |
| slaren | 2f0e81e053 | cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208) | 1 year ago |