slaren          | 0d56246f4b | ggml : group all experts in a single ggml_mul_mat_id (#6505) | 1 year ago
Johannes Gäßler | b5e7285baf | CUDA: fix matrix multiplication logic for tests (#6667) | 1 year ago
Carolinabanana  | 5dc9dd7152 | llama : add Command R Plus support (#6491) | 1 year ago
Slava Primenko  | f77261a7c5 | ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020) | 1 year ago
slaren          | 08a0c02060 | ggml : mul_mat_id use the same tensor for all the experts (#6387) | 1 year ago
compilade       | 557410b8f0 | llama : greatly reduce output buffer memory usage (#6122) | 1 year ago
Kawrakow        | 55c1b2a3bb | IQ1_M: 1.75 bpw quantization (#6302) | 1 year ago
slaren          | ae1f211ce2 | cuda : refactor into multiple files (#6269) | 1 year ago
slaren          | 2f0e81e053 | cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208) | 1 year ago
slaren          | d0a71233fb | cuda : disable host register by default (#6206) | 1 year ago
slaren          | 03a8f8fafe | cuda : fix LLAMA_CUDA_F16 build (#6197) | 1 year ago
Kawrakow        | 76aa30a263 | Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (#6183) | 1 year ago
slaren          | 42e21c6882 | cuda : fix conflict with std::swap (#6186) | 1 year ago
slaren          | 1c51f98adc | cuda : print the returned error when CUDA initialization fails (#6185) | 1 year ago
slaren          | ccf58aa3ec | cuda : refactor to remove global resources (#6170) | 1 year ago
slaren          | 2bf8d0f7c4 | backend : offload large batches to GPU (#6083) | 1 year ago
slaren          | 3020327f6c | cuda : disable unused cudaLaunchHostFunc code (#6078) | 1 year ago
slaren          | f30ea47a87 | llama : add pipeline parallelism support (#6017) | 1 year ago
Georgi Gerganov | 8030da7afe | ggml : reuse quantum structs across backends (#5943) | 1 year ago
Kawrakow        | 44ca159faf | 1.5 bit: we can do even better (#5999) | 1 year ago
Kawrakow        | be858f6205 | Better 1.5 bit quantization (#5971) | 1 year ago
Georgi Gerganov | 8a3012a4ad | ggml : add ggml-common.h to deduplicate shared code (#5940) | 1 year ago
Michael Podvitskiy | 9fa2627347 | ggml : introduce ggml_status (ggml/750) | 1 year ago
leejet          | 7d43c585dc | add some new ops, fix some operators and add batch operations to certain operators. (ggml/747) | 1 year ago
slaren          | 67be2ce101 | cuda : fix data race in soft max (#5853) | 1 year ago
Kawrakow        | bbde6eb256 | ggml : IQ3_S improvements (#5829) | 1 year ago
UEXTM.com       | 5f70671856 | Introduce backend GUIDs (ggml/743) | 1 year ago
Kawrakow        | 7c4263d426 | ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760) | 1 year ago
Kawrakow        | 0becb22ac0 | IQ4_XS: a 4.25 bpw quantization (#5747) | 1 year ago
Engininja2      | c24a2a6e60 | cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744) | 1 year ago