Author | Commit | Message | Date
Kawrakow | 99009e72f8 | ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684) | 2 years ago
Johannes Gäßler | 1fcdcc28b1 | cuda : performance optimizations (#1530) | 2 years ago
Johannes Gäßler | affc76edfd | cuda : loading models directly into VRAM, norm calculation on GPU, broadcasting for ggml_mul (#1483) | 2 years ago
Georgi Gerganov | 2d5db48371 | ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508) | 2 years ago
Johannes Gäßler | eb363627fd | cuda : deduplicated dequantization code (#1453) | 2 years ago
Georgi Gerganov | 08737ef720 | cuda : fix convert function (#1412) | 2 years ago
Johannes Gäßler | 905d87b70a | ggml : GPU-accelerated token generation (#1412) | 2 years ago
Georgi Gerganov | b9fd7eee57 | ggml : remove bit shuffling (#1405) | 2 years ago
Johannes Gäßler | 1f48b0abcf | Documented CUDA reproducibility, added warning (#1346) | 2 years ago
slaren | 58b367c2d7 | cuBLAS: refactor and optimize f16 mat mul performance (#1259) | 2 years ago
slaren | b925f1f1b0 | cuBLAS: fall back to pageable memory if pinned alloc fails (#1233) | 2 years ago
slaren | 7fc50c051a | cuBLAS: use host pinned memory and dequantize while copying (#1207) | 2 years ago
Henri Vasserman | b1ee8f59b4 | cuBLAS: non-contiguous tensor support (#1215) | 2 years ago
Stephan Walter | 36d19a603b | Remove Q4_3 which is no better than Q5 (#1218) | 2 years ago
Georgi Gerganov | 574406dc7e | ggml : add Q5_0 and Q5_1 quantization (#1187) | 2 years ago
Georgi Gerganov | 7a32fcb3b2 | ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179) | 2 years ago
slaren | 50cb666b8a | Improve cuBLAS performance by using a memory pool (#1094) | 2 years ago
slaren | 2005469ea1 | Add Q4_3 support to cuBLAS (#1086) | 2 years ago
slaren | 02d6988121 | Improve cuBLAS performance by dequantizing on the GPU (#1065) | 2 years ago