Howard Su
|
64cc19b4fe
Fix the validation of main device (#1872)
|
2 years ago |
Johannes Gäßler
|
254a7a7a5f
CUDA full GPU acceleration, KV cache in VRAM (#1827)
|
2 years ago |
Howard Su
|
58970a4c39
Leverage mmap for offloading tensors to GPU (#1597)
|
2 years ago |
Kyle Liang
|
12b063f0ec
Fixed WSL cuda's OOM error (#1594)
|
2 years ago |
Johannes Gäßler
|
ae9663f188
Windows nvcc workaround (#1753)
|
2 years ago |
Georgi Gerganov
|
5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
|
2 years ago |
Johannes Gäßler
|
17366df842
Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)
|
2 years ago |
Kawrakow
|
99009e72f8
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
|
2 years ago |
Johannes Gäßler
|
1fcdcc28b1
cuda : performance optimizations (#1530)
|
2 years ago |
Johannes Gäßler
|
affc76edfd
cuda : loading models directly into VRAM, norm calculation on GPU, broadcasting for ggml_mul (#1483)
|
2 years ago |
Georgi Gerganov
|
2d5db48371
ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)
|
2 years ago |
Johannes Gäßler
|
eb363627fd
cuda : deduplicated dequantization code (#1453)
|
2 years ago |
Georgi Gerganov
|
08737ef720
cuda : fix convert function (#1412)
|
2 years ago |
Johannes Gäßler
|
905d87b70a
ggml : GPU-accelerated token generation (#1412)
|
2 years ago |
Georgi Gerganov
|
b9fd7eee57
ggml : remove bit shuffling (#1405)
|
2 years ago |
Johannes Gäßler
|
1f48b0abcf
Documented CUDA reproducibility, added warning (#1346)
|
2 years ago |
slaren
|
58b367c2d7
cuBLAS: refactor and optimize f16 mat mul performance (#1259)
|
2 years ago |
slaren
|
b925f1f1b0
cuBLAS: fall back to pageable memory if pinned alloc fails (#1233)
|
2 years ago |
slaren
|
7fc50c051a
cuBLAS: use host pinned memory and dequantize while copying (#1207)
|
2 years ago |
Henri Vasserman
|
b1ee8f59b4
cuBLAS: non-contiguous tensor support (#1215)
|
2 years ago |
Stephan Walter
|
36d19a603b
Remove Q4_3 which is no better than Q5 (#1218)
|
2 years ago |
Georgi Gerganov
|
574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
|
2 years ago |
Georgi Gerganov
|
7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
|
2 years ago |
slaren
|
50cb666b8a
Improve cuBLAS performance by using a memory pool (#1094)
|
2 years ago |
slaren
|
2005469ea1
Add Q4_3 support to cuBLAS (#1086)
|
2 years ago |
slaren
|
02d6988121
Improve cuBLAS performance by dequantizing on the GPU (#1065)
|
2 years ago |