| Author | Commit | Message | Date |
|---|---|---|---|
| Kawrakow | 6769e944c7 | k-quants : support for super-block size of 64 (#2001) | 2 years ago |
| Alex Renda | b061ba9e2a | llama : fix top-p sampling to match the canonical definition (#1953) | 2 years ago |
| Didzis Gosko | 527b6fba1d | llama : make model stateless and context stateful (llama_state) (#1797) | 2 years ago |
| Ettore Di Giacinto | aacdbd4056 | llama : fix params struct slignment (#1936) | 2 years ago |
| l3utterfly | ba4e85a833 | llama : use aligned memory during ggml_init call from loading saved sessions (#1934) | 2 years ago |
| Kawrakow | cb40dfca69 | llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932) | 2 years ago |
| Johannes Gäßler | 16b9cd1939 | Convert vector to f16 for dequantize mul mat vec (#1913) | 2 years ago |
| Johannes Gäßler | b24c3049d9 | Added tokens per second to info prints (#1928) | 2 years ago |
| Johannes Gäßler | 0ede372a51 | Fixed incorrectly applying RMS norm twice (#1925) | 2 years ago |
| Kawrakow | 8ab8ba62eb | llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921) | 2 years ago |
| Georgi Gerganov | ce2c7d72e2 | metal : handle buffers larger than device's maxBufferLength (#1826) | 2 years ago |
| Georgi Gerganov | 051e1b0e6a | llama : fix kv_cache `n` init (close #1903) | 2 years ago |
| Howard Su | 3d59ec5935 | ggml : fix warnings under MSVC (#1908) | 2 years ago |
| Johannes Gäßler | ac3b886953 | llama : fix embd when offloading non-repeating layers (#1891) | 2 years ago |
| Borislav Stanimirov | 9cbf50c041 | build : fix and ignore MSVC warnings (#1889) | 2 years ago |
| Johannes Gäßler | 254a7a7a5f | CUDA full GPU acceleration, KV cache in VRAM (#1827) | 2 years ago |
| xaedes | e32089b2c2 | train : improved training-from-scratch example (#1652) | 2 years ago |
| Kerfuffle | 74d4cfa343 | Allow "quantizing" to f16 and f32 (#1787) | 2 years ago |
| Kawrakow | 74a6d922f1 | Metal implementation for all k_quants (#1807) | 2 years ago |
| Howard Su | 58970a4c39 | Leverage mmap for offloading tensors to GPU (#1597) | 2 years ago |
| Kerfuffle | 4f0154b0ba | llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691) | 2 years ago |
| Robert Sung-wook Shin | 98ed165574 | OpenCL: Add release memory (#1741) | 2 years ago |
| Georgi Gerganov | 2d7bf110ed | llama : fix vram_scratch var | 2 years ago |
| Georgi Gerganov | 2a4e41a086 | llama : fix compile warnings | 2 years ago |
| Johannes Gäßler | 17366df842 | Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703) | 2 years ago |
| Georgi Gerganov | 44f906e853 | metal : add f16 support | 2 years ago |
| Georgi Gerganov | 7a74dee6b4 | llama : temporary disable Q6_K output quantization (#1711) | 2 years ago |
| Spencer Sutton | 590250f7a9 | metal : add checks for buffer size (#1706) | 2 years ago |
| mgroeber9110 | c2df36d60d | llama : consistently catch and throw only exceptions deriving from std::exception (#1599) | 2 years ago |
| kiltyj | 9d0693bce3 | metal : use shared buffers between CPU and GPU (#1696) | 2 years ago |