df845cc982 | David Friehs    | llama : minimize size used for state save/load (#4820)             | 2 years ago
15ebe59210 | Georgi Gerganov | convert : update phi-2 to latest HF repo (#4903)                   | 2 years ago
e7e4df031b | slaren          | llama : ggml-backend integration (#4766)                           | 2 years ago
584d674be6 | Georgi Gerganov | llama : remove redundant assert for StableLM (#4901)               | 2 years ago
3cabe80630 | Georgi Gerganov | llama : fix typo "imp_embd" -> "inp_embd"                          | 2 years ago
f445c0e68c | Georgi Gerganov | llama : fix llm_build_k_shift to use correct n_rot (#4889)         | 2 years ago
469e75d0a3 | Kawrakow        | llama : restore intended k-quants mixes for MoE models (#4872)     | 2 years ago
49662cbed3 | Kawrakow        | ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)                      | 2 years ago
43f76bf1c3 | pudepiedj       | main : print total token count and tokens consumed so far (#4874)  | 2 years ago
57d016ba2d | Brian           | llama : add additional suffixes for model params (#4834)           | 2 years ago
329ff61569 | Austin          | llama : recognize 1B phi models (#4847)                            | 2 years ago
dd5ae06405 | Kawrakow        | SOTA 2-bit quants (#4773)                                          | 2 years ago
b0034d93ce | Georgi Gerganov | examples : add passkey test (#3856)                                | 2 years ago
9dede37d81 | Georgi Gerganov | llama : remove unused vars (#4796)                                 | 2 years ago
3c36213df8 | Georgi Gerganov | llama : remove redundant GQA check (#4796)                         | 2 years ago
d117d4dc5d | Georgi Gerganov | llama : print tensor meta for debugging                            | 2 years ago
540938f890 | Georgi Gerganov | llama : llama_model_desc print number of experts                   | 2 years ago
0040d42eeb | Marcus Dunn     | llama : replace all API facing `int`'s with `int32_t` (#4577)      | 2 years ago
83e633c27e | postmasters     | llama : differentiate the KV dims in the attention (#4657)         | 2 years ago
24a447e20a | automaticcat    | ggml : add ggml_cpu_has_avx_vnni() (#4589)                         | 2 years ago
ea5497df5d | manikbhandari   | gpt2 : Add gpt2 architecture integration (#4555)                   | 2 years ago
f6793491b5 | Nam D. Tran     | llama : add AWQ for llama, llama2, mpt, and mistral models (#4593) | 2 years ago
dc68f0054c | slaren          | cuda : fix vmm pool with multi GPU (#4620)                         | 2 years ago
753be377b6 | Shintarou Okada | llama : add PLaMo model (#3557)                                    | 2 years ago
5bf3953d7e | slaren          | cuda : improve cuda pool efficiency using virtual memory (#4606)   | 2 years ago
708e179e85 | slaren          | fallback to CPU buffer if host buffer alloc fails (#4610)          | 2 years ago
48b7ff193e | slaren          | llama : fix platforms without mmap (#4578)                         | 2 years ago
c7e9701f86 | crasm           | llama : add ability to cancel model loading (#4462)                | 2 years ago
afefa319f1 | Georgi Gerganov | ggml : change ggml_scale to take a float instead of tensor (#4573) | 2 years ago
d232aca5a7 | slaren          | llama : initial ggml-backend integration (#4520)                   | 2 years ago