High-performance LLM inference engine in Go, optimized with SIMD (AVX2/AVX512).

This is an experimental project and is no longer maintained. The implementation and algorithms are heavily inspired by llama.cpp and vLLM.
Build with Makefile:

```sh
make build
```

This produces binaries in bin/: makarna, quantize, convert.
Build with CUDA:

```sh
make build-cuda
```

This produces bin/makarna-cuda.
Alternatively, use `go install`:

```sh
go install ./cmd/...
```
Convert HuggingFace models (.safetensors) to the .mak format:

```sh
convert <hf_dir> <output.mak> [flags]
```

Flags:

- `--quant <type>`: Quantization type: q2_k, q3_k, q4_k, q5_k, q6_k, q8_k.
- `--mix`: Enable smart mix quantization.
- `--workers <n>`: Number of parallel workers.
- `--max-inflight-mb <n>`: Memory limit (in MB) during conversion.
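For example, a conversion run that caps memory use and parallelizes across workers might look like this (the model path and values are illustrative, not prescriptive):

```sh
# Convert a local HuggingFace checkout to q4_k with smart mix,
# 8 parallel workers, and at most 4 GB of tensor data in flight.
convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix --workers 8 --max-inflight-mb 4096
```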
Quantize an existing .mak file to a K-quant format:

```sh
quantize <input.mak> <output.mak> <type> [flags]
```

Flags:

- `--mix`: Enable smart mix mode.
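A typical invocation, assuming you already have an unquantized .mak file from `convert` (filenames here are illustrative):

```sh
# Requantize an existing .mak file down to q4_k with smart mix.
quantize model-f32.mak model-q4k.mak q4_k --mix
```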
Inference CLI:

```sh
run-model -model <file.mak> -prompt "text" [flags]
```

Common flags:

- `-steps <n>`: Max tokens to generate (default 10).
- `-temp <f>`: Temperature (default 0.7).
- `-top-k <n>`: Top-K (default 40).
- `-top-p <f>`: Top-P (default 0.9).
- `-rep-penalty <f>`: Repetition penalty (default 1.1).
- `-chat`: Use chat formatting.
- `-threads <n>`: CPU threads (-1 = 90% of cores).
- `-n-gpu-layers <n>`: Layers to offload to GPU (-1 = auto).
- `-gpu-budget <f>`: GPU memory fraction (0.0-1.0).
- `-mmap`: Use mmap for weights.
- `-profile-log <val>`: Profile output (true, report, or …).
- `-listen <addr>`: Start an OpenAI-compatible server on the given address.
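Putting a few of these together, a GPU-offloaded chat run with custom sampling could look like this (model file and parameter values are illustrative):

```sh
# Chat-formatted generation with mmap'd weights, automatic GPU
# layer offload, and relaxed sampling.
run-model -model model-q4k.mak -prompt "Explain quantum physics" -chat \
  -steps 256 -temp 0.8 -top-p 0.95 -n-gpu-layers -1 -mmap
```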
Dedicated OpenAI-compatible API server:

```sh
openai -model <file.mak> [flags]
```

Flags:

- `-listen <addr>`: Listen address (default :8080).
- `-max-seq-len <n>`: Max context length.
- `-n-gpu-layers <n>`: Number of GPU layers.
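Once the server is up, any OpenAI-style client should be able to talk to it. A minimal sketch with curl, assuming the standard `/v1/chat/completions` route (the exact route set is not documented here):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "model-q4k",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```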
MAK v2 supports K-quants (block size 256):

- q8_k: 8-bit.
- q6_k: 6-bit.
- q5_k: 5-bit.
- q4_k: 4-bit (recommended).
- q3_k: 3-bit.
- q2_k: 2-bit.
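As a rough rule of thumb, the nominal bit width maps directly onto weight size on disk. The sketch below estimates the raw weight size for a 1.7B-parameter model at each nominal bit width; it deliberately ignores the per-block scale metadata that K-quants store alongside the weights, so real files will be somewhat larger:

```go
package main

import "fmt"

func main() {
	const params = 1.7e9 // parameter count, e.g. a Qwen3-1.7B-class model

	quants := []struct {
		name string
		bits float64 // nominal bits per weight (excludes block metadata)
	}{
		{"q8_k", 8}, {"q6_k", 6}, {"q5_k", 5},
		{"q4_k", 4}, {"q3_k", 3}, {"q2_k", 2},
	}
	for _, q := range quants {
		gb := params * q.bits / 8 / 1e9
		fmt.Printf("%s: ~%.2f GB of weights\n", q.name, gb)
	}
}
```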
Convert and quantize:

```sh
convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix
```
Run inference:

```sh
run-model -model model-q4k.mak -prompt "Explain quantum physics" -steps 100
```
Start the API server:

```sh
run-model -model model-q4k.mak -listen :8080 -chat
```
Tests:

```sh
go test ./...
go test -tags cuda ./...   # requires a CUDA-capable GPU
```
Benchmarks:

```sh
go test -bench=. ./pkg/tensor/...
```