# Experimental Project (Archived)

This is an experimental project and is no longer maintained. The implementation and algorithms are heavily inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp) and [vLLM](https://github.com/vllm-project/vllm).

# Makarna Engine

A high-performance LLM inference engine written in Go, optimized with SIMD (AVX2/AVX-512).

## Installation

Build with the Makefile:

```bash
make build
```

This produces binaries in `bin/`: `makarna`, `quantize`, and `convert`.

Build with CUDA support:

```bash
make build-cuda
```

This produces `bin/makarna-cuda`.

Alternatively, use `go install`:

```bash
go install ./cmd/...
```

## Commands

### convert

Converts HuggingFace models (`.safetensors`) to the `.mak` format.

```bash
convert <input> <output.mak> [flags]
```

Flags:

- `--quant <type>`: Quantization type. Options: `q2_k`, `q3_k`, `q4_k`, `q5_k`, `q6_k`, `q8_k`.
- `--mix`: Enable smart mix quantization.
- `--workers <n>`: Number of parallel workers.
- `--max-inflight-mb <mb>`: Memory limit during conversion.

### quantize

Quantizes an existing `.mak` file to a K-quant format.

```bash
quantize <input.mak> <output.mak> [flags]
```

Flags:

- `--mix`: Enable smart mix mode.

### run-model

Inference CLI.

```bash
run-model -model <path> -prompt "text" [flags]
```

Common flags:

- `-steps <n>`: Max tokens to generate (default 10).
- `-temp <t>`: Temperature (default 0.7).
- `-top-k <k>`: Top-K (default 40).
- `-top-p <p>`: Top-P (default 0.9).
- `-rep-penalty <x>`: Repetition penalty (default 1.1).
- `-chat`: Use chat formatting.
- `-threads <n>`: CPU threads (`-1` = 90% of cores).
- `-n-gpu-layers <n>`: Layers to offload to the GPU (`-1` = auto).
- `-gpu-budget <f>`: GPU memory fraction (0.0-1.0).
- `-mmap`: Use mmap for weights.
- `-profile-log <value>`: Profile output (`true`, `report`, or a file path).
- `-listen <addr>`: Start an OpenAI-compatible server on `<addr>`.

### openai

Dedicated OpenAI-compatible API server.

```bash
openai -model <path> [flags]
```

Flags:

- `-listen <addr>`: Listen address (default `:8080`).
- `-max-seq-len <n>`: Max context length.
- `-n-gpu-layers <n>`: Number of GPU layers.

## Quantization Types

MAK v2 supports K-quants (block size 256):

- `q8_k`: 8-bit
- `q6_k`: 6-bit
- `q5_k`: 5-bit
- `q4_k`: 4-bit (recommended)
- `q3_k`: 3-bit
- `q2_k`: 2-bit

## Examples

Convert and quantize:

```bash
convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix
```

Run inference:

```bash
run-model -model model-q4k.mak -prompt "Explain quantum physics" -steps 100
```

Start an API server:

```bash
run-model -model model-q4k.mak -listen :8080 -chat
```

## Development

Run tests:

```bash
go test ./...
go test -tags cuda ./...  # Requires a GPU
```

Run benchmarks:

```bash
go test -bench=. ./pkg/tensor/...
```
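
## Additional Examples

The GPU flags documented under `run-model` can be combined in one invocation. This sketch uses only flags listed above, with the model path from the earlier examples:

```bash
# Auto-select how many layers to offload (-n-gpu-layers -1),
# capping GPU memory use at 80% of the device (-gpu-budget 0.8).
run-model -model model-q4k.mak -prompt "Explain quantum physics" \
  -steps 100 -n-gpu-layers -1 -gpu-budget 0.8
```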
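Once a server is running (via `run-model -listen` or the `openai` command), it can be exercised with a standard OpenAI-style request. This is a minimal sketch: the `/v1/chat/completions` route and the request fields are assumed from the server being described as OpenAI-compatible, not confirmed by this README.

```bash
# Assumed endpoint: the standard OpenAI chat-completions route.
# A single-model server may ignore the "model" field; "makarna" is a placeholder.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "makarna",
    "messages": [{"role": "user", "content": "Explain quantum physics"}],
    "max_tokens": 100
  }'
```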