Experimental Project (Archived)

This is an experimental project and is no longer maintained.

The implementation and algorithms are heavily inspired by llama.cpp and vLLM.

Makarna Engine

A high-performance LLM inference engine written in Go, optimized with SIMD (AVX2/AVX-512).

Installation

Build with Makefile:

make build

This produces binaries in bin/: makarna, quantize, convert.

Build with CUDA:

make build-cuda

Produces bin/makarna-cuda.

Alternatively, use Go install:

go install ./cmd/...

Commands

convert

Convert HuggingFace models (.safetensors) to .mak format.

convert <hf_dir> <output.mak> [flags]

Flags:

  • --quant <type> Quantization type: q2_k, q3_k, q4_k, q5_k, q6_k, q8_k.
  • --mix Enable smart mix quantization.
  • --workers <n> Number of parallel workers.
  • --max-inflight-mb <n> Memory limit in MB during conversion.
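
For instance, to convert with bounded memory and explicit parallelism (the model path is illustrative):

convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --workers 8 --max-inflight-mb 4096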

quantize

Quantize an existing .mak file to a K-quant format.

quantize <input.mak> <output.mak> <type> [flags]

Flags:

  • --mix Enable smart mix mode.
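
For instance, re-quantizing an existing file to 4-bit with smart mix (file names are illustrative):

quantize model-f16.mak model-q4k.mak q4_k --mix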

run-model

Inference CLI.

run-model -model <file.mak> -prompt "text" [flags]

Common Flags:

  • -steps <n> Max tokens to generate (default 10).
  • -temp <f> Sampling temperature (default 0.7).
  • -top-k <n> Top-K sampling (default 40).
  • -top-p <f> Top-P (nucleus) sampling (default 0.9).
  • -rep-penalty <f> Repetition penalty (default 1.1). The sketch after this list shows what these sampling flags control.
  • -chat Use chat formatting.
  • -threads <n> CPU threads (-1 = 90% of cores).
  • -n-gpu-layers <n> Layers to offload to GPU (-1 = auto).
  • -gpu-budget <f> GPU memory fraction (0.0-1.0).
  • -mmap Use mmap for weights.
  • -profile-log <val> Profile output (true, report, or …).
  • -listen <addr> Start an OpenAI-compatible server on <addr>.
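
To make the sampling flags concrete, here is a minimal Go sketch of the standard temperature / top-K / top-P / repetition-penalty pipeline they control. It illustrates the common algorithms and is not Makarna's actual sampler.

// samplesketch.go — illustrative only: shows the effect of run-model's
// -temp, -top-k, -top-p, and -rep-penalty flags. NOT Makarna's real code.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// sample picks the next token id from raw logits.
func sample(logits []float64, prev []int, temp, topP, repPenalty float64, topK int) int {
	// Repetition penalty: dampen logits of tokens already generated.
	for _, id := range prev {
		if logits[id] > 0 {
			logits[id] /= repPenalty
		} else {
			logits[id] *= repPenalty
		}
	}
	// Temperature: <1 sharpens the distribution, >1 flattens it.
	for i := range logits {
		logits[i] /= temp
	}
	// Sort candidate ids by logit, descending.
	ids := make([]int, len(logits))
	for i := range ids {
		ids[i] = i
	}
	sort.Slice(ids, func(a, b int) bool { return logits[ids[a]] > logits[ids[b]] })
	// Top-K: keep only the K most likely tokens.
	if topK > 0 && topK < len(ids) {
		ids = ids[:topK]
	}
	// Softmax over the surviving candidates (subtract max for stability).
	probs := make([]float64, len(ids))
	var sum float64
	for i, id := range ids {
		probs[i] = math.Exp(logits[id] - logits[ids[0]])
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	// Top-P (nucleus): keep the smallest prefix whose mass reaches topP.
	var cum float64
	cut := len(probs)
	for i, p := range probs {
		cum += p
		if cum >= topP {
			cut = i + 1
			break
		}
	}
	ids, probs = ids[:cut], probs[:cut]
	// Draw from the truncated distribution.
	r := rand.Float64() * cum
	for i, p := range probs {
		r -= p
		if r <= 0 {
			return ids[i]
		}
	}
	return ids[len(ids)-1]
}

func main() {
	logits := []float64{2.0, 1.5, 0.3, -1.0}
	fmt.Println("next token:", sample(logits, []int{0}, 0.7, 0.9, 1.1, 40))
}

Lower -temp pushes the draw toward greedy decoding, while -top-k and -top-p both truncate the candidate set before sampling.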

openai

Dedicated OpenAI-compatible API server.

openai -model <file.mak> [flags]

Flags:

  • -listen <addr> Default is :8080.
  • -max-seq-len <n> Max context length.
  • -n-gpu-layers <n> Number of GPU layers.
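
A quick way to smoke-test the server from Go, assuming it exposes the standard OpenAI chat completions route at /v1/chat/completions. The route, JSON fields, and model name below are assumptions based on "OpenAI-compatible" and are not confirmed by this README.

// client.go — minimal sketch of calling the server; routes and fields
// are assumed from the OpenAI API convention, so check the server code.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "model-q4k", // model name is illustrative
		"messages": []map[string]string{
			{"role": "user", "content": "Explain quantum physics briefly."},
		},
	})
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out)
}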

Quantization Types

MAK v2 supports K-quants (block size 256):

  • q8_k: 8-bit.
  • q6_k: 6-bit.
  • q5_k: 5-bit.
  • q4_k: 4-bit (recommended).
  • q3_k: 3-bit.
  • q2_k: 2-bit.
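
As a rough guide to what these widths mean for file size, the sketch below estimates weight storage from the nominal bits per weight. Real K-quant blocks of 256 weights also carry per-block scales, so actual .mak files run somewhat larger; the parameter count is illustrative.

// quantsize.go — back-of-envelope size estimate for a quantized model,
// using only the nominal bit widths above (real blocks add scale overhead).
package main

import "fmt"

func main() {
	const params = 1.7e9 // parameter count is illustrative (a 1.7B model)
	types := []struct {
		name string
		bits float64
	}{
		{"q8_k", 8}, {"q6_k", 6}, {"q5_k", 5}, {"q4_k", 4}, {"q3_k", 3}, {"q2_k", 2},
	}
	for _, t := range types {
		gb := params * t.bits / 8 / 1e9
		fmt.Printf("%s: ~%.2f GB of weights (plus per-block scale overhead)\n", t.name, gb)
	}
}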

Examples

Convert and quantize:

convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix

Run inference:

run-model -model model-q4k.mak -prompt "Explain quantum physics" -steps 100

Start API server:

run-model -model model-q4k.mak -listen :8080 -chat

Development

Tests:

go test ./...
go test -tags cuda ./...  # Requires GPU

Benchmarks:

go test -bench=. ./pkg/tensor/...