Experimental Project (Archived)

This is an experimental project and is no longer maintained.

The implementation and algorithms are heavily inspired by llama.cpp and vLLM.

Makarna Engine

High-performance LLM inference engine in Go, optimized with SIMD (AVX2/AVX512).

Installation

Build with the Makefile:

make build

This produces binaries in bin/: makarna, quantize, convert.

Build with CUDA:

make build-cuda

Produces bin/makarna-cuda.

Alternatively, use go install:

go install ./cmd/...

Commands

convert

Convert HuggingFace models (.safetensors) to .mak format.

convert <hf_dir> <output.mak> [flags]

Flags:

  • --quant <type> Options: q2_k, q3_k, q4_k, q5_k, q6_k, q8_k.
  • --mix Enable smart mix quantization.
  • --workers <n> Number of parallel workers.
  • --max-inflight-mb <n> Memory limit during conversion.
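
For instance, a conversion with explicit worker and memory limits (the model directory and values below are placeholders, not recommended settings):

convert /models/my-model model-q4k.mak --quant q4_k --workers 8 --max-inflight-mb 4096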

quantize

Quantize an existing .mak file to a K-quant format.

quantize <input.mak> <output.mak> <type> [flags]

Flags:

  • --mix Enable smart mix mode.
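
For example, re-quantizing an 8-bit file down to q4_k with smart mix enabled (file names are placeholders):

quantize model-q8k.mak model-q4k.mak q4_k --mix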

run-model

Inference CLI.

run-model -model <file.mak> -prompt "text" [flags]

Common Flags:

  • -steps <n> Max tokens (default 10).
  • -temp <f> Temperature (default 0.7).
  • -top-k <n> Top-K (default 40).
  • -top-p <f> Top-P (default 0.9).
  • -rep-penalty <f> Repetition penalty (default 1.1).
  • -chat Use chat formatting.
  • -threads <n> CPU threads (-1 = 90% of cores).
  • -n-gpu-layers <n> Layers to offload to GPU (-1=auto).
  • -gpu-budget <f> GPU memory fraction (0.0-1.0).
  • -mmap Use mmap for weights.
  • -profile-log <val> Profile output (true, report, or ).
  • -listen <addr> Start an OpenAI-compatible server on <addr>.
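
For example, a run that offloads layers to the GPU (the prompt and values here are illustrative):

run-model -model model-q4k.mak -prompt "Hello" -steps 64 -n-gpu-layers -1 -gpu-budget 0.9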
openai

Dedicated OpenAI-compatible API server.

openai -model <file.mak> [flags]

Flags:

  • -listen <addr> Default is :8080.
  • -max-seq-len <n> Max context length.
  • -n-gpu-layers <n> Number of GPU layers.
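
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with curl, assuming the standard OpenAI /v1/chat/completions route (the model name is a placeholder):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-q4k", "messages": [{"role": "user", "content": "Hello"}]}'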

Quantization Types

MAK v2 supports K-quants (block size 256):

  • q8_k: 8-bit.
  • q6_k: 6-bit.
  • q5_k: 5-bit.
  • q4_k: 4-bit (recommended).
  • q3_k: 3-bit.
  • q2_k: 2-bit.
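
To give a rough sense of how block-based quantization works, here is a minimal Go sketch of a generic 8-bit absmax scheme over 256-element blocks. This is an illustration only, not the actual MAK/K-quant layout, which uses additional per-sub-block scales:

package main

import "fmt"

const blockSize = 256

// block8 is a toy quantized block: one float32 scale shared by 256 weights.
type block8 struct {
	scale float32
	qs    [blockSize]int8
}

// quantize maps 256 float32 weights to int8 using absmax scaling,
// so the largest-magnitude weight maps to +/-127.
func quantize(w [blockSize]float32) block8 {
	var max float32
	for _, v := range w {
		if v < 0 {
			v = -v
		}
		if v > max {
			max = v
		}
	}
	b := block8{scale: max / 127}
	if max == 0 {
		return b // all-zero block: scale 0, all codes 0
	}
	for i, v := range w {
		b.qs[i] = int8(v / b.scale) // truncates; real code would round
	}
	return b
}

// dequantize recovers approximate weights: w ≈ scale * q.
func dequantize(b block8) [blockSize]float32 {
	var w [blockSize]float32
	for i, q := range b.qs {
		w[i] = float32(q) * b.scale
	}
	return w
}

func main() {
	var w [blockSize]float32
	for i := range w {
		w[i] = float32(i%16) * 0.01
	}
	b := quantize(w)
	d := dequantize(b)
	fmt.Printf("scale=%.5f first=%v\n", b.scale, d[:4])
}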

Examples

Convert and quantize:

convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix

Run inference:

run-model -model model-q4k.mak -prompt "Explaining quantum physics" -steps 100

Start API server:

run-model -model model-q4k.mak -listen :8080 -chat

Development

Tests:

go test ./...
go test -tags cuda ./... # Requires GPU

Benchmarks:

go test -bench=. ./pkg/tensor/...