Experimental Project (Archived)

This is an experimental project and is no longer maintained.

The implementation and algorithms are heavily inspired by llama.cpp and vLLM.

Makarna Engine

A high-performance LLM inference engine written in Go, optimized with SIMD (AVX2/AVX-512).

Installation

Build with Makefile:

make build

This produces binaries in bin/: makarna, quantize, convert.

Build with CUDA:

make build-cuda

Produces bin/makarna-cuda.

Alternatively, use Go install:

go install ./cmd/...

Commands

convert

Convert HuggingFace models (.safetensors) to .mak format.

convert <hf_dir> <output.mak> [flags]

Flags:

  • --quant <type> Quantization type: q2_k, q3_k, q4_k, q5_k, q6_k, q8_k.
  • --mix Enable smart mix quantization.
  • --workers <n> Number of parallel workers.
  • --max-inflight-mb <n> Memory limit in MB during conversion.
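
For instance, to convert with bounded memory and explicit parallelism (the model path is illustrative):

convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --workers 8 --max-inflight-mb 4096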

quantize

Quantize an existing .mak file to a K-quant format.

quantize <input.mak> <output.mak> <type> [flags]

Flags:

  • --mix Enable smart mix mode.
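
For instance, re-quantizing an existing file to 4-bit with smart mix (file names are illustrative):

quantize model-f16.mak model-q4k.mak q4_k --mix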

run-model

Inference CLI.

run-model -model <file.mak> -prompt "text" [flags]

Common Flags:

  • -steps <n> Max tokens to generate (default 10).
  • -temp <f> Sampling temperature (default 0.7).
  • -top-k <n> Top-K sampling (default 40).
  • -top-p <f> Top-P (nucleus) sampling (default 0.9).
  • -rep-penalty <f> Repetition penalty (default 1.1). The sketch after this list shows what these sampling flags control.
  • -chat Use chat formatting.
  • -threads <n> CPU threads (-1 = 90% of cores).
  • -n-gpu-layers <n> Layers to offload to GPU (-1 = auto).
  • -gpu-budget <f> GPU memory fraction (0.0-1.0).
  • -mmap Use mmap for weights.
  • -profile-log <val> Profile output (true, report, or …).
  • -listen <addr> Start an OpenAI-compatible server on <addr>.
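
To make the sampling flags concrete, here is a minimal Go sketch of the standard temperature / top-K / top-P / repetition-penalty pipeline they control. It illustrates the common algorithms and is not Makarna's actual sampler.

// samplesketch.go — illustrative only: shows the effect of run-model's
// -temp, -top-k, -top-p, and -rep-penalty flags. NOT Makarna's real code.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// sample picks the next token id from raw logits.
func sample(logits []float64, prev []int, temp, topP, repPenalty float64, topK int) int {
	// Repetition penalty: dampen logits of tokens already generated.
	for _, id := range prev {
		if logits[id] > 0 {
			logits[id] /= repPenalty
		} else {
			logits[id] *= repPenalty
		}
	}
	// Temperature: <1 sharpens the distribution, >1 flattens it.
	for i := range logits {
		logits[i] /= temp
	}
	// Sort candidate ids by logit, descending.
	ids := make([]int, len(logits))
	for i := range ids {
		ids[i] = i
	}
	sort.Slice(ids, func(a, b int) bool { return logits[ids[a]] > logits[ids[b]] })
	// Top-K: keep only the K most likely tokens.
	if topK > 0 && topK < len(ids) {
		ids = ids[:topK]
	}
	// Softmax over the surviving candidates (subtract max for stability).
	probs := make([]float64, len(ids))
	var sum float64
	for i, id := range ids {
		probs[i] = math.Exp(logits[id] - logits[ids[0]])
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	// Top-P (nucleus): keep the smallest prefix whose mass reaches topP.
	var cum float64
	cut := len(probs)
	for i, p := range probs {
		cum += p
		if cum >= topP {
			cut = i + 1
			break
		}
	}
	ids, probs = ids[:cut], probs[:cut]
	// Draw from the truncated distribution.
	r := rand.Float64() * cum
	for i, p := range probs {
		r -= p
		if r <= 0 {
			return ids[i]
		}
	}
	return ids[len(ids)-1]
}

func main() {
	logits := []float64{2.0, 1.5, 0.3, -1.0}
	fmt.Println("next token:", sample(logits, []int{0}, 0.7, 0.9, 1.1, 40))
}

Lower -temp pushes the draw toward greedy decoding, while -top-k and -top-p both truncate the candidate set before sampling.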

openai

Dedicated OpenAI-compatible API server.

openai -model <file.mak> [flags]

Flags:

  • -listen <addr> Default is :8080.
  • -max-seq-len <n> Max context length.
  • -n-gpu-layers <n> Number of GPU layers.
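
A quick way to smoke-test the server from Go, assuming it exposes the standard OpenAI chat completions route at /v1/chat/completions. The route, JSON fields, and model name below are assumptions based on "OpenAI-compatible" and are not confirmed by this README.

// client.go — minimal sketch of calling the server; routes and fields
// are assumed from the OpenAI API convention, so check the server code.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "model-q4k", // model name is illustrative
		"messages": []map[string]string{
			{"role": "user", "content": "Explain quantum physics briefly."},
		},
	})
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out)
}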

Quantization Types

MAK v2 supports K-quants (block size 256):

  • q8_k: 8-bit.
  • q6_k: 6-bit.
  • q5_k: 5-bit.
  • q4_k: 4-bit (recommended).
  • q3_k: 3-bit.
  • q2_k: 2-bit.
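
As a rough guide to what these widths mean for file size, the sketch below estimates weight storage from the nominal bits per weight. Real K-quant blocks of 256 weights also carry per-block scales, so actual .mak files run somewhat larger; the parameter count is illustrative.

// quantsize.go — back-of-envelope size estimate for a quantized model,
// using only the nominal bit widths above (real blocks add scale overhead).
package main

import "fmt"

func main() {
	const params = 1.7e9 // parameter count is illustrative (a 1.7B model)
	types := []struct {
		name string
		bits float64
	}{
		{"q8_k", 8}, {"q6_k", 6}, {"q5_k", 5}, {"q4_k", 4}, {"q3_k", 3}, {"q2_k", 2},
	}
	for _, t := range types {
		gb := params * t.bits / 8 / 1e9
		fmt.Printf("%s: ~%.2f GB of weights (plus per-block scale overhead)\n", t.name, gb)
	}
}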

Examples

Convert and quantize:

convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix

Run inference:

run-model -model model-q4k.mak -prompt "Explain quantum physics" -steps 100

Start API server:

run-model -model model-q4k.mak -listen :8080 -chat

Development

Tests:

go test ./...
go test -tags cuda ./...  # Requires GPU

Benchmarks:

go test -bench=. ./pkg/tensor/...