# Experimental Project (Archived)

This is an experimental project and is no longer maintained. The implementation and algorithms are heavily inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp) and [vLLM](https://github.com/vllm-project/vllm).

# Makarna Engine

A high-performance LLM inference engine written in Go, optimized with SIMD (AVX2/AVX-512).

## Installation

Build with the Makefile:

```bash
make build
```

This produces binaries in `bin/`: `makarna`, `quantize`, and `convert`.

Build with CUDA support:

```bash
make build-cuda
```

This produces `bin/makarna-cuda`.

Alternatively, use `go install`:

```bash
go install ./cmd/...
```

## Commands

### convert

Converts HuggingFace models (`.safetensors`) to the `.mak` format.

```bash
convert <input> <output.mak> [flags]
```

Flags:

- `--quant <type>`: Quantization type. Options: `q2_k`, `q3_k`, `q4_k`, `q5_k`, `q6_k`, `q8_k`.
- `--mix`: Enable smart mix quantization.
- `--workers <n>`: Number of parallel workers.
- `--max-inflight-mb <mb>`: Memory limit during conversion.

### quantize

Quantizes an existing `.mak` file to a K-quant format.

```bash
quantize <input.mak> <output.mak> [flags]
```

Flags:

- `--mix`: Enable smart mix mode.

### run-model

Inference CLI.

```bash
run-model -model <path> -prompt "text" [flags]
```

Common flags:

- `-steps <n>`: Max tokens to generate (default 10).
- `-temp <t>`: Temperature (default 0.7).
- `-top-k <k>`: Top-K (default 40).
- `-top-p <p>`: Top-P (default 0.9).
- `-rep-penalty <x>`: Repetition penalty (default 1.1).
- `-chat`: Use chat formatting.
- `-threads <n>`: CPU threads (`-1` = 90% of cores).
- `-n-gpu-layers <n>`: Layers to offload to the GPU (`-1` = auto).
- `-gpu-budget <f>`: GPU memory fraction (0.0-1.0).
- `-mmap`: Use mmap for weights.
- `-profile-log <value>`: Profile output (`true`, `report`, or a file path).
- `-listen <addr>`: Start an OpenAI-compatible server on `<addr>`.

### openai

Dedicated OpenAI-compatible API server.

```bash
openai -model <path> [flags]
```

Flags:

- `-listen <addr>`: Listen address (default `:8080`).
- `-max-seq-len <n>`: Max context length.
- `-n-gpu-layers <n>`: Number of GPU layers.

## Quantization Types

MAK v2 supports K-quants (block size 256):

- `q8_k`: 8-bit
- `q6_k`: 6-bit
- `q5_k`: 5-bit
- `q4_k`: 4-bit (recommended)
- `q3_k`: 3-bit
- `q2_k`: 2-bit

## Examples

Convert and quantize:

```bash
convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix
```

Run inference:

```bash
run-model -model model-q4k.mak -prompt "Explain quantum physics" -steps 100
```

Start an API server:

```bash
run-model -model model-q4k.mak -listen :8080 -chat
```

## Development

Run tests:

```bash
go test ./...
go test -tags cuda ./...  # Requires a GPU
```

Run benchmarks:

```bash
go test -bench=. ./pkg/tensor/...
```
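
## Additional Examples

The GPU flags documented under `run-model` can be combined in one invocation. This sketch uses only flags listed above, with the model path from the earlier examples:

```bash
# Auto-select how many layers to offload (-n-gpu-layers -1),
# capping GPU memory use at 80% of the device (-gpu-budget 0.8).
run-model -model model-q4k.mak -prompt "Explain quantum physics" \
  -steps 100 -n-gpu-layers -1 -gpu-budget 0.8
```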
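Once a server is running (via `run-model -listen` or the `openai` command), it can be exercised with a standard OpenAI-style request. This is a minimal sketch: the `/v1/chat/completions` route and the request fields are assumed from the server being described as OpenAI-compatible, not confirmed by this README.

```bash
# Assumed endpoint: the standard OpenAI chat-completions route.
# A single-model server may ignore the "model" field; "makarna" is a placeholder.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "makarna",
    "messages": [{"role": "user", "content": "Explain quantum physics"}],
    "max_tokens": 100
  }'
```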