llama.cpp is a large-scale C/C++ project for efficient LLM (Large Language Model) inference with minimal setup and dependencies. The project enables running language models on diverse hardware with state-of-the-art performance.
Key Facts:
- Main library (libllama) and 40+ executable tools/examples
- ggml tensor library (in the ggml/ directory)

To build, ALWAYS run these commands in sequence:
cmake -B build
cmake --build build --config Release -j $(nproc)
Build time: ~10 minutes on a 4-core system with ccache enabled, ~25 minutes without ccache.
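If ccache is installed but the build does not appear to use it, the standard CMake compiler-launcher variables can force it (these are generic CMake settings, not project-specific flags):
# Route compiler invocations through ccache
cmake -B build -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build build --config Release -j $(nproc)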
Important Notes:
- Built binaries are placed in build/bin/
- Parallel builds (-j) significantly reduce build time

For CUDA support:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
For Metal (macOS):
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j $(nproc)
Important Note: Any backend can be built as long as its requirements are installed, but it cannot be run without the corresponding hardware. The only backend that can be run for testing and validation is the CPU backend.
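Even in a GPU-enabled build, inference can be kept on the CPU backend by offloading zero layers; a minimal sketch, assuming a local model file (note that a GPU build may still probe for devices at startup):
# Keep all layers on the CPU backend (-ngl / --n-gpu-layers 0)
./build/bin/llama-cli -m path/to/model.gguf -p "Hello" -n 10 -ngl 0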
Single-config generators:
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
Multi-config generators:
cmake -B build -G "Xcode"
cmake --build build --config Debug
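Other multi-config generators follow the same pattern, for example Ninja (assuming your CMake provides the Ninja Multi-Config generator):
cmake -B build -G "Ninja Multi-Config"
cmake --build build --config Debug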
Run the full test suite with CTest after building:
ctest --test-dir build --output-on-failure -j $(nproc)
Test suite: 38 tests covering tokenizers, grammar parsing, sampling, backends, and integration.
Expected failures: 2-3 tests may fail if network access is unavailable (they download models).
Test time: ~30 seconds for passing tests.
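To run only a subset of tests (useful when offline), CTest's name filters apply; a sketch, assuming the tokenizer tests follow the test-tokenizer-* naming used in tests/:
# Run only tests whose names match a pattern
ctest --test-dir build -R "test-tokenizer" --output-on-failure
# Or exclude tests matching a pattern
ctest --test-dir build -E "test-tokenizer" --output-on-failure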
Run server-specific unit tests after building the server:
# Build the server first
cmake --build build --target llama-server
# Navigate to server tests and run
cd tools/server/tests
source ../../../.venv/bin/activate
./tests.sh
Server test dependencies: The .venv environment includes the required dependencies for server unit tests (pytest, aiohttp, etc.). Tests can be run individually or with various options as documented in tools/server/tests/README.md.
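Individual server tests can also be selected with pytest's keyword filter; a sketch, with the selection expression purely illustrative:
cd tools/server/tests
source ../../../.venv/bin/activate
pytest -v -k "completion"   # run only tests whose names match the keyword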
# Verify the build works (prints version information)
./build/bin/llama-cli --version
# Test model loading (requires model file)
./build/bin/llama-cli -m path/to/model.gguf -p "Hello" -n 10
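The server can be smoke-tested in a similar way; a minimal sketch, assuming a local model file and a free port (the /health endpoint is part of the server's HTTP API):
# Start the server in the background and check its health endpoint
./build/bin/llama-server -m path/to/model.gguf --port 8080 &
curl http://localhost:8080/health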
ALWAYS format C++ code before committing:
git clang-format
Configuration is in .clang-format with these key rules:
- Pointer alignment: void * ptr (middle)
- Reference alignment: int & ref (middle)

ALWAYS activate the Python environment in .venv and use tools from that environment:
# Activate virtual environment
source .venv/bin/activate
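With the environment active, the repository's Python tools run directly; for example, the HF-to-GGUF conversion script (script name as currently found at the repository root):
# Show the options of the model conversion script
python convert_hf_to_gguf.py --help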
Configuration files:
- .flake8: flake8 settings (max-line-length=125, excludes examples/tools)
- pyrightconfig.json: pyright type checking configuration

Run before committing:
pre-commit run --all-files
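The same checks can be installed as a git hook, or run individually (the hook id below is an assumption based on the pre-commit configuration):
pre-commit install                    # run checks automatically on every commit
pre-commit run flake8 --all-files     # run a single hook by id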
Key workflows that run on every PR:
- .github/workflows/build.yml: Multi-platform builds
- .github/workflows/server.yml: Server functionality tests
- .github/workflows/python-lint.yml: Python code quality
- .github/workflows/python-type-check.yml: Python type checking

Run the full CI locally before submitting PRs:
mkdir tmp
# CPU-only build
bash ./ci/run.sh ./tmp/results ./tmp/mnt
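Backend-specific CI runs are selected through environment variables; a sketch for CUDA, assuming GG_BUILD_CUDA is still the variable ci/run.sh reads and the hardware is present:
# CUDA-enabled CI run
GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt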
CI Runtime: 30-60 minutes depending on backend configuration
Add ggml-ci to the commit message to trigger heavy CI workloads on the custom CI infrastructure.
Repository layout:
- src/: Main llama library implementation (llama.cpp, llama-*.cpp)
- include/: Public API headers, primarily include/llama.h
- ggml/: Core tensor library (submodule with custom GGML framework)
- examples/: 30+ example applications and tools
- tools/: Additional development and utility tools (server benchmarks, tests)
- tests/: Comprehensive test suite with CTest integration
- docs/: Detailed documentation (build guides, API docs, etc.)
- scripts/: Utility scripts for CI, data processing, and automation
- common/: Shared utility code used across examples
- CMakeLists.txt: Primary build configuration

Key files:
- include/llama.h: Main C API header (~2000 lines)
- src/llama.cpp: Core library implementation (~8000 lines)
- CONTRIBUTING.md: Coding guidelines and PR requirements
- .clang-format: C++ formatting rules
- .pre-commit-config.yaml: Git hook configuration

Executables are placed in build/bin/. Primary tools:
- llama-cli: Main inference tool
- llama-server: OpenAI-compatible HTTP server
- llama-quantize: Model quantization utility
- llama-perplexity: Model evaluation tool
- llama-bench: Performance benchmarking
- llama-convert-llama2c-to-ggml: Model conversion utility

Configuration:
- Build: CMakeLists.txt, cmake/ directory
- Formatting/linting: .clang-format, .clang-tidy, .flake8
- CI: .github/workflows/, ci/run.sh
- Git: .gitignore (includes build artifacts, models, cache)

Quick command reference:
- Format: git clang-format
- Build: cmake --build build --config Release
- Test: ctest --test-dir build --output-on-failure
- Server tests: cd tools/server/tests && source ../../../.venv/bin/activate && ./tests.sh
- Binaries: build/bin/

# Benchmark inference performance
./build/bin/llama-bench -m model.gguf
# Evaluate model perplexity
./build/bin/llama-perplexity -m model.gguf -f dataset.txt
# Test backend operations
./build/bin/test-backend-ops
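llama-bench also accepts knobs for prompt size, generation length, and thread count; a sketch with arbitrary example values:
# Benchmark with 512 prompt tokens, 128 generated tokens, 8 threads
./build/bin/llama-bench -m model.gguf -p 512 -n 128 -t 8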
Environment notes:
- Python dependencies: a pre-configured .venv is provided
- ccache: apt install ccache or brew install ccache
- pre-commit: pip install pre-commit
- Changes to the public API in include/llama.h require careful consideration
- Default branch: master
- Never commit build artifacts (build/, .ccache/, *.o, *.gguf)

Only search for additional information if these instructions are incomplete or found to be incorrect. This document contains validated build and test procedures that work reliably across different environments.