|
|
há 1 ano atrás | |
|---|---|---|
| .devops | há 1 ano atrás | |
| .github | há 1 ano atrás | |
| ci | há 1 ano atrás | |
| cmake | há 1 ano atrás | |
| common | há 1 ano atrás | |
| docs | há 1 ano atrás | |
| examples | há 1 ano atrás | |
| ggml | há 1 ano atrás | |
| gguf-py | há 1 ano atrás | |
| grammars | há 1 ano atrás | |
| include | há 1 ano atrás | |
| media | há 1 ano atrás | |
| models | há 1 ano atrás | |
| pocs | há 1 ano atrás | |
| prompts | há 2 anos atrás | |
| requirements | há 1 ano atrás | |
| scripts | há 1 ano atrás | |
| spm-headers | há 1 ano atrás | |
| src | há 1 ano atrás | |
| tests | há 1 ano atrás | |
| .clang-format | há 1 ano atrás | |
| .clang-tidy | há 1 ano atrás | |
| .dockerignore | há 1 ano atrás | |
| .ecrc | há 1 ano atrás | |
| .editorconfig | há 1 ano atrás | |
| .flake8 | há 1 ano atrás | |
| .gitignore | há 1 ano atrás | |
| .gitmodules | há 1 ano atrás | |
| .pre-commit-config.yaml | há 1 ano atrás | |
| AUTHORS | há 1 ano atrás | |
| CMakeLists.txt | há 1 ano atrás | |
| CMakePresets.json | há 1 ano atrás | |
| CONTRIBUTING.md | há 1 ano atrás | |
| LICENSE | há 1 ano atrás | |
| Makefile | há 1 ano atrás | |
| Package.swift | há 1 ano atrás | |
| README.md | há 1 ano atrás | |
| SECURITY.md | há 1 ano atrás | |
| convert_hf_to_gguf.py | há 1 ano atrás | |
| convert_hf_to_gguf_update.py | há 1 ano atrás | |
| convert_llama_ggml_to_gguf.py | há 1 ano atrás | |
| convert_lora_to_gguf.py | há 1 ano atrás | |
| flake.lock | há 1 ano atrás | |
| flake.nix | há 1 ano atrás | |
| mypy.ini | há 2 anos atrás | |
| poetry.lock | há 1 ano atrás | |
| pyproject.toml | há 1 ano atrás | |
| pyrightconfig.json | há 1 ano atrás | |
| requirements.txt | há 1 ano atrás |
Roadmap / Project status / Manifesto / ggml
Inference of Meta's LLaMA model (and others) in pure C/C++
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.
The llama.cpp project is the main playground for developing new features for the ggml library.
6624c5cec3)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090ab)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a58032)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
**Multimodal:**
- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d9), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
| Backend | Target devices |
|---|---|
| Metal | Apple Silicon |
| BLAS | All |
| BLIS | All |
| SYCL | Intel and Nvidia GPU |
| MUSA | Moore Threads MTT GPU |
| CUDA | Nvidia GPU |
| hipBLAS | AMD GPU |
| Vulkan | GPU |
| CANN | Ascend NPU |
The main product of this project is the llama library. Its C-style interface can be found in include/llama.h.
The project also includes many example programs and tools using the llama library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
llama.cpp via brew, flox or nixThe Hugging Face platform hosts a number of LLMs compatible with llama.cpp:
After downloading a model, use the CLI tools to run it locally - see below.
llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.
The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:
llama.cpp in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)To learn more about model quantization, read this documentation
llama-cli toolRun a basic text completion:
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
See this page for a full list of parameters.
Run llama-cli in conversation/chat mode by passing the -cnv parameter:
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
By default, the chat template will be taken from the input model. If you want to use another chat template, pass --chat-template NAME as a parameter. See the list of supported templates
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
llama.cpp can constrain the output of the model via custom grammars. For example, you can force the model to output only JSON:
llama-cli -m your_model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
llama-server)The llama-server is a lightweight OpenAI API compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
Example usage:
llama-server -m your_model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
Use the llama-perplexity tool to measure perplexity over a given prompt (lower perplexity is better).
For more information, see https://huggingface.co/docs/transformers/perplexity.
To learn more how to measure perplexity using llama.cpp, read this documentation
llama.cpp repo and merge PRs into the master branchDevelopment documentation
Seminal papers and background on the models
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT: