[!IMPORTANT] This build documentation applies only to IBM Z & LinuxONE mainframes (s390x). For other architectures, see build.md.
The main product of this project is the llama library. Its C-style interface can be found in include/llama.h.
The project also includes many example programs and tools using the llama library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.
To get the code:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Building llama.cpp with BLAS support is highly recommended, as it has been shown to improve performance. Make sure OpenBLAS is installed in your environment.
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j $(nproc)
Notes:
By default, VXE/VXE2 is enabled. To disable it (not recommended):
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_VXE=OFF
cmake --build build --config Release -j $(nproc)
By default, NNPA is disabled. To enable it:
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_NNPA=ON
cmake --build build --config Release -j $(nproc)
For debug builds:
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Debug -j $(nproc)
For static builds, add -DBUILD_SHARED_LIBS=OFF:
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(nproc)
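The build variants above differ only in a handful of flags. As a rough illustration, the following Python sketch assembles the configure command from a few switches. The helper name `cmake_configure_args` and its parameters are hypothetical and are not part of llama.cpp; the flags themselves are the ones listed above.

```python
import shlex


def cmake_configure_args(build_type="Release", blas=True, vxe=True,
                         nnpa=False, static=False):
    """Return the cmake configure command as a list of arguments.

    Hypothetical helper: it only mirrors the flag combinations shown in
    this document and does not run anything.
    """
    args = ["cmake", "-S", ".", "-B", "build",
            f"-DCMAKE_BUILD_TYPE={build_type}"]
    if blas:
        args += ["-DGGML_BLAS=ON", "-DGGML_BLAS_VENDOR=OpenBLAS"]
    if not vxe:
        # VXE/VXE2 is enabled by default, so the flag is only needed to disable it.
        args.append("-DGGML_VXE=OFF")
    if nnpa:
        # NNPA is disabled by default, so the flag is only needed to enable it.
        args.append("-DGGML_NNPA=ON")
    if static:
        args.append("-DBUILD_SHARED_LIBS=OFF")
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command line for a Release build with NNPA.
    print(shlex.join(cmake_configure_args(nnpa=True)))
```

The returned list can be passed to `subprocess.run(..., check=True)` or printed for pasting into a shell.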
This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the IBM zDNN library installed.
You may find the official build instructions here: Building and Installing zDNN
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_ZDNN=ON
cmake --build build --config Release -j $(nproc)
All models need to be converted to Big-Endian. There are three ways to do this:
Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)
You can find popular models pre-converted and verified at s390x Verified Models or s390x Runnable Models.
These models have already been converted from safetensors to GGUF Big-Endian, and their respective tokenizers have been verified to run correctly on IBM z15 and later systems.
Convert safetensors model to GGUF Big-Endian directly (recommended)
The model you are trying to convert must be in safetensors file format (for example IBM Granite 3.3 2B). Make sure you have downloaded the model repository for this case.
Ensure that you have installed the required packages in advance
pip3 install -r requirements.txt
Convert the safetensors model to GGUF
python3 convert_hf_to_gguf.py \
--outfile model-name-be.f16.gguf \
--outtype f16 \
--bigendian \
model-directory/
For example,
python3 convert_hf_to_gguf.py \
--outfile granite-3.3-2b-instruct-be.f16.gguf \
--outtype f16 \
--bigendian \
granite-3.3-2b-instruct/
Convert existing GGUF Little-Endian model to Big-Endian
The model you are trying to convert must be in gguf file format (for example IBM Granite 3.3 2B GGUF). Make sure you have downloaded the model file for this case.
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
For example,
python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
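The rename step in the example above suggests that the conversion script rewrites the file in place. Assuming that is the case, the following Python sketch copies the Little-Endian file first so that the original is kept. The helper names `big_endian_name` and `convert_copy_to_big_endian` and the `-le`/`-be` naming rule are illustrative assumptions; only the script path and the `BIG` argument come from the text above.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Script path as given in this document, relative to the llama.cpp checkout.
SCRIPT = "gguf-py/gguf/scripts/gguf_convert_endian.py"


def big_endian_name(le_path):
    """Hypothetical naming rule: swap a '-le.' tag for '-be.', else add '-be'."""
    path = Path(le_path)
    name = path.name
    if "-le." in name:
        new_name = name.replace("-le.", "-be.", 1)
    else:
        stem, dot, ext = name.rpartition(".")
        new_name = f"{stem}-be.{ext}" if dot else f"{name}-be"
    return str(path.with_name(new_name))


def convert_copy_to_big_endian(le_path):
    """Copy the model, then convert the copy (assumes in-place conversion)."""
    be_path = big_endian_name(le_path)
    shutil.copy2(le_path, be_path)  # keep the Little-Endian original untouched
    subprocess.run([sys.executable, SCRIPT, be_path, "BIG"], check=True)
    return be_path
```

Calling `convert_copy_to_big_endian("granite-3.3-2b-instruct-le.f16.gguf")` from the repository root would leave the original file as is and produce `granite-3.3-2b-instruct-be.f16.gguf`, at the cost of temporarily needing twice the disk space.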
Notes:
Only available on IBM z15/LinuxONE 3 or later systems with the -DGGML_VXE=ON (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.
Only available on IBM z16/LinuxONE 4 or later systems with the -DGGML_NNPA=ON (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
Only available on IBM z17/LinuxONE 5 or later systems with the -DGGML_ZDNN=ON compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs will fall back to CPU routines.
Only available on IBM z17/LinuxONE 5 or later systems. No support is currently available.
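To check whether a given machine meets the VXE and NNPA requirements noted above, one option is to inspect the `features` line of `/proc/cpuinfo`. The sketch below assumes that Linux on s390x reports these facilities under the names `vxe`, `vxe2` and `nnpa`; these names and both helper functions are assumptions for illustration, not something llama.cpp provides.

```python
def cpu_features(cpuinfo_text):
    """Return the set of entries on the 'features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "features":
            return set(value.split())
    return set()


def acceleration_hints(features):
    """Map each facility to whether its (assumed) feature name is present."""
    return {
        "VXE": "vxe" in features,    # assumed name for the z15-level facility
        "VXE2": "vxe2" in features,  # assumed name
        "NNPA": "nnpa" in features,  # assumed name for the z16-level facility
    }
```

On a Linux host this could be used as `acceleration_hints(cpu_features(open("/proc/cpuinfo").read()))`. A missing entry only indicates that the kernel does not report the facility; it says nothing about how llama.cpp was compiled.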
It is strongly recommended to use only LPAR (Type-1) virtualization to get the most performance.
Note: Type-2 virtualization is not supported at the moment. You can get it running, but performance will be suboptimal.
It is recommended to allocate at least 8 shared IFLs to the LPAR. Increasing the IFL count past 8 shared IFLs will improve only Prompt Processing performance, not Token Generation.
Note: IFL count does not equate to vCPU count.
It is strongly recommended to disable SMT via the kernel boot parameters as it negatively affects performance. Please refer to your Linux distribution's guide on disabling SMT via kernel boot parameters.
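As a quick sanity check after rebooting, the running kernel's command line can be inspected. The sketch below assumes the s390x kernel parameters `nosmt` and `smt=1` are what your distribution's guide has you set; the function name is hypothetical, and your distribution's documentation remains the authority on the exact parameter.

```python
def smt_disabled_on_cmdline(cmdline):
    """Return True if the kernel command line asks for SMT to be disabled.

    Assumes the s390x parameters 'nosmt' and 'smt=1'. A False result does
    not prove that SMT is active, only that neither parameter was given.
    """
    params = cmdline.split()
    return "nosmt" in params or "smt=1" in params
```

On a Linux host, `smt_disabled_on_cmdline(open("/proc/cmdline").read())` applies the check to the currently booted kernel.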
IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.
I'm getting the following error message while trying to load a model: gguf_init_from_file_impl: failed to load model: this GGUF file version 50331648 is extremely large, is there a mismatch between the host and model endianness?
Answer: Please ensure that the model you have downloaded or converted is GGUFv3 Big-Endian. These models are usually denoted with the -be suffix, e.g., granite-3.3-2b-instruct-be.F16.gguf.
You may refer to the Getting GGUF Models section to manually convert a safetensors model to GGUF Big Endian.
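The number in the error message is itself a hint: 50331648 is 0x03000000, which is the GGUF version 3 read with its four bytes in the wrong order. The following Python sketch uses that observation to guess a file's byte order from its first 8 bytes. It assumes the header layout of a 4-byte `GGUF` magic followed by a 32-bit version; the function name is hypothetical.

```python
import struct


def gguf_byte_order(header):
    """Guess 'little' or 'big' from the first 8 bytes of a GGUF file.

    Assumed layout: bytes 0-3 are the magic b"GGUF", bytes 4-7 are the
    version as an unsigned 32-bit integer in the file's byte order.
    """
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF header")
    version_le = struct.unpack("<I", header[4:8])[0]
    version_be = struct.unpack(">I", header[4:8])[0]
    # Real version numbers are small, so the correct byte order is the one
    # that yields a small non-zero value.
    if 0 < version_le <= 0xFFFF:
        return "little"
    if 0 < version_be <= 0xFFFF:
        return "big"
    return "unknown"
```

For a real file, pass the result of `open(path, "rb").read(8)`. On s390x the result should be `"big"`; a result of `"little"` means the model still needs to be converted.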
I'm getting extremely poor performance when running inference on a model
Answer: Please refer to Appendix B: SIMD Support Matrix to check whether your model's quantization is supported by SIMD acceleration.
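For scripted checks, the VX/VXE/VXE2 column of the support matrix in this document can be transcribed into a small lookup. The sketch below is a snapshot of that column as of this document's last update and will go stale as support changes; the names `VXE_ACCELERATED` and `is_vxe_accelerated` are hypothetical.

```python
# Data types marked as supported in the VX/VXE/VXE2 column of the
# SIMD support matrix in this document (snapshot, may be outdated).
VXE_ACCELERATED = frozenset({
    "FP32", "FP16",
    "Q4_0", "Q4_1", "Q8_0",
    "Q3_K", "Q4_K", "Q5_K", "Q6_K",
    "IQ4_NL", "IQ4_XS",
})


def is_vxe_accelerated(quant):
    """Return True if the given data type is listed as VXE-accelerated."""
    return quant.strip().upper() in VXE_ACCELERATED
```

For example, `is_vxe_accelerated("q4_0")` is true while `is_vxe_accelerated("Q5_0")` is not, so a Q5_0 model would run on the scalar path.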
I'm building on IBM z17 and getting the following error message: invalid switch -march=z17
Answer: Please ensure that you are using GCC 15.1.0 or later and that binutils is updated to the latest version. If this does not fix the problem, kindly open an issue.
Failing to install the sentencepiece package using GCC 15+
Answer: The sentencepiece team is aware of this, as seen in this issue.
As a temporary workaround, please run the installation command with the following environment variable set.
export CXXFLAGS="-include cstdint"
For example,
CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
-DGGML_NNPA=ON generates gibberish output
Answer: We are aware of this as detailed in this issue. Please either try reducing the number of threads, or disable the compile option using -DGGML_NNPA=OFF.
Bugs, Feature Requests
Please file an issue in llama.cpp and ensure that the title contains "s390x".
Other Questions
Please reach out directly to aionz@us.ibm.com.
| | Support | Minimum Compiler Version |
|---|---|---|
| IBM z15 | ✅ | |
| IBM z16 | ✅ | |
| IBM z17 | ✅ | GCC 15.1.0 |
| IBM zDNN | ✅ | |
| | VX/VXE/VXE2 | NNPA | zDNN | Spyre |
|---|---|---|---|---|
| FP32 | ✅ | ✅ | ✅ | ❓ |
| FP16 | ✅ | ✅ | ❓ | ❓ |
| BF16 | 🚫 | 🚫 | ❓ | ❓ |
| Q4_0 | ✅ | ✅ | ❓ | ❓ |
| Q4_1 | ✅ | ✅ | ❓ | ❓ |
| Q5_0 | 🚫 | 🚫 | ❓ | ❓ |
| Q5_1 | 🚫 | 🚫 | ❓ | ❓ |
| Q8_0 | ✅ | ✅ | ❓ | ❓ |
| Q2_K | 🚫 | 🚫 | ❓ | ❓ |
| Q3_K | ✅ | ✅ | ❓ | ❓ |
| Q4_K | ✅ | ✅ | ❓ | ❓ |
| Q5_K | ✅ | ✅ | ❓ | ❓ |
| Q6_K | ✅ | ✅ | ❓ | ❓ |
| TQ1_0 | 🚫 | 🚫 | ❓ | ❓ |
| TQ2_0 | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_XXS | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_XS | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_S | 🚫 | 🚫 | ❓ | ❓ |
| IQ3_XXS | 🚫 | 🚫 | ❓ | ❓ |
| IQ3_S | 🚫 | 🚫 | ❓ | ❓ |
| IQ1_S | 🚫 | 🚫 | ❓ | ❓ |
| IQ1_M | 🚫 | 🚫 | ❓ | ❓ |
| IQ4_NL | ✅ | ✅ | ❓ | ❓ |
| IQ4_XS | ✅ | ✅ | ❓ | ❓ |
| FP32->FP16 | 🚫 | ✅ | ❓ | ❓ |
| FP16->FP32 | 🚫 | ✅ | ❓ | ❓ |
Last Updated by Aaron Teo (aaron.teo1@ibm.com) on July 31, 2025.