@@ -42,18 +42,6 @@ cmake --build build --config Release -j $(nproc)
cmake --build build --config Release -j $(nproc)
```

-- By default, NNPA is disabled by default. To enable it:
-
- ```bash
- cmake -S . -B build \
- -DCMAKE_BUILD_TYPE=Release \
- -DGGML_BLAS=ON \
- -DGGML_BLAS_VENDOR=OpenBLAS \
- -DGGML_NNPA=ON
-
- cmake --build build --config Release -j $(nproc)
- ```
-
For debug builds:

```bash
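
With NNPA gone, the only accelerated CPU build path documented here is the OpenBLAS one. For reference, a minimal sketch of the configure-and-build commands that remain after this hunk (flags copied from the surrounding diff; adjust paths to your checkout):

```bash
# Configure an OpenBLAS-accelerated Release build (VXE stays on by default)
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

# Build with all available cores
cmake --build build --config Release -j $(nproc)
```
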
@@ -164,15 +152,11 @@ All models need to be converted to Big-Endian. You can achieve this in three cas

Only available in IBM z15/LinuxONE 3 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.

-### 2. NNPA Vector Intrinsics Acceleration
-
-Only available in IBM z16/LinuxONE 4 or later system with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
-
-### 3. zDNN Accelerator (WIP)
+### 2. zDNN Accelerator (WIP)

Only available in IBM z17/LinuxONE 5 or later system with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs will default back to CPU routines.

-### 4. Spyre Accelerator
+### 3. Spyre Accelerator

_Only available with IBM z17 / LinuxONE 5 or later system. No support currently available._

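Whether the VXE/VXE2 path can actually engage depends on the machine generation described above. A quick way to check before building (assuming the standard Linux on Z `/proc/cpuinfo` feature flags; exact names may vary by kernel version):

```bash
# Print the vector-related facility flags the kernel reports (e.g. vx, vxd, vxe, vxe2)
grep -m1 '^features' /proc/cpuinfo | tr ' ' '\n' | grep '^vx'
```
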
@@ -230,10 +214,6 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
```

-5. `-DGGML_NNPA=ON` generates gibberish output
-
- Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.
-
## Getting Help on IBM Z & LinuxONE

1. **Bugs, Feature Requests**
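
The hunk headers above quote the requirement that "All models need to be converted to Big-Endian" on this platform. As a sketch of one conversion route (script location assumed from the gguf-py package layout in this repository; verify against your checkout):

```bash
# Byte-swap a little-endian GGUF model to big-endian in place
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model.gguf BIG
```
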
@@ -258,38 +238,38 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl

## Appendix B: SIMD Support Matrix

-| | VX/VXE/VXE2 | NNPA | zDNN | Spyre |
-| ---------- | ----------- | ---- | ---- | ----- |
-| FP32 | ✅ | ✅ | ✅ | ❓ |
-| FP16 | ✅ | ✅ | ❓ | ❓ |
-| BF16 | 🚫 | 🚫 | ❓ | ❓ |
-| Q4_0 | ✅ | ✅ | ❓ | ❓ |
-| Q4_1 | ✅ | ✅ | ❓ | ❓ |
-| MXFP4 | 🚫 | 🚫 | ❓ | ❓ |
-| Q5_0 | ✅ | ✅ | ❓ | ❓ |
-| Q5_1 | ✅ | ✅ | ❓ | ❓ |
-| Q8_0 | ✅ | ✅ | ❓ | ❓ |
-| Q2_K | 🚫 | 🚫 | ❓ | ❓ |
-| Q3_K | ✅ | ✅ | ❓ | ❓ |
-| Q4_K | ✅ | ✅ | ❓ | ❓ |
-| Q5_K | ✅ | ✅ | ❓ | ❓ |
-| Q6_K | ✅ | ✅ | ❓ | ❓ |
-| TQ1_0 | 🚫 | 🚫 | ❓ | ❓ |
-| TQ2_0 | 🚫 | 🚫 | ❓ | ❓ |
-| IQ2_XXS | 🚫 | 🚫 | ❓ | ❓ |
-| IQ2_XS | 🚫 | 🚫 | ❓ | ❓ |
-| IQ2_S | 🚫 | 🚫 | ❓ | ❓ |
-| IQ3_XXS | 🚫 | 🚫 | ❓ | ❓ |
-| IQ3_S | 🚫 | 🚫 | ❓ | ❓ |
-| IQ1_S | 🚫 | 🚫 | ❓ | ❓ |
-| IQ1_M | 🚫 | 🚫 | ❓ | ❓ |
-| IQ4_NL | ✅ | ✅ | ❓ | ❓ |
-| IQ4_XS | ✅ | ✅ | ❓ | ❓ |
-| FP32->FP16 | 🚫 | ✅ | ❓ | ❓ |
-| FP16->FP32 | 🚫 | ✅ | ❓ | ❓ |
+| | VX/VXE/VXE2 | zDNN | Spyre |
+|------------|-------------|------|-------|
+| FP32 | ✅ | ✅ | ❓ |
+| FP16 | ✅ | ❓ | ❓ |
+| BF16 | 🚫 | ❓ | ❓ |
+| Q4_0 | ✅ | ❓ | ❓ |
+| Q4_1 | ✅ | ❓ | ❓ |
+| MXFP4 | 🚫 | ❓ | ❓ |
+| Q5_0 | ✅ | ❓ | ❓ |
+| Q5_1 | ✅ | ❓ | ❓ |
+| Q8_0 | ✅ | ❓ | ❓ |
+| Q2_K | 🚫 | ❓ | ❓ |
+| Q3_K | ✅ | ❓ | ❓ |
+| Q4_K | ✅ | ❓ | ❓ |
+| Q5_K | ✅ | ❓ | ❓ |
+| Q6_K | ✅ | ❓ | ❓ |
+| TQ1_0 | 🚫 | ❓ | ❓ |
+| TQ2_0 | 🚫 | ❓ | ❓ |
+| IQ2_XXS | 🚫 | ❓ | ❓ |
+| IQ2_XS | 🚫 | ❓ | ❓ |
+| IQ2_S | 🚫 | ❓ | ❓ |
+| IQ3_XXS | 🚫 | ❓ | ❓ |
+| IQ3_S | 🚫 | ❓ | ❓ |
+| IQ1_S | 🚫 | ❓ | ❓ |
+| IQ1_M | 🚫 | ❓ | ❓ |
+| IQ4_NL | ✅ | ❓ | ❓ |
+| IQ4_XS | ✅ | ❓ | ❓ |
+| FP32->FP16 | 🚫 | ❓ | ❓ |
+| FP16->FP32 | 🚫 | ❓ | ❓ |

✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself

-Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on Aug 22, 2025.
+Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on Sep 6, 2025.
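
The matrix above records which data types have SIMD kernels, not how much they help. One way to spot-check the effect on your own machine is to benchmark a converted model with the bundled `llama-bench` tool (binary path assumed from a default CMake build; the model name is a placeholder):

```bash
# Measure prompt-processing and generation throughput for one model
./build/bin/llama-bench -m /path/to/model-be.gguf -t $(nproc)
```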