Преглед изворни кода

Arm AArch64: Documentation updates (#9321)

* Arm AArch64: Documentation updates

* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels

* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats

* Add newline to the end of docs/build.md
Dan Johansson пре 1 година
родитељ
комит
b2e89a3274
2 измењених фајлова са 8 додато и 0 уклоњено
  1. 6 0
      docs/build.md
  2. 2 0
      examples/quantize/README.md

+ 6 - 0
docs/build.md

@@ -380,3 +380,9 @@ For detailed info, such as model/device supports, CANN install, please refer to
 ### Android
 ### Android
 
 
 To read documentation for how to build on Android, [click here](./android.md)
 To read documentation for how to build on Android, [click here](./android.md)
+
+### Arm CPU optimized mulmat kernels
+
+Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler cpu-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
+
+To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).

+ 2 - 0
examples/quantize/README.md

@@ -54,6 +54,8 @@ As the models are currently fully loaded into memory, you will need adequate dis
 
 
 Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
 Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
 
 
+The quantization formats `Q4_0_4_4`, `Q4_0_4_8` and `Q4_0_8_8` are block interleaved variants of the `Q4_0` format, providing a data layout that is better suited for specific implementations of optimized mulmat kernels. Since these formats differ only in data layout, they have the same quantized size as the `Q4_0` format.
+
 *(outdated)*
 *(outdated)*
 
 
 | Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |
 | Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |