llama.cpp for SYCL

Background

SYCL is a higher-level programming model designed to improve programming productivity on various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source, embedded domain-specific language based on pure C++17.

oneAPI is an open, standards-based specification that supports multiple architecture types, including but not limited to GPU, CPU, and FPGA. The specification covers both direct programming and API-based programming paradigms.

Intel uses SYCL as the direct programming language to support CPUs, GPUs, and FPGAs.

To avoid reinventing the wheel, this code follows the other backend code paths in llama.cpp (such as OpenBLAS, cuBLAS, and CLBlast). The open-source tool SYCLomatic (commercially released as the Intel® DPC++ Compatibility Tool) was used to migrate the code to SYCL.

The llama.cpp SYCL backend is used to support Intel GPUs.

For Intel CPUs, we recommend using llama.cpp for x86 built with Intel MKL instead; a sketch of such a build follows.
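
A minimal sketch of an MKL-based CPU build, assuming the oneAPI (oneMKL) environment is installed; the BLAS options below follow the main llama.cpp README and are not specific to this SYCL guide:

mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh
# CPU-only build against Intel oneMKL (no SYCL offload)
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
cd ..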

OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04, Fedora Silverblue 39|
|Windows|Support|Windows 11|

Intel GPU

Verified

|Intel GPU|Status|Verified Model|
|-|-|-|
|Intel Data Center Max Series|Support|Max 1550|
|Intel Data Center Flex Series|Support|Flex 170|
|Intel Arc Series|Support|Arc 770, 730M|
|Intel built-in Arc GPU|Support|built-in Arc GPU in Meteor Lake|
|Intel iGPU|Support|iGPU in i5-1250P, i7-1260P, i7-1165G7|

Note: if an iGPU has fewer than 80 EUs (Execution Units), inference will be too slow for practical use. One way to check the EU count is sketched below.
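
A hedged check on Linux, assuming that the OpenCL "Max compute units" value reported for an Intel iGPU corresponds to its EU count (this holds for most recent Intel iGPUs):

sudo apt install clinfo
clinfo | grep "Max compute units"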

Memory

Memory is a key limitation when running LLMs on GPUs.

When llama.cpp runs, it prints a log line showing how much memory is allocated on the GPU, so you can see how much memory your case needs. For example: llm_load_tensors: buffer size = 3577.56 MiB.

For an iGPU, make sure enough host memory is available to be shared with the GPU. For llama-2-7b.Q4_0, 8 GB or more of host memory is recommended.

For a dGPU, make sure the device memory is sufficient. For llama-2-7b.Q4_0, 4 GB or more of device memory is recommended. If the model does not fit, you can offload fewer layers, as sketched below.
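
A hedged workaround when GPU memory is tight: lower the -ngl (--n-gpu-layers) value used in the Run sections below, so only part of the model is offloaded and the remaining layers run on the CPU. The layer count here is arbitrary:

# Offload only 20 layers instead of the full 33 used elsewhere in this guide
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 32 -e -ngl 20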

Docker

Note:

  • Only Docker on Linux is tested. Docker on WSL may not work.
  • You may need to install the Intel GPU driver on the host machine (see the Linux section for instructions).

Build the image

You can choose between an F16 and an F32 build. F16 is faster for long-prompt inference.

# For F16:
#docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .

# Or, for F32:
docker build -t llama-cpp-sycl -f .devops/main-intel.Dockerfile .

# Note: you can also use the ".devops/main-server.Dockerfile", which compiles the "server" example

Run

# Firstly, find all the DRI cards:
ls -la /dev/dri
# Then, pick the card that you want to use.

# For example with "/dev/dri/card1"
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

Linux

Setup Environment

  1. Install Intel GPU driver.

a. Install the Intel GPU driver following the official guide: Install GPU Drivers.

Note: for an iGPU, install the client GPU driver.

b. Add your user to the video and render groups:

sudo usermod -aG render username
sudo usermod -aG video username

Note: log out and log back in for the group changes to take effect.

c. Check

sudo apt install clinfo
sudo clinfo -l

Output (example):

Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics


Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]

  2. Install the Intel® oneAPI Base Toolkit.

a. Follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: /opt/intel/oneapi.

The following guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.

b. Check

source /opt/intel/oneapi/setvars.sh

sycl-ls

There should be one or more level-zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]

  3. Build locally:

Note:

  • You can choose between an F16 and an F32 build. F16 is faster for long-prompt inference.
  • By default, all binaries are built, which takes longer. To reduce the build time, we recommend building only the example/main target.

    mkdir -p build
    cd build
    source /opt/intel/oneapi/setvars.sh
    
    # For FP16:
    #cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
    
    # Or, for FP32:
    cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
    
    # Build example/main only
    #cmake --build . --config Release --target main
    
    # Or, build all binary
    cmake --build . --config Release -v
    
    cd ..
    

or

./examples/sycl/build.sh

Run

  1. Put the model file into the models folder.

You can download llama-2-7b.Q4_0.gguf as an example; a download sketch follows.
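
A minimal sketch of one way to download it; the Hugging Face repository path below is an assumption for illustration, not a source endorsed by this guide:

mkdir -p models
wget -P models https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf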

  2. Enable the oneAPI running environment

    source /opt/intel/oneapi/setvars.sh
    
  3. List the device IDs

Run without parameters:

./build/bin/ls-sycl-device

# or run the "main" executable and check the startup log:

./build/bin/main

Check the device ID in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than level-zero in most cases|

  4. Set the device ID and run llama.cpp

Set device ID = 0 with GGML_SYCL_DEVICE=0:

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

or run the script:

./examples/sycl/run_llama2.sh

Note:

  • By default, mmap is used to read the model file. On some systems this causes a hang; add the --no-mmap parameter to disable mmap() and avoid the issue, as shown below.
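
For example, the same command as above with only the --no-mmap flag added:

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 --no-mmap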

  5. Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Windows

Setup Environment

  1. Install Intel GPU driver.

Install the Intel GPU driver following the official guide: Install GPU Drivers.

Note: the driver is mandatory for compute functionality.

  2. Install Visual Studio.

Install Visual Studio, which is required to enable the oneAPI environment on Windows.

  3. Install the Intel® oneAPI Base Toolkit.

a. Follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: C:\Program Files (x86)\Intel\oneAPI.

The following guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.

b. Enable oneAPI running environment:

  • In the Windows Search box, type 'oneAPI', then open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022".

  • Or, in an existing CMD window, run:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

c. Check GPU

In the oneAPI command line:

sycl-ls

There should be one or more level-zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]

  4. Install cmake & make

a. Download & install cmake for Windows: https://cmake.org/download/

b. Download & install make for Windows provided by mingw-w64.

For example, x86_64-13.2.0-release-win32-seh-msvcrt-rt_v11-rev1.7z.

  • Unzip the binary package, go into the bin sub-folder, and rename xxx-make.exe to make.exe.

  • Add the bin folder path to the Windows system PATH environment variable (a per-session sketch follows this list).
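
A hedged sketch for the current CMD session only; C:\mingw64\bin is a hypothetical install location, so use the folder you actually unzipped to (or add it permanently via System Properties > Environment Variables):

set PATH=C:\mingw64\bin;%PATH%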

Build locally:

In the oneAPI command line window:

mkdir -p build
cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

::  for FP16
::  faster for long-prompt inference
::  cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

::  for FP32
cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release


::  build example/main only
::  make main

::  build all binary
make -j
cd ..

or

.\examples\sycl\win-build-sycl.bat

Note:

  • By default, all binaries are built, which takes longer. To reduce the build time, we recommend building only the example/main target.

Run

  1. Put the model file into the models folder.

You can download llama-2-7b.Q4_0.gguf as an example.

  2. Enable the oneAPI running environment
  • In the Windows Search box, type 'oneAPI', then open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022".

  • Or, in an existing CMD window, run:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

  3. List the device IDs

Run without parameters:

build\bin\ls-sycl-device.exe

or

build\bin\main.exe

Check the device ID in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than level-zero in most cases|

  4. Set the device ID and run llama.cpp

Set device ID = 0 with set GGML_SYCL_DEVICE=0:

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0

or run the script:

.\examples\sycl\win-run-llama2.bat

Note:

  • By default, mmap is used to read the model file. On some systems this causes a hang; add the --no-mmap parameter to disable mmap() and avoid the issue.

  5. Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Environment Variables

Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable the SYCL code path. Mandatory for both FP32 and FP16 builds.|
|LLAMA_SYCL_F16|ON (optional)|Enable FP16 in the SYCL code path; faster for long-prompt inference. Do not set it for FP32.|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path.|
|CMAKE_CXX_COMPILER|icpx (Linux), icx (Windows)|Use icpx/icx for the SYCL code path.|

Running

|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device ID to use. Check the available device IDs in the default run output.|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable debug logging via the GGML_SYCL_DEBUG macro (see the sketch below).|
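
A minimal sketch of enabling the debug output for a single run on Linux; the prompt and token count here are arbitrary:

GGML_SYCL_DEBUG=1 GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 32 -e -ngl 33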

Known Issues

  • Hang during startup

llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems the memcpy misbehaves and blocks.

Solution: add --no-mmap or --mmap 0.

Q&A

  • Error: error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory.

The oneAPI running environment is not enabled.

Install the oneAPI Base Toolkit and enable the environment with: source /opt/intel/oneapi/setvars.sh.

  • On Windows: no output, but no error either.

The oneAPI running environment is not enabled.

  • Compile errors.

Remove the build folder and try again.

  • I cannot see [ext_oneapi_level_zero:gpu:0] after installing the GPU driver on Linux.

Please run sudo sycl-ls.

If the device shows up in that output, add your user to the video and render groups:

  sudo usermod -aG render username
  sudo usermod -aG video username

Then log out and log back in.

If it still does not show up, re-check the GPU driver installation steps; a quick sanity check is sketched below.
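
One quick check, assuming a bare-metal Linux install, is to confirm that the kernel driver exposes the DRI render nodes (the same check used in the Docker section above):

ls -la /dev/dri
# Expect entries such as card0/card1 and renderD128; if they are missing, the kernel GPU driver is not loaded.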

Todo

  • Support multiple cards.