README-sycl.md

llama.cpp for SYCL

Background

OS

Intel GPU

Linux

Windows

Environment Variable

Known Issue

Q&A

Todo

Background

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.

oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.

Intel uses SYCL as its direct programming language to support CPUs, GPUs, and FPGAs.

To avoid reinventing the wheel, this code follows the other backend code paths in llama.cpp (such as OpenBLAS, cuBLAS, and CLBlast). We use the open-source tool SYCLomatic (commercial release: Intel® DPC++ Compatibility Tool) to migrate the code to SYCL.

llama.cpp for SYCL is used to support Intel GPUs.

For Intel CPUs, we recommend using the regular llama.cpp x86 build (with Intel MKL).
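
As a rough sketch (the BLAS flag names below come from the main llama.cpp README of the same period and may differ in newer versions; treat this as an illustration, not the authoritative CPU recipe), an MKL-backed CPU build looks like:

# sketch: CPU-only build using Intel MKL through the generic BLAS path
source /opt/intel/oneapi/setvars.sh   # provides MKL
mkdir -p build && cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp
cmake --build . --config Release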

OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Support|Windows 11|

Intel GPU

|Intel GPU|Status|Verified Model|
|-|-|-|
|Intel Data Center Max Series|Support|Max 1550|
|Intel Data Center Flex Series|Support|Flex 170|
|Intel Arc Series|Support|Arc 770, 730M|
|Intel built-in Arc GPU|Support|built-in Arc GPU in Meteor Lake|
|Intel iGPU|Support|iGPU in i5-1250P, i7-1165G7|

Linux

Setup Environment

  1. Install Intel GPU driver.

a. Please install the Intel GPU driver following the official guide: Install GPU Drivers.

Note: for iGPU, please install the client GPU driver.

b. Add your user to the video and render groups:

sudo usermod -aG render username
sudo usermod -aG video username

Note: log out and log back in for the group change to take effect.
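
You can verify the membership after logging back in (standard Linux command; replace username with your user):

groups username
# the output should now include both video and render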

c. Check

sudo apt install clinfo
sudo clinfo -l

Output (example):

Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics


Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]

  2. Install Intel® oneAPI Base toolkit.

a. Please follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: /opt/intel/oneapi.

The following guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.
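
For example, with a hypothetical non-default install prefix, only the setvars.sh path changes:

# hypothetical custom install location; substitute your own path
source /home/username/intel/oneapi/setvars.sh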

b. Check

source /opt/intel/oneapi/setvars.sh

sycl-ls

There should be one or more Level-Zero devices listed, such as [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]

  3. Build locally:

    mkdir -p build
    cd build
    source /opt/intel/oneapi/setvars.sh
    
    #for FP16
    #cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference
    
    #for FP32
    cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
    
    #build example/main only
    #cmake --build . --config Release --target main
    
    #build all binary
    cmake --build . --config Release -v
    
    cd ..
    

or

./examples/sycl/build.sh

Note:

  • By default, all binaries are built, which takes more time. To reduce build time, we recommend building only example/main (see the command below).
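
Uncommented, the target-only build is the same command that appears commented out in the block above:

cmake --build . --config Release --target main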

Run

  1. Put the model file into the models folder

  2. Enable the oneAPI running environment

    source /opt/intel/oneapi/setvars.sh
    
  3. List the device IDs

Run without parameters:

./build/bin/ls-sycl-device

or

./build/bin/main

Check the IDs in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-Zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than Level-Zero in most cases|

  4. Set the device ID and run llama.cpp

Set the device ID to 0 with GGML_SYCL_DEVICE=0:

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

or run via the script:

./examples/sycl/run-llama2.sh

Note:

  • By default, mmap is used to read the model file. On some systems this causes the program to hang. We recommend adding the --no-mmap parameter to disable mmap() and avoid the issue (see the example below).
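
For example, the same run as above with mmap disabled (adjust the model path to your setup):

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 --no-mmap
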
  5. Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Windows

Setup Environment

  1. Install Intel GPU driver.

Please install the Intel GPU driver following the official guide: Install GPU Drivers.

  2. Install Intel® oneAPI Base toolkit.

a. Please follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: C:\Program Files (x86)\Intel\oneAPI.

The following guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.

b. Enable the oneAPI running environment:

  • In Search, input 'oneAPI'.

Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"

  • In Run:

In CMD:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

c. Check GPU

In oneAPI command line:

sycl-ls

There should be one or more Level-Zero devices listed, such as [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]

  3. Install CMake & Make

a. Download & install CMake for Windows: https://cmake.org/download/

b. Download & install Make for Windows, provided by mingw-w64: https://www.mingw-w64.org/downloads/
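
To confirm both tools are reachable from the oneAPI command prompt (assuming the mingw-w64 make is installed as make, as the build commands below expect), a quick check:

:: verify that cmake and make are on PATH
cmake --version
make --version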

  4. Build locally:

In the oneAPI command-line window:

mkdir -p build
cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

::  for FP16
::  faster for long-prompt inference
::  cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

::  for FP32
cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release


::  build example/main only
::  make main

::  build all binary
make -j
cd ..

or

.\examples\sycl\win-build-sycl.bat

Note:

  • By default, all binaries are built, which takes more time. To reduce build time, we recommend building only example/main (see the command below).
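
Uncommented, the target-only build is the same command that appears commented out in the block above:

make main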

Run

  1. Put the model file into the models folder

  2. Enable the oneAPI running environment

  • In Search, input 'oneAPI'.

Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"

  • In Run:

In CMD:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

  3. List the device IDs

Run without parameters:

build\bin\ls-sycl-device.exe

or

build\bin\main.exe

Check the IDs in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-Zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than Level-Zero in most cases|

  4. Set the device ID and run llama.cpp

Set the device ID to 0 with set GGML_SYCL_DEVICE=0:

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0

or run via the script:

.\examples\sycl\win-run-llama2.bat

Note:

  • By default, mmap is used to read the model file. On some systems this causes the program to hang. We recommend adding the --no-mmap parameter to disable mmap() and avoid the issue (see the example below).
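
For example, the same run as above with mmap disabled (adjust the model path to your setup):

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 --no-mmap
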
  5. Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Environment Variable

Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable the SYCL code path. LLAMA_SYCL=ON is mandatory for both FP32 and FP16.|
|LLAMA_SYCL_F16|ON (optional)|Enable the FP16 build of the SYCL code path. Faster for long-prompt inference. Do not set it for FP32.|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path|
|CMAKE_CXX_COMPILER|icpx (Linux), icx (Windows)|Use icpx/icx for the SYCL code path|

Running

|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device ID to use. Check the available device IDs in the default run output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable debug logging via the GGML_SYCL_DEBUG macro|
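
For example, a quick debug run on device 0 combining both variables (Linux form shown; the model path and prompt are placeholders):

# enable SYCL debug logging and select device 0 for this run only
GGML_SYCL_DEBUG=1 GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 32 -ngl 33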

Known Issue

  • Hang during startup

llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy misbehaves and blocks.

Solution: add --no-mmap.

Q&A

  • Error: error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory.

The oneAPI running environment was not enabled.

Install the oneAPI Base Toolkit and enable the environment with: source /opt/intel/oneapi/setvars.sh.

  • On Windows, there is no output and no error.

The oneAPI running environment was not enabled. Enable it from the oneAPI command prompt as described in the Windows setup above.

Todo

  • Support building on Windows.

  • Support multiple cards.