README-sycl.md

llama.cpp for SYCL

Background

OS

Intel GPU

Linux

Windows

Environment Variable

Known Issue

Q&A

Todo

Background

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.

oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.

Intel uses SYCL as its direct programming language to support CPUs, GPUs, and FPGAs.

To avoid reinventing the wheel, this code follows the other backend code paths in llama.cpp (such as OpenBLAS, cuBLAS, and CLBlast). We use the open-source tool SYCLomatic (commercial release: Intel® DPC++ Compatibility Tool) to migrate the code to SYCL.

llama.cpp for SYCL is used to support Intel GPUs.

For Intel CPUs, we recommend using the regular llama.cpp x86 build (with Intel MKL).
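
As a rough sketch (the BLAS flag names below come from the main llama.cpp README of the same period and may differ in newer versions; treat this as an illustration, not the authoritative CPU recipe), an MKL-backed CPU build looks like:

# sketch: CPU-only build using Intel MKL through the generic BLAS path
source /opt/intel/oneapi/setvars.sh   # provides MKL
mkdir -p build && cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp
cmake --build . --config Release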

OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Support|Windows 11|

Intel GPU

|Intel GPU|Status|Verified Model|
|-|-|-|
|Intel Data Center Max Series|Support|Max 1550|
|Intel Data Center Flex Series|Support|Flex 170|
|Intel Arc Series|Support|Arc 770, 730M|
|Intel built-in Arc GPU|Support|built-in Arc GPU in Meteor Lake|
|Intel iGPU|Support|iGPU in i5-1250P, i7-1165G7|

Linux

Setup Environment

  1. Install Intel GPU driver.

a. Please install the Intel GPU driver following the official guide: Install GPU Drivers.

Note: for iGPU, please install the client GPU driver.

b. Add your user to the video and render groups:

sudo usermod -aG render username
sudo usermod -aG video username

Note: log out and log back in for the group change to take effect.
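
You can verify the membership after logging back in (standard Linux command; replace username with your user):

groups username
# the output should now include both video and render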

c. Check

sudo apt install clinfo
sudo clinfo -l

Output (example):

Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics


Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]

  2. Install Intel® oneAPI Base toolkit.

a. Please follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: /opt/intel/oneapi.

The following guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.
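
For example, with a hypothetical non-default install prefix, only the setvars.sh path changes:

# hypothetical custom install location; substitute your own path
source /home/username/intel/oneapi/setvars.sh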

b. Check

source /opt/intel/oneapi/setvars.sh

sycl-ls

There should be one or more Level-Zero devices listed, such as [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]

  3. Build locally:

    mkdir -p build
    cd build
    source /opt/intel/oneapi/setvars.sh
    
    #for FP16
    #cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference
    
    #for FP32
    cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
    
    #build example/main only
    #cmake --build . --config Release --target main
    
    #build all binary
    cmake --build . --config Release -v
    
    cd ..
    

or

./examples/sycl/build.sh

Note:

  • By default, all binaries are built, which takes more time. To reduce build time, we recommend building only example/main (see the command below).
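
Uncommented, the target-only build is the same command that appears commented out in the block above:

cmake --build . --config Release --target main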

Run

  1. Put the model file into the models folder

  2. Enable the oneAPI running environment

    source /opt/intel/oneapi/setvars.sh
    
  3. List the device IDs

Run without parameters:

./build/bin/ls-sycl-device

or

./build/bin/main

Check the IDs in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-Zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than Level-Zero in most cases|

  4. Set the device ID and run llama.cpp

Set the device ID to 0 with GGML_SYCL_DEVICE=0:

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

or run via the script:

./examples/sycl/run-llama2.sh

Note:

  • By default, mmap is used to read the model file. On some systems this causes the program to hang. We recommend adding the --no-mmap parameter to disable mmap() and avoid the issue (see the example below).
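
For example, the same run as above with mmap disabled (adjust the model path to your setup):

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 --no-mmap
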
  5. Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Windows

Setup Environment

  1. Install Intel GPU driver.

Please install the Intel GPU driver following the official guide: Install GPU Drivers.

  2. Install Intel® oneAPI Base toolkit.

a. Please follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: C:\Program Files (x86)\Intel\oneAPI.

The following guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.

b. Enable the oneAPI running environment:

  • In Search, input 'oneAPI'.

Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"

  • In Run:

In CMD:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

c. Check GPU

In oneAPI command line:

sycl-ls

There should be one or more Level-Zero devices listed, such as [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]

  3. Install CMake & Make

a. Download & install CMake for Windows: https://cmake.org/download/

b. Download & install Make for Windows, provided by mingw-w64: https://www.mingw-w64.org/downloads/
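
To confirm both tools are reachable from the oneAPI command prompt (assuming the mingw-w64 make is installed as make, as the build commands below expect), a quick check:

:: verify that cmake and make are on PATH
cmake --version
make --version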

  4. Build locally:

In the oneAPI command-line window:

mkdir -p build
cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

::  for FP16
::  faster for long-prompt inference
::  cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

::  for FP32
cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release


::  build example/main only
::  make main

::  build all binary
make -j
cd ..

or

.\examples\sycl\win-build-sycl.bat

Note:

  • By default, all binaries are built, which takes more time. To reduce build time, we recommend building only example/main (see the command below).
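
Uncommented, the target-only build is the same command that appears commented out in the block above:

make main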

Run

  1. Put the model file into the models folder

  2. Enable the oneAPI running environment

  • In Search, input 'oneAPI'.

Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"

  • In Run:

In CMD:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

  3. List the device IDs

Run without parameters:

build\bin\ls-sycl-device.exe

or

build\bin\main.exe

Check the IDs in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-Zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than Level-Zero in most cases|

  4. Set the device ID and run llama.cpp

Set the device ID to 0 with set GGML_SYCL_DEVICE=0:

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0

or run via the script:

.\examples\sycl\win-run-llama2.bat

Note:

  • By default, mmap is used to read the model file. On some systems this causes the program to hang. We recommend adding the --no-mmap parameter to disable mmap() and avoid the issue (see the example below).
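
For example, the same run as above with mmap disabled (adjust the model path to your setup):

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 --no-mmap
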
  5. Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Environment Variable

Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable the SYCL code path. LLAMA_SYCL=ON is mandatory for both FP32 and FP16.|
|LLAMA_SYCL_F16|ON (optional)|Enable the FP16 build of the SYCL code path. Faster for long-prompt inference. Do not set it for FP32.|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path|
|CMAKE_CXX_COMPILER|icpx (Linux), icx (Windows)|Use icpx/icx for the SYCL code path|

Running

|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device ID to use. Check the available device IDs in the default run output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable debug logging via the GGML_SYCL_DEBUG macro|
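
For example, a quick debug run on device 0 combining both variables (Linux form shown; the model path and prompt are placeholders):

# enable SYCL debug logging and select device 0 for this run only
GGML_SYCL_DEBUG=1 GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 32 -ngl 33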

Known Issue

  • Hang during startup

llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy misbehaves and blocks.

Solution: add --no-mmap.

Q&A

  • Error: error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory.

The oneAPI running environment was not enabled.

Install the oneAPI Base Toolkit and enable the environment with: source /opt/intel/oneapi/setvars.sh.

  • On Windows, there is no output and no error.

The oneAPI running environment was not enabled. Enable it from the oneAPI command prompt as described in the Windows setup above.

Todo

  • Support building on Windows.

  • Support multiple cards.