
llama.cpp for SYCL

Background
OS
Intel GPU
Docker
Linux
Windows
Environment Variable
Known Issue
Q&A
Todo

Background

SYCL is a higher-level programming model that improves programming productivity on various hardware accelerators, such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.

oneAPI is an open, standards-based specification that supports multiple architecture types, including but not limited to GPUs, CPUs, and FPGAs. The specification covers both direct programming and API-based programming paradigms.

Intel uses SYCL as the direct programming language to support CPUs, GPUs, and FPGAs.

To avoid reinventing the wheel, this code follows other code paths in llama.cpp (such as OpenBLAS, cuBLAS, and CLBlast). The migration to SYCL was done with the open-source tool SYCLomatic (commercial release: Intel® DPC++ Compatibility Tool).

llama.cpp for SYCL is used to support Intel GPUs.

For Intel CPUs, we recommend llama.cpp for x86 (built with Intel MKL).

OS

Intel GPU

Verified

Note: If the iGPU has fewer than 80 EUs (Execution Units), inference will be too slow to be usable.

Memory

Memory is a limiting factor when running LLMs on GPUs.

When llama.cpp runs, it prints a log line showing the memory allocated on the GPU, so you can see how much memory your case needs. For example: llm_load_tensors: buffer size = 3577.56 MiB.

For an iGPU, make sure enough host memory is available to share with the GPU. For llama-2-7b.Q4_0, 8GB+ of host memory is recommended.

For a dGPU, make sure the device memory is sufficient. For llama-2-7b.Q4_0, 4GB+ of device memory is recommended.
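
As a rough aid for reading that log line, the shell sketch below extracts the buffer size in MiB and compares it with a memory budget you supply. The function names `buffer_size_mib` and `fits_in_mib` are made up for this illustration and are not part of llama.cpp. The log format is assumed to match the example line above.

```shell
# Hypothetical helpers for illustration only (not part of llama.cpp).

# Read log text on stdin and print the first "buffer size = N MiB" value found.
buffer_size_mib() {
    sed -n 's/.*buffer size *= *\([0-9.][0-9.]*\) *MiB.*/\1/p' | head -n 1
}

# Succeed (exit status 0) if the required size $1 fits in the budget $2.
# Both values are in MiB and may be fractional.
fits_in_mib() {
    awk -v need="$1" -v have="$2" 'BEGIN { exit !(need + 0 <= have + 0) }'
}
```

For example, if the startup log has been saved to a file, `buffer_size_mib < run.log` prints the size, and `fits_in_mib "$size" 4096` checks it against a 4 GiB budget. The file name `run.log` is only a placeholder.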

Docker

Note:

Only Docker on Linux has been tested. Docker on WSL may not work.
You may need to install the Intel GPU driver on the host machine (see the Linux section for instructions).

Build the image

You can choose between F16 and F32 build. F16 is faster for long-prompt inference.

# For F16:
#docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .

# Or, for F32:
docker build -t llama-cpp-sycl -f .devops/main-intel.Dockerfile .

# Note: you can also use the ".devops/main-server.Dockerfile", which compiles the "server" example

Run

# Firstly, find all the DRI cards:
ls -la /dev/dri
# Then, pick the card that you want to use.

# For example with "/dev/dri/card1"
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

Linux

Setup Environment

Install the Intel GPU driver.

a. Install the Intel GPU driver by following the official guide: Install GPU Drivers.

Note: for an iGPU, install the client GPU driver.

b. Add your user to the video and render groups.

sudo usermod -aG render username
sudo usermod -aG video username

Note: log out and back in for the change to take effect.

c. Check

sudo apt install clinfo
sudo clinfo -l

Output (example):

Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics


Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]

Install the Intel® oneAPI Base Toolkit.

a. Follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: /opt/intel/oneapi.

The rest of this guide uses the default folder as an example. If you use a different folder, adjust the paths accordingly.

b. Check

source /opt/intel/oneapi/setvars.sh

sycl-ls

There should be one or more level-zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
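
The check above can be scripted. The sketch below defines a small helper that reads `sycl-ls` output on stdin and succeeds only if a Level Zero GPU entry is listed. The name `has_level_zero_gpu` is hypothetical, and the line format is assumed to match the example output shown here.

```shell
# Hypothetical helper for illustration only.
# Succeeds if the text on stdin contains a line starting with an
# [ext_oneapi_level_zero:gpu:N] entry, as in the example output.
has_level_zero_gpu() {
    grep -q '^\[ext_oneapi_level_zero:gpu:[0-9][0-9]*\]'
}
```

After sourcing setvars.sh, it could be used as `sycl-ls | has_level_zero_gpu && echo "Level Zero GPU found"`.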

Build locally:

Note:

You can choose between F16 and F32 build. F16 is faster for long-prompt inference.

By default, all binaries are built, which takes more time. To reduce build time, we recommend building example/main only.

mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh

# For FP16:
#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON

# Or, for FP32:
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Build example/main only
#cmake --build . --config Release --target main

# Or, build all binaries
cmake --build . --config Release -v

cd ..

Or build with the script:

./examples/sycl/build.sh

Run

Put the model file in the models folder

You can download llama-2-7b.Q4_0.gguf as an example.

Enable the oneAPI running environment
```
source /opt/intel/oneapi/setvars.sh
```
List device IDs

Run without parameters:

./build/bin/ls-sycl-device

# or run the "main" executable and look at the output log:

./build/bin/main

Check the ID in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
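
To pick an ID from a list like this without reading it by eye, a short shell sketch can help. The helper name `first_device_id` is hypothetical, and the parsing assumes the "Device N: name," layout shown in the example log. It prints the ID of the first device whose line contains the given substring.

```shell
# Hypothetical helper for illustration only.
# Usage: first_device_id SUBSTRING < device-list
# Prints the numeric ID of the first "Device N:" line containing SUBSTRING.
first_device_id() {
    awk -v pat="$1" '
        /^[ \t]*Device [0-9]+:/ && index($0, pat) {
            id = $2              # second field looks like "0:"
            sub(/:$/, "", id)    # drop the trailing colon
            print id
            exit
        }'
}
```

Assuming the device list has been saved to a file such as `devices.log` (a placeholder name), `first_device_id "Arc" < devices.log` prints the ID to use for GGML_SYCL_DEVICE.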

Set the device ID and run llama.cpp

Set device ID 0 with GGML_SYCL_DEVICE=0:

GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33

or run the script:

./examples/sycl/run_llama2.sh

Note:

By default, mmap is used to read the model file. In some cases this causes a hang. We recommend the --no-mmap parameter to disable mmap() and avoid this issue.

Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Windows

Setup Environment

Install the Intel GPU driver.

Install the Intel GPU driver by following the official guide: Install GPU Drivers.

Note: The driver is mandatory for compute functionality.

Install Visual Studio.

Install Visual Studio, which the oneAPI environment setup on Windows depends on.

Install the Intel® oneAPI Base Toolkit.

a. Follow the procedure in Get the Intel® oneAPI Base Toolkit.

We recommend installing to the default folder: C:\Program Files (x86)\Intel\oneAPI.

The rest of this guide uses the default folder as an example. If you use a different folder, adjust the paths accordingly.

b. Enable the oneAPI running environment, using either of the following:

In Search, type 'oneAPI', then open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022".

In Run, open CMD and execute:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

c. Check GPU

In oneAPI command line:

sycl-ls

There should be one or more level-zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].

Output (example):

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]

Install cmake & make

a. Download & install cmake for Windows: https://cmake.org/download/

b. Download & install mingw-w64 make for Windows provided by w64devkit

Download the latest Fortran version of w64devkit.
Extract w64devkit on your PC.
Add the bin folder path to the Windows system PATH environment variable, e.g. C:\xxx\w64devkit\bin\.

Build locally:

In oneAPI command line window:

if not exist build mkdir build
cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

::  for FP16
::  faster for long-prompt inference
::  cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

::  for FP32
cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release


::  build example/main only
::  make main

::  build all binaries
make -j
cd ..

Or build with the script:

.\examples\sycl\win-build-sycl.bat

Note:

By default, all binaries are built, which takes more time. To reduce build time, we recommend building example/main only.

Run

Put the model file in the models folder

You can download llama-2-7b.Q4_0.gguf as an example.

Enable the oneAPI running environment, using either of the following:

In Search, type 'oneAPI', then open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022".

In Run, open CMD and execute:

"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

List device IDs

Run without parameters:

build\bin\ls-sycl-device.exe

or

build\bin\main.exe

Check the ID in the startup log, for example:

found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,	compute capability 1.3,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
    max compute_units 24,	max work group size 67108864,	max sub group size 64,	global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,	compute capability 3.0,
    max compute_units 24,	max work group size 8192,	max sub group size 64,	global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,	compute capability 3.0,
    max compute_units 512,	max work group size 1024,	max sub group size 32,	global mem size 16225243136

Set the device ID and run llama.cpp

Set device ID 0 with set GGML_SYCL_DEVICE=0:

set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0

or run the script:

.\examples\sycl\win-run-llama2.bat

Note:

By default, mmap is used to read the model file. In some cases this causes a hang. We recommend the --no-mmap parameter to disable mmap() and avoid this issue.

Check the device ID in the output

For example:

Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device

Environment Variable

Build

Running

Known Issue

Hang during startup

llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy behaves abnormally and blocks.

Solution: add --no-mmap or --mmap 0.

Q&A

Error: error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory.

The oneAPI running environment has not been enabled.

Install the oneAPI Base Toolkit and enable it with: source /opt/intel/oneapi/setvars.sh.

On Windows, there is no output and no error.

The oneAPI running environment has not been enabled.

A compile error occurs.

Remove the build folder and try again.

I cannot see [ext_oneapi_level_zero:gpu:0] after installing the GPU driver on Linux.

Run sudo sycl-ls.

If the device appears in the result, add your user to the video and render groups:

  sudo usermod -aG render username
  sudo usermod -aG video username

Then log out and back in.

If it does not appear, check the GPU driver installation steps again.
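
Before running usermod, it can be useful to see which of the two groups the current user is actually missing. The sketch below is one way to do that. The function name `missing_gpu_groups` is hypothetical, and it expects a space-separated group list on stdin, such as the output of `id -nG`.

```shell
# Hypothetical helper for illustration only.
# Reads a space-separated list of group names on stdin and prints
# whichever of "render" and "video" is not in the list, one per line.
missing_gpu_groups() {
    groups_list=" $(cat) "
    for g in render video; do
        case "$groups_list" in
            *" $g "*) ;;                 # already a member
            *) printf '%s\n' "$g" ;;     # not a member yet
        esac
    done
}
```

Running `id -nG | missing_gpu_groups` prints nothing when the user already belongs to both groups. Group changes made with usermod only show up after logging in again.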

Todo

Support multiple cards.
