
[SYCL] update guide of SYCL backend (#5254)

* update guide for make installation, memory, gguf model link, rm todo for windows build

* add vs install requirement

* update for gpu device check

* update help of llama-bench

* fix grammar issues
Neo Zhang Jianyu 1 year ago
commit af3ba5d946

+ 55 - 9
README-sycl.md

@@ -42,6 +42,8 @@ For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
 
 ## Intel GPU
 
+### Verified
+
 |Intel GPU| Status | Verified Model|
 |-|-|-|
 |Intel Data Center Max Series| Support| Max 1550|
@@ -50,6 +52,17 @@ For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
 |Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
 |Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|
 
+Note: If the iGPU has fewer than 80 EUs (Execution Units), the inference speed will be too slow to be usable.
+
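+A rough way to check the EU count on Linux is with `clinfo` (assuming it is installed); on Intel GPUs its "Max compute units" field reports the number of EUs:
+
+```
+clinfo | grep -i "max compute units"
+```
+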
+### Memory
+
+Memory is a key limitation when running LLMs on GPUs.
+
+When llama.cpp runs, it prints a log line showing the memory allocated on the GPU, so you can see how much memory your case needs. For example: `llm_load_tensors:            buffer size =  3577.56 MiB`.
+
+For an iGPU, please make sure enough host memory can be shared with the GPU. For llama-2-7b.Q4_0, 8GB+ of host memory is recommended.
+
+For a dGPU, please make sure the device memory is sufficient. For llama-2-7b.Q4_0, 4GB+ of device memory is recommended.
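+
+For example, a quick way to pull the reported GPU buffer size out of a run (a sketch; the binary path and flags depend on your build):
+
+```
+./build/bin/main -m models/llama-2-7b.Q4_0.gguf -n 32 -ngl 33 2>&1 | grep "buffer size"
+```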
 
 ## Linux
 
@@ -105,7 +118,7 @@ source /opt/intel/oneapi/setvars.sh
 sycl-ls
 ```
 
-There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.
+There should be one or more level-zero devices. Please confirm that at least one GPU is present, like **[ext_oneapi_level_zero:gpu:0]**.
 
 Output (example):
 ```
@@ -152,6 +165,8 @@ Note:
 
 1. Put model file to folder **models**
 
+You could download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) as an example.
+
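+For example, with `wget` (note the `resolve/` form of the URL for a direct download):
+
+```
+wget -P models https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
+```
+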
 2. Enable oneAPI running environment
 
 ```
@@ -223,7 +238,13 @@ Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
 
 Please install Intel GPU driver by official guide: [Install GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
 
-2. Install Intel® oneAPI Base toolkit.
+Note: **The driver is mandatory for compute functionality**.
+
+2. Install Visual Studio.
+
+Please install [Visual Studio](https://visualstudio.microsoft.com/), which is required to enable the oneAPI environment on Windows.
+
+3. Install Intel® oneAPI Base toolkit.
 
 a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
 
@@ -252,7 +273,7 @@ In oneAPI command line:
 sycl-ls
 ```
 
-There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.
+There should be one or more level-zero devices. Please confirm that at least one GPU is present, like **[ext_oneapi_level_zero:gpu:0]**.
 
 Output (example):
 ```
@@ -260,15 +281,21 @@ Output (example):
 [opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
 [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
 [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
-
 ```
 
-3. Install cmake & make
+4. Install cmake & make
+
+a. Download & install cmake for Windows: https://cmake.org/download/
 
-a. Download & install cmake for windows: https://cmake.org/download/
+b. Download & install make for Windows provided by mingw-w64
 
-b. Download & install make for windows provided by mingw-w64: https://www.mingw-w64.org/downloads/
+- Download the binary package for Windows from https://github.com/niXman/mingw-builds-binaries/releases.
 
+  For example, [x86_64-13.2.0-release-win32-seh-msvcrt-rt_v11-rev1.7z](https://github.com/niXman/mingw-builds-binaries/releases/download/13.2.0-rt_v11-rev1/x86_64-13.2.0-release-win32-seh-msvcrt-rt_v11-rev1.7z).
+
+- Unzip the binary package. In the **bin** sub-folder, rename **xxx-make.exe** to **make.exe**.
+
+- Add the **bin** folder path to the Windows system PATH environment variable.
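+
+To verify the tools are found (a quick check in a new command prompt, assuming the PATH change has taken effect):
+
+```
+cmake --version
+make -v
+```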
 
 ### Build locally:
 
@@ -309,6 +336,8 @@ Note:
 
 1. Put model file to folder **models**
 
+You could download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) as an example.
+
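+For example, with the `curl.exe` that ships with recent Windows (note the `resolve/` form of the URL for a direct download):
+
+```
+curl.exe -L -o models\llama-2-7b.Q4_0.gguf https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
+```
+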
 2. Enable oneAPI running environment
 
 - In Search, input 'oneAPI'.
@@ -419,8 +448,25 @@ Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
 
   Miss to enable oneAPI running environment.
 
-## Todo
+- Meet a compile error.
+
+  Remove the folder **build** and try again.
+
+- I can **not** see **[ext_oneapi_level_zero:gpu:0]** after installing the GPU driver in Linux.
 
-- Support to build in Windows.
+  Please run **sudo sycl-ls**.
+
+  If you see it in the result, please add the video and render groups to your user:
+
+  ```
+  sudo usermod -aG render username
+  sudo usermod -aG video username
+  ```
+
+  Then **relogin**.
+
+  If you do not see it, please check the GPU installation steps again.
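+
+  After relogin, you can confirm the group change took effect (assuming `username` is your login):
+
+  ```
+  groups username
+  ```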
+
+## Todo
 
 - Support multiple cards.

+ 21 - 13
examples/llama-bench/README.md

@@ -23,19 +23,23 @@ usage: ./llama-bench [options]
 
 options:
   -h, --help
-  -m, --model <filename>            (default: models/7B/ggml-model-q4_0.gguf)
-  -p, --n-prompt <n>                (default: 512)
-  -n, --n-gen <n>                   (default: 128)
-  -b, --batch-size <n>              (default: 512)
-  --memory-f32 <0|1>                (default: 0)
-  -t, --threads <n>                 (default: 16)
-  -ngl N, --n-gpu-layers <n>        (default: 99)
-  -mg i, --main-gpu <i>             (default: 0)
-  -mmq, --mul-mat-q <0|1>           (default: 1)
-  -ts, --tensor_split <ts0/ts1/..>
-  -r, --repetitions <n>             (default: 5)
-  -o, --output <csv|json|md|sql>    (default: md)
-  -v, --verbose                     (default: 0)
+  -m, --model <filename>              (default: models/7B/ggml-model-q4_0.gguf)
+  -p, --n-prompt <n>                  (default: 512)
+  -n, --n-gen <n>                     (default: 128)
+  -b, --batch-size <n>                (default: 512)
+  -ctk <t>, --cache-type-k <t>        (default: f16)
+  -ctv <t>, --cache-type-v <t>        (default: f16)
+  -t, --threads <n>                   (default: 112)
+  -ngl, --n-gpu-layers <n>            (default: 99)
+  -sm, --split-mode <none|layer|row>  (default: layer)
+  -mg, --main-gpu <i>                 (default: 0)
+  -nkvo, --no-kv-offload <0|1>        (default: 0)
+  -mmp, --mmap <0|1>                  (default: 1)
+  -mmq, --mul-mat-q <0|1>             (default: 1)
+  -ts, --tensor_split <ts0/ts1/..>    (default: 0)
+  -r, --repetitions <n>               (default: 5)
+  -o, --output <csv|json|md|sql>      (default: md)
+  -v, --verbose                       (default: 0)
 
 Multiple values can be given for each parameter by separating them with ',' or by specifying the parameter multiple times.
 ```
@@ -51,6 +55,10 @@ Each test is repeated the number of times given by `-r`, and the results are averaged.
 
 For a description of the other options, see the [main example](../main/README.md).
 
+Note:
+
+- When using the SYCL backend, a hang can occur in some cases. Please set `-mmp 0`.
+
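+  For example (a sketch; the model path is an assumption):
+
+  ```
+  ./llama-bench -m models/llama-2-7b.Q4_0.gguf -ngl 99 -mmp 0
+  ```
+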
 ## Examples
 
 ### Text generation with different models

+ 1 - 1
examples/sycl/win-run-llama2.bat

@@ -2,7 +2,7 @@
 ::  Copyright (C) 2024 Intel Corporation
 ::  SPDX-License-Identifier: MIT
 
-INPUT2="Building a website can be done in 10 simple steps:\nStep 1:"
+set INPUT2="Building a website can be done in 10 simple steps:\nStep 1:"
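+::  Note: in a batch script a bare NAME=value line is parsed as a command, not
+::  an assignment; "set" is required for %INPUT2% to be defined for later use.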
 @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force