Using the `-d <n>` option, each test can be run at a specified context depth, prefilling the KV cache with `<n>` tokens.
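
For instance, a minimal sketch of a depth sweep (the model path is illustrative; like other `llama-bench` test parameters, `-d` is assumed here to accept a comma-separated list of values):

```sh
# Benchmark generation from an empty context, then again with
# 4096 tokens prefilled into the KV cache
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -n 128 -d 0,4096
```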
For a description of the other options, see the [main example](../main/README.md).

> [!NOTE]
> The measurements with `llama-bench` do not include the times for tokenization and for sampling.

## Examples

### Text generation with different models

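A minimal sketch of such a run, assuming two quantized models at illustrative paths (test parameters such as `-m` can be given multiple times to benchmark several configurations in one invocation):

```sh
# Compare text generation throughput across two models
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -n 128
```
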
### Different numbers of threads

```sh
$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
```

In the `test` column, `pp` rows report prompt processing speed and `tg` rows report text generation speed, both in tokens per second (t/s):

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |

### Different numbers of layers offloaded to the GPU
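
A sketch of this kind of sweep, assuming a GPU-enabled build and an illustrative model path (`-ngl` sets the number of layers offloaded to the GPU):

```sh
# Compare throughput as more layers are offloaded to the GPU
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -p 512 -n 128 -ngl 0,16,32
```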