llama.cpp/examples/llama-bench

Performance testing tool for llama.cpp.

Table of contents

  1. Syntax
  2. Examples
    1. Text generation with different models
    2. Prompt processing with different batch sizes
    3. Different numbers of threads
    4. Different numbers of layers offloaded to the GPU
  3. Output formats
    1. Markdown
    2. CSV
    3. JSON
    4. JSONL
    5. SQL

Syntax

usage: ./llama-bench [options]

options:
  -h, --help
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -t, --threads <n>                         (default: 8)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -rpc, --rpc <rpc_servers>                 (default: )
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -mmp, --mmap <0|1>                        (default: 1)
  --numa <distribute|isolate|numactl>       (default: disabled)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -r, --repetitions <n>                     (default: 5)
  --prio <0|1|2|3>                          (default: 0)
  --delay <0...N> (seconds)                 (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> (default: none)
  -v, --verbose                             (default: 0)

Multiple values can be given for each parameter by separating them with ',' or by specifying the parameter multiple times.

llama-bench can perform three types of tests:

  • Prompt processing (pp): processing a prompt in batches (-p)
  • Text generation (tg): generating a sequence of tokens (-n)
  • Prompt processing + text generation (pg): processing a prompt followed by generating a sequence of tokens (-pg)

With the exception of -r, -o and -v, all options can be specified multiple times to run multiple tests. Each pp and tg test is run with all combinations of the specified options. To specify multiple values for an option, the values can be separated by commas (e.g. -n 16,32), or the option can be specified multiple times (e.g. -n 16 -n 32).
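The resulting test matrix is the Cartesian product of the value lists. As an illustration (a sketch, not llama-bench code), with hypothetical values as if passed as -n 16,32 -t 8,16:

```python
from itertools import product

# Hypothetical option values, as if given on the command line as: -n 16,32 -t 8,16
n_gen_values = [16, 32]
thread_values = [8, 16]

# llama-bench runs one tg test per combination of the specified values
tests = [{"n_gen": n, "threads": t} for n, t in product(n_gen_values, thread_values)]
for t in tests:
    print(t)
print(len(tests), "tg tests")  # 4 tg tests
```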

Each test is repeated the number of times given by -r, and the results are averaged. The results are given in average tokens per second (t/s) and standard deviation. Some output formats (e.g. json) also include the individual results of each repetition.
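The aggregation can be sketched as follows (illustrative only, not llama-bench's implementation; the sample times are taken from the JSON example later in this document, where each entry in samples_ns is one repetition of a pp 512 test):

```python
import statistics

# Per-repetition wall times in nanoseconds for one pp 512 test
samples_ns = [213837238, 211635853, 212328053, 211329715, 212698907]
n_tokens = 512  # n_prompt for a pp test (n_gen for a tg test)

# Convert each repetition to tokens/second, then report mean and sample stddev
samples_ts = [n_tokens / (ns * 1e-9) for ns in samples_ns]
avg_ts = statistics.mean(samples_ts)
stddev_ts = statistics.stdev(samples_ts)
print(f"{avg_ts:.2f} ± {stddev_ts:.2f}")  # → 2410.97 ± 11.16
```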

For a description of the other options, see the main example.

Note:

  • When using the SYCL backend, a hang can occur in some cases. Please set -mmp 0 to disable memory mapping.

Examples

Text generation with different models

$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
| model | size | params | backend | ngl | test | t/s |
| ----- | ---: | -----: | ------- | --: | ---- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |

Prompt processing with different batch sizes

$ ./llama-bench -n 0 -p 1024 -b 128,256,512,1024
| model | size | params | backend | ngl | n_batch | test | t/s |
| ----- | ---: | -----: | ------- | --: | ------: | ---- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |

Different numbers of threads

$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
| model | size | params | backend | threads | test | t/s |
| ----- | ---: | -----: | ------- | ------: | ---- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 64 | 32.29 ± 1.21 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 16 | 16.71 ± 0.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |

Different numbers of layers offloaded to the GPU

$ ./llama-bench -ngl 10,20,30,31,32,33,34,35
| model | size | params | backend | ngl | test | t/s |
| ----- | ---: | -----: | ------- | --: | ---- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | pp 512 | 657.89 ± 5.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | tg 128 | 48.19 ± 0.81 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | pp 512 | 688.26 ± 3.29 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | tg 128 | 54.78 ± 0.65 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | pp 512 | 704.27 ± 2.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | tg 128 | 60.62 ± 1.76 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | pp 512 | 881.34 ± 5.40 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | tg 128 | 71.76 ± 0.23 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | pp 512 | 2400.01 ± 7.72 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | tg 128 | 131.66 ± 0.49 |

Output formats

By default, llama-bench outputs the results in markdown format. The results can be written in other formats by using the -o option.

Markdown

$ ./llama-bench -o md
| model | size | params | backend | ngl | test | t/s |
| ----- | ---: | -----: | ------- | --: | ---- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2368.80 ± 93.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 131.42 ± 0.59 |

CSV

$ ./llama-bench -o csv
build_commit,build_number,cuda,metal,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_threads,f16_kv,n_gpu_layers,main_gpu,mul_mat_q,tensor_split,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"3469684","1275","1","0","0","1","1","13th Gen Intel(R) Core(TM) i9-13900K","NVIDIA GeForce RTX 3090 Ti","models/7B/ggml-model-q4_0.gguf","llama 7B mostly Q4_0","3825065984","6738415616","512","16","1","99","0","1","0.00","512","0","2023-09-23T12:09:01Z","212155977","732372","2413.341687","8.305961"
"3469684","1275","1","0","0","1","1","13th Gen Intel(R) Core(TM) i9-13900K","NVIDIA GeForce RTX 3090 Ti","models/7B/ggml-model-q4_0.gguf","llama 7B mostly Q4_0","3825065984","6738415616","512","16","1","99","0","1","0.00","0","128","2023-09-23T12:09:02Z","969320879","2728399","132.052051","0.371342"

JSON

$ ./llama-bench -o json
[
  {
    "build_commit": "3469684",
    "build_number": 1275,
    "cuda": true,
    "metal": false,
    "gpu_blas": true,
    "blas": true,
    "cpu_info": "13th Gen Intel(R) Core(TM) i9-13900K",
    "gpu_info": "NVIDIA GeForce RTX 3090 Ti",
    "model_filename": "models/7B/ggml-model-q4_0.gguf",
    "model_type": "llama 7B mostly Q4_0",
    "model_size": 3825065984,
    "model_n_params": 6738415616,
    "n_batch": 512,
    "n_threads": 16,
    "f16_kv": true,
    "n_gpu_layers": 99,
    "main_gpu": 0,
    "mul_mat_q": true,
    "tensor_split": "0.00",
    "n_prompt": 512,
    "n_gen": 0,
    "test_time": "2023-09-23T12:09:57Z",
    "avg_ns": 212365953,
    "stddev_ns": 985423,
    "avg_ts": 2410.974041,
    "stddev_ts": 11.163766,
    "samples_ns": [ 213837238, 211635853, 212328053, 211329715, 212698907 ],
    "samples_ts": [ 2394.34, 2419.25, 2411.36, 2422.75, 2407.16 ]
  },
  {
    "build_commit": "3469684",
    "build_number": 1275,
    "cuda": true,
    "metal": false,
    "gpu_blas": true,
    "blas": true,
    "cpu_info": "13th Gen Intel(R) Core(TM) i9-13900K",
    "gpu_info": "NVIDIA GeForce RTX 3090 Ti",
    "model_filename": "models/7B/ggml-model-q4_0.gguf",
    "model_type": "llama 7B mostly Q4_0",
    "model_size": 3825065984,
    "model_n_params": 6738415616,
    "n_batch": 512,
    "n_threads": 16,
    "f16_kv": true,
    "n_gpu_layers": 99,
    "main_gpu": 0,
    "mul_mat_q": true,
    "tensor_split": "0.00",
    "n_prompt": 0,
    "n_gen": 128,
    "test_time": "2023-09-23T12:09:59Z",
    "avg_ns": 977425219,
    "stddev_ns": 9268593,
    "avg_ts": 130.965708,
    "stddev_ts": 1.238924,
    "samples_ns": [ 984472709, 974901233, 989474741, 970729355, 967548060 ],
    "samples_ts": [ 130.019, 131.295, 129.362, 131.86, 132.293 ]
  }
]

JSONL

$ ./llama-bench -o jsonl
{"build_commit":"3469684","build_number":1275,"cuda":true,"metal":false,"gpu_blas":true,"blas":true,"cpu_info":"13th Gen Intel(R) Core(TM) i9-13900K","gpu_info":"NVIDIA GeForce RTX 3090 Ti","model_filename":"models/7B/ggml-model-q4_0.gguf","model_type":"llama 7B mostly Q4_0","model_size":3825065984,"model_n_params":6738415616,"n_batch":512,"n_threads":16,"f16_kv":true,"n_gpu_layers":99,"main_gpu":0,"mul_mat_q":true,"tensor_split":"0.00","n_prompt":512,"n_gen":0,"test_time":"2023-09-23T12:09:57Z","avg_ns":212365953,"stddev_ns":985423,"avg_ts":2410.974041,"stddev_ts":11.163766,"samples_ns":[213837238,211635853,212328053,211329715,212698907],"samples_ts":[2394.34,2419.25,2411.36,2422.75,2407.16]}
{"build_commit":"3469684","build_number":1275,"cuda":true,"metal":false,"gpu_blas":true,"blas":true,"cpu_info":"13th Gen Intel(R) Core(TM) i9-13900K","gpu_info":"NVIDIA GeForce RTX 3090 Ti","model_filename":"models/7B/ggml-model-q4_0.gguf","model_type":"llama 7B mostly Q4_0","model_size":3825065984,"model_n_params":6738415616,"n_batch":512,"n_threads":16,"f16_kv":true,"n_gpu_layers":99,"main_gpu":0,"mul_mat_q":true,"tensor_split":"0.00","n_prompt":0,"n_gen":128,"test_time":"2023-09-23T12:09:59Z","avg_ns":977425219,"stddev_ns":9268593,"avg_ts":130.965708,"stddev_ts":1.238924,"samples_ns":[984472709,974901233,989474741,970729355,967548060],"samples_ts":[130.019,131.295,129.362,131.86,132.293]}
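Because each line is an independent JSON object, JSONL is convenient for scripting. A minimal consumer sketch (the record below is abridged to a few of the real fields shown above):

```python
import json

# Abridged record in the shape emitted by `llama-bench -o jsonl` (only a few fields kept)
line = '{"model_type":"llama 7B mostly Q4_0","n_prompt":512,"n_gen":0,"avg_ts":2410.974041,"stddev_ts":11.163766}'

rec = json.loads(line)
# A record with n_prompt > 0 and n_gen == 0 is a pp test; the reverse is a tg test
test = f"pp {rec['n_prompt']}" if rec["n_prompt"] else f"tg {rec['n_gen']}"
print(f"{rec['model_type']}: {test}: {rec['avg_ts']:.2f} ± {rec['stddev_ts']:.2f} t/s")
# → llama 7B mostly Q4_0: pp 512: 2410.97 ± 11.16 t/s
```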

SQL

SQL output is suitable for importing into a SQLite database. The output can be piped into the sqlite3 command line tool to add the results to a database.

$ ./llama-bench -o sql
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  build_number INTEGER,
  cuda INTEGER,
  metal INTEGER,
  gpu_blas INTEGER,
  blas INTEGER,
  cpu_info TEXT,
  gpu_info TEXT,
  model_filename TEXT,
  model_type TEXT,
  model_size INTEGER,
  model_n_params INTEGER,
  n_batch INTEGER,
  n_threads INTEGER,
  f16_kv INTEGER,
  n_gpu_layers INTEGER,
  main_gpu INTEGER,
  mul_mat_q INTEGER,
  tensor_split TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  test_time TEXT,
  avg_ns INTEGER,
  stddev_ns INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);
INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '512', '0', '2023-09-23T12:10:30Z', '212693772', '743623', '2407.240204', '8.409634');
INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '0', '128', '2023-09-23T12:10:31Z', '977925003', '4037361', '130.891159', '0.537692');
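The generated statements can also be loaded and queried with Python's built-in sqlite3 module. A sketch (the SQL string below is abridged to three of the real columns):

```python
import sqlite3

# SQL in the shape emitted by `llama-bench -o sql`, abridged to three columns.
# llama-bench quotes all values; SQLite's type affinity converts them to the
# declared INTEGER/REAL column types on insert.
sql = """
CREATE TABLE IF NOT EXISTS test (n_prompt INTEGER, n_gen INTEGER, avg_ts REAL);
INSERT INTO test (n_prompt, n_gen, avg_ts) VALUES ('512', '0', '2407.240204');
INSERT INTO test (n_prompt, n_gen, avg_ts) VALUES ('0', '128', '130.891159');
"""

con = sqlite3.connect(":memory:")  # use a file path to keep results across runs
con.executescript(sql)
for n_prompt, n_gen, avg_ts in con.execute(
    "SELECT n_prompt, n_gen, avg_ts FROM test ORDER BY avg_ts"
):
    print(n_prompt, n_gen, round(avg_ts, 2))
# → 0 128 130.89
# → 512 0 2407.24
```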
