LLM-Benches
Benchmark results for models available via API
MT-Bench Ru
########## First turn ##########
score
model turn
gpt-4 1 9.300000
gpt-4o-mini 1 9.112500
GigaChat-Pro-w_filter 1 8.975000
gpt-4o 1 8.868750
GigaChat-Max-wo_filter 1 8.787500
GigaChat-Max-w_filter 1 8.625000
GigaChat-Max-wo_filter-OS 1 8.600000
GigaChat-Pro-wo_filter 1 8.562500
GigaChat-Pro-wo_filter-OS 1 8.512500
GigaChat-w_filter 1 8.443750
gpt-3.5-turbo 1 8.337500
GigaChat-wo_filter 1 8.306250
GigaChat-wo_filter-OS 1 8.151899
YaGPT-Pro-v4-OS 1 7.631250
YaGPT-Lite-v4 1 7.456250
YaGPT-Lite-v4-OS 1 7.206250
YaGPT-Pro-v4 1 7.200000
########## Second turn ##########
score
model turn
gpt-4 2 9.025000
gpt-4o 2 8.762500
gpt-4o-mini 2 8.500000
GigaChat-Max-wo_filter 2 8.287500
GigaChat-Max-w_filter 2 8.175000
GigaChat-Max-wo_filter-OS 2 8.162500
gpt-3.5-turbo 2 8.025000
GigaChat-Pro-wo_filter 2 7.587500
GigaChat-Pro-wo_filter-OS 2 7.575000
YaGPT-Pro-v4-OS 2 7.450000
GigaChat-Pro-w_filter 2 7.450000
GigaChat-wo_filter-OS 2 7.205128
GigaChat-w_filter 2 7.100000
YaGPT-Pro-v4 2 7.012500
GigaChat-wo_filter 2 6.936709
YaGPT-Lite-v4 2 6.775000
YaGPT-Lite-v4-OS 2 6.537500
########## Average ##########
score
model
gpt-4 9.162500
gpt-4o 8.815625
gpt-4o-mini 8.806250
GigaChat-Max-wo_filter 8.537500
GigaChat-Max-w_filter 8.400000
GigaChat-Max-wo_filter-OS 8.381250
GigaChat-Pro-w_filter 8.212500
gpt-3.5-turbo 8.181250
GigaChat-Pro-wo_filter 8.075000
GigaChat-Pro-wo_filter-OS 8.043750
GigaChat-w_filter 7.771875
GigaChat-wo_filter-OS 7.681529
GigaChat-wo_filter 7.625786
YaGPT-Pro-v4-OS 7.540625
YaGPT-Lite-v4 7.115625
YaGPT-Pro-v4 7.106250
YaGPT-Lite-v4-OS 6.871875
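The Average score is not always the plain mean of the two turn scores. For GigaChat-wo_filter-OS, (8.151899 + 7.205128) / 2 ≈ 7.678514, while the table reports 7.681529. One explanation consistent with the numbers is that the average is pooled over individual judgments and the two turns have different judgment counts. The sketch below illustrates that reading. The counts (80, 79, 78) are an assumption inferred from the fractional scores, not stated in the source, and `pooled_average` is a hypothetical helper.

```python
def pooled_average(turn_means, turn_counts):
    """Average over all judgments, weighting each turn mean by its judgment count."""
    total = sum(mean * count for mean, count in zip(turn_means, turn_counts))
    return total / sum(turn_counts)


def simple_average(turn_means):
    """Unweighted mean of the per-turn means."""
    return sum(turn_means) / len(turn_means)


# gpt-4: with equal counts per turn (assumed 80 each), both averages coincide.
gpt4_pooled = pooled_average([9.300000, 9.025000], [80, 80])

# GigaChat-wo_filter-OS: ASSUMED 79 and 78 judgments, inferred from the scores.
giga_means = [8.151899, 7.205128]
giga_pooled = pooled_average(giga_means, [79, 78])
giga_simple = simple_average(giga_means)

print(f"gpt-4 pooled:                 {gpt4_pooled:.6f}")
print(f"GigaChat-wo_filter-OS pooled: {giga_pooled:.6f}")
print(f"GigaChat-wo_filter-OS simple: {giga_simple:.6f}")
```

Under these assumed counts the pooled value matches the reported average, while the simple mean does not.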
Arena-Hard-Auto v0.1 En
gpt-4o | score: 71.5 | 95% CI: (-0.3, 0.8) | average #tokens: 607
gpt-4o-mini | score: 67.1 | 95% CI: (-1.2, 1.6) | average #tokens: 685
GigaChat-Max-wo_filter | score: 51.9 | 95% CI: (-1.5, 0.1) | average #tokens: 619
GigaChat-Max-w_filter | score: 50.7 | 95% CI: (-1.1, 1.6) | average #tokens: 607
gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423
GigaChat-Max-wo_filter-OS | score: 48.5 | 95% CI: (-1.2, 2.2) | average #tokens: 529
gpt-4-0613 | score: 34.8 | 95% CI: (-0.8, 0.3) | average #tokens: 354
gpt-4 | score: 32.7 | 95% CI: (-3.4, 3.8) | average #tokens: 350
gpt-3.5-turbo | score: 21.1 | 95% CI: (-0.8, 1.1) | average #tokens: 327
gpt-3.5-turbo-0125 | score: 21.0 | 95% CI: (-0.9, 0.6) | average #tokens: 329
GigaChat-Pro-wo_filter-OS | score: 19.9 | 95% CI: (-0.4, 0.9) | average #tokens: 533
GigaChat-Pro-wo_filter | score: 18.0 | 95% CI: (-1.1, 0.5) | average #tokens: 770
GigaChat-wo_filter-OS | score: 11.2 | 95% CI: (-1.0, 0.8) | average #tokens: 600
GigaChat-wo_filter | score: 11.0 | 95% CI: (-1.0, 0.6) | average #tokens: 764
YaGPT-Pro-v4 | score: 10.7 | 95% CI: (-1.0, 0.6) | average #tokens: 424
YaGPT-Pro-v4-OS | score: 9.6 | 95% CI: (-1.1, 1.2) | average #tokens: 391
YaGPT-Lite-v4 | score: 6.7 | 95% CI: (-0.5, 0.2) | average #tokens: 356
YaGPT-Lite-v4-OS | score: 4.8 | 95% CI: (-0.8, 0.9) | average #tokens: 332
Arena-Hard-Auto v0.1 Ru
gpt-4-1106-preview | score: 91.9 | 95% CI: (-1.1, 1.1) | average #tokens: 830
gpt-4o | score: 83.2 | 95% CI: (-1.2, 1.4) | average #tokens: 792
gpt-4o-mini | score: 80.2 | 95% CI: (-1.9, 2.1) | average #tokens: 842
GigaChat-Max-wo_filter | score: 73.3 | 95% CI: (-1.5, 1.2) | average #tokens: 943
GigaChat-Max-w_filter | score: 70.7 | 95% CI: (-1.5, 1.3) | average #tokens: 930
GigaChat-Max-wo_filter-OS | score: 67.2 | 95% CI: (-0.7, 1.9) | average #tokens: 790
gpt-4 | score: 55.1 | 95% CI: (-3.5, 3.1) | average #tokens: 461
gpt-4-0613 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 458
GigaChat-Pro-wo_filter | score: 45.5 | 95% CI: (-1.0, 2.1) | average #tokens: 1012
gpt-3.5-turbo | score: 36.7 | 95% CI: (-2.5, 1.4) | average #tokens: 428
GigaChat-Pro-wo_filter-OS | score: 32.8 | 95% CI: (-1.5, 2.1) | average #tokens: 710
GigaChat-wo_filter | score: 25.9 | 95% CI: (-1.8, 1.7) | average #tokens: 1159
YaGPT-Pro-v4 | score: 23.7 | 95% CI: (-1.1, 0.7) | average #tokens: 709
YaGPT-Pro-v4-OS | score: 21.7 | 95% CI: (-1.9, 1.1) | average #tokens: 608
GigaChat-wo_filter-OS | score: 20.9 | 95% CI: (-1.5, 0.9) | average #tokens: 882
YaGPT-Lite-v4 | score: 18.3 | 95% CI: (-1.1, 1.0) | average #tokens: 548
YaGPT-Lite-v4-OS | score: 16.9 | 95% CI: (-0.9, 1.2) | average #tokens: 495
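The Arena-Hard-Auto rows above can be read programmatically. The following is a minimal sketch, assuming the 95% CI values are offsets relative to the score (which the (0.0, 0.0) interval on the 50.0 baseline rows suggests). `parse_result_line` and its field names are hypothetical and not part of any benchmark tooling.

```python
import re

# Matches lines such as:
# "gpt-4o | score: 83.2 | 95% CI: (-1.2, 1.4) | average #tokens: 792"
RESULT_LINE = re.compile(
    r"^(?P<model>\S+) \| score: (?P<score>-?\d+(?:\.\d+)?) \| "
    r"95% CI: \((?P<a>-?\d+(?:\.\d+)?), (?P<b>-?\d+(?:\.\d+)?)\) \| "
    r"average #tokens: (?P<tokens>\d+)$"
)


def parse_result_line(line):
    """Parse one result row and convert the CI offsets to absolute bounds."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    score = float(match["score"])
    # Sort the offsets so the smaller one is always treated as the lower bound.
    low_offset, high_offset = sorted((float(match["a"]), float(match["b"])))
    return {
        "model": match["model"],
        "score": score,
        "ci_lower": round(score + low_offset, 1),
        "ci_upper": round(score + high_offset, 1),
        "avg_tokens": int(match["tokens"]),
    }


row = parse_result_line(
    "gpt-4o | score: 83.2 | 95% CI: (-1.2, 1.4) | average #tokens: 792"
)
print(row)
```

Absolute bounds make it easier to see which neighboring models have overlapping intervals and should not be ranked against each other too confidently.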
Useful links:
- GigaChat API documentation
- Python library for the GigaChat API: GigaChat
- GigaChat integration with LangChain: langchain-gigachat
Notes:
- The benchmarks used the models available via API on October 30, 2024;
- The `OS` suffix means the OpenAI system prompt was added to the request; without the suffix, no system prompt was added (the default system message is used);
- The `w_filter` suffix means the safety filter was enabled during the run; `wo_filter` means the safety filter was disabled;
- In MT-Bench, the temperature varied by question category according to the config; for questions not covered by the config, the default `temperature=0.7` was used;
- In Arena-Hard-Auto, greedy generation parameters were used: `temperature=0` for YaGPT/OpenAI and `temperature=1, top_p=0` for GigaChat;
- GigaChat/GigaChat-Pro/GigaChat-Max models are available in preview mode;
- The YaGPT run results are not published, so as not to violate the terms of use.
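The naming and generation conventions from the notes can be summarized in a short sketch. All function names are hypothetical. The per-category MT-Bench temperatures are an assumption modeled on FastChat's llm_judge defaults and may differ from the config actually used; only the 0.7 default and the Arena-Hard-Auto parameters come from the notes.

```python
MT_BENCH_DEFAULT_TEMPERATURE = 0.7  # default stated in the notes

# ASSUMPTION: illustrative per-category values, not confirmed by the source.
ASSUMED_CATEGORY_TEMPERATURES = {
    "writing": 0.7, "roleplay": 0.7, "extraction": 0.0, "math": 0.0,
    "coding": 0.0, "reasoning": 0.0, "stem": 0.1, "humanities": 0.1,
}


def parse_label(label):
    """Split a result label into base model name, filter flag, and OS flag."""
    parts = label.split("-")
    openai_system_prompt = parts[-1] == "OS"
    if openai_system_prompt:
        parts = parts[:-1]
    safety_filter = None  # None: the label carries no filter suffix
    if parts[-1] in ("w_filter", "wo_filter"):
        safety_filter = parts[-1] == "w_filter"
        parts = parts[:-1]
    return {
        "model": "-".join(parts),
        "safety_filter": safety_filter,
        "openai_system_prompt": openai_system_prompt,
    }


def arena_hard_params(model):
    """Greedy generation parameters per model family, as listed in the notes."""
    if model.startswith("GigaChat"):
        return {"temperature": 1, "top_p": 0}
    return {"temperature": 0}  # YaGPT and OpenAI models


def mt_bench_temperature(category, config=ASSUMED_CATEGORY_TEMPERATURES):
    """Look up the category temperature, falling back to the 0.7 default."""
    return config.get(category, MT_BENCH_DEFAULT_TEMPERATURE)


info = parse_label("GigaChat-Max-wo_filter-OS")
print(info, arena_hard_params(info["model"]))
print(parse_label("gpt-4o-mini"), mt_bench_temperature("unknown-category"))
```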