Comparing models on tool-calling quality and cost
BFCL (Berkeley, ICML 2025) is the standard benchmark for evaluating tool calling. Multi-Turn is column #11 on the leaderboard. It tests multi-step scenarios: chained function calls across several turns, context retention, handling of missing functions/parameters, and long context. FC mode (native function calling).
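For readers unfamiliar with the FC setting, a multi-turn native-function-calling exchange can be sketched as below. The message layout follows the common OpenAI-style chat format, and the `get_balance` tool is a hypothetical example, not part of the BFCL harness.

```python
# Minimal sketch of a multi-turn native-function-calling (FC) exchange,
# in the common OpenAI-style message format (illustrative assumption).
# Turn 1: the model emits a structured tool call; turn 2: it must retain
# context and use the tool result in its answer.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_balance",  # hypothetical tool
        "parameters": {
            "type": "object",
            "properties": {"account_id": {"type": "string"}},
            "required": ["account_id"],
        },
    },
}]

messages = [
    {"role": "user", "content": "What is the balance of account A-17?"},
    # Model's first turn: a structured tool call, not free text.
    {"role": "assistant", "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_balance",
                     "arguments": json.dumps({"account_id": "A-17"})},
    }]},
    # Tool result fed back; a multi-turn evaluator checks that the
    # model's next turn actually uses this value.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"balance": 120.5}'},
]

last_result = json.loads(messages[-1]["content"])
print(last_result["balance"])  # → 120.5
```

Multi-turn scoring rewards keeping this state across several such cycles, which is exactly where single-turn-tuned models lose points.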
| Source | Metric | Reliability |
|---|---|---|
| Berkeley gorilla.cs.berkeley.edu | BFCL v4, Multi-Turn (#11) | Original |
| llm-stats llm-stats.com | BFCL v3/v4 overall | Aggregator |
| PPT pricepertoken.com | BFCL v3 overall | Aggregator |
Scores are NOT directly comparable. Berkeley reports Multi-Turn (800 tests); the aggregators report overall (4,441 tests). A model strong at single-turn but weak at multi-turn will score high on the aggregators yet low on Berkeley.
Berkeley leader: Claude Opus 4.5 at 68.4% Multi-Turn.
Best budget options: Kimi K2 at 50.6% for an average of 146₽; DeepSeek V3.2 Exp at 44.9% for 34₽.
Qwen3 235B Thinking: 63.5% for an average of 36₽, the best price-to-quality ratio among confirmed scores.
GPT-5.2 is weak at multi-turn: 28.1% (Berkeley). Strong at single-turn, weak in multi-step scenarios.
RouterAI markup: roughly 27–30% over OpenRouter. Exceptions include Qwen3 235B Thinking (−44%) and Kimi K2 Thinking (+3%).
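The markup column can be reproduced from the table's own numbers: convert the OpenRouter USD price at the CBR rate from the footer (78.75 ₽/$) and compare with the RouterAI ruble price. A minimal sketch; averaging input and output prices into one percentage is an assumption about how the column was blended:

```python
# Reproduce the markup column: RouterAI ₽ price vs OpenRouter $ price
# converted at the CBR rate stated in the document footer.
RATE = 78.75  # ₽ per $, as of 8 April 2026

def markup(rai_in_rub, rai_out_rub, or_in_usd, or_out_usd, rate=RATE):
    """Percent markup of RouterAI over OpenRouter, averaged over
    input and output prices (the averaging scheme is an assumption)."""
    avg_rai = (rai_in_rub + rai_out_rub) / 2
    avg_or = (or_in_usd * rate + or_out_usd * rate) / 2
    return 100 * (avg_rai - avg_or) / avg_or

# Row 7, Claude Opus 4.5: 511/2558 ₽ vs $5.00/$25.00
print(round(markup(511, 2558, 5.00, 25.00)))  # → 30
```

Running this against other rows reproduces the ~27–30% band and flags the outliers named above.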
[Chart legend: [B]=Berkeley, [L]=llm-stats, [P]=PPT; hatching = aggregators. Scatter plot: ○=Berkeley, □=llm-stats, △=PPT; X axis is logarithmic.]
Value metric: score divided by the average RouterAI price (₽ per 1M tokens). Higher = more quality per ruble.
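This metric can be computed directly from the table. A minimal sketch; the plain average of input and output prices is an assumption about the weighting:

```python
# Value-for-money metric: score divided by the average RouterAI price
# (₽ per 1M tokens). Higher = more quality per ruble.
def value(score, rai_in_rub, rai_out_rub):
    avg_price = (rai_in_rub + rai_out_rub) / 2
    return score / avg_price

# Two rows from the table:
qwen = value(63.5, 11, 61)   # Qwen3 235B Thinking
kimi = value(50.6, 58, 235)  # Kimi K2
print(round(qwen, 2), round(kimi, 2))  # → 1.76 0.35
```

By this measure Qwen3 235B Thinking delivers roughly five times more score per ruble than Kimi K2, matching the "best ratio" claim above.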
| # | Model | Source | Score | RAI in ₽ | RAI out ₽ | OR in $ | OR out $ | Markup |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3 Max | PPT | 74.9 | 79 | 399 | $0.78 | $3.90 | +30% |
| 2 | GLM-4.7-Flash | PPT | 74.6 | 6 | 40 | $0.06 | $0.40 | +27% |
| 3 | LongCat Flash | llm-stats | 74.4 | 20 | 81 | $0.20 | $0.80 | +28% |
| 4 | Qwen3.5 397B | llm-stats | 72.9 | 39 | 239 | $0.39 | $2.34 | +29% |
| 5 | Qwen3.5 122B | llm-stats | 72.2 | 26 | 212 | $0.26 | $2.08 | +29% |
| 6 | Qwen3.5 27B | llm-stats | 68.5 | 19 | 159 | $0.20 | $1.56 | +29% |
| 7 | Claude Opus 4.5 | Berkeley | 68.4 | 511 | 2558 | $5.00 | $25.00 | +30% |
| 8 | GLM-4.6 thinking | Berkeley | 68.0 | 40 | 173 | $0.39 | $1.90 | +18% |
| 9 | Qwen3.5-35B-A3B | llm-stats | 67.3 | 25 | 102 | $0.16 | $1.30 | +10% |
| 10 | Qwen3.5 9B | llm-stats | 66.1 | 5 | 15 | $0.05 | $0.15 | +27% |
| 11 | Kimi K2.5 | PPT | 64.5 | 39 | 176 | $0.38 | $1.72 | +30% |
| 12 | Qwen3 235B Thinking | Berkeley | 63.5 | 11 | 61 | $0.15 | $1.50 | -44% |
| 13 | INTELLECT-3 | PPT | 63.5 | 20 | 112 | $0.20 | $1.10 | +29% |
| 14 | Gemini 3 Pro | Berkeley | 63.1 | 204 | 1228 | $2.00 | $12.00 | +30% |
| 15 | Claude Sonnet 4.5 | Berkeley | 61.4 | 307 | 1535 | $3.00 | $15.00 | +30% |
| 16 | Qwen3 Coder 480B | Berkeley | 59.5 | 25 | 102 | $0.22 | $1.00 | +32% |
| 17 | Grok 4.1 Fast | Berkeley | 58.9 | 20 | 51 | $0.20 | $0.50 | +29% |
| 18 | Llama 4 Scout | PPT | 55.7 | 8 | 30 | $0.08 | $0.30 | +27% |
| 19 | Claude Haiku 4.5 | Berkeley | 53.6 | 102 | 511 | $1.00 | $5.00 | +30% |
| 20 | Kimi K2 | Berkeley | 50.6 | 58 | 235 | $0.57 | $2.30 | +30% |
| 21 | Command A Reasoning | Berkeley | 50.1 | 255 | 1023 | $2.50 | $10.00 | +30% |
| 22 | Qwen3 32B | Berkeley | 47.9 | 8 | 24 | $0.08 | $0.24 | +27% |
| 23 | MiniMax M1 | PPT | 47.8 | 45 | 180 | $0.40 | $2.20 | +10% |
| 24 | Qwen3 235B Instruct | Berkeley | 45.4 | 7 | 10 | $0.07 | $0.10 | +25% |
| 25 | Command-R-plus-08 | Berkeley | 45.4 | 255 | 1023 | $2.50 | $10.00 | +30% |
| 26 | DeepSeek V3.2 Exp | Berkeley | 44.9 | 27 | 41 | $0.27 | $0.41 | +27% |
| 27 | Kimi K2 Thinking | Berkeley | 42.5 | 48 | 204 | $0.60 | $2.50 | +3% |
| 28 | o4-mini | Berkeley | 41.8 | 112 | 450 | $1.10 | $4.40 | +30% |
| 29 | Qwen3 8B | Berkeley | 41.8 | 5 | 40 | $0.05 | $0.40 | +27% |
| 30 | DeepSeek R1-0528 | Berkeley | 41.0 | 46 | 220 | $0.45 | $2.15 | +30% |
| 31 | Phi 4 | PPT | 40.8 | 6 | 14 | $0.07 | $0.14 | +24% |
| 32 | GLM-4.5-Air | Berkeley | 40.0 | 13 | 86 | $0.13 | $0.85 | +28% |
| 33 | GLM-4.6-Air thinking | Berkeley | 39.4 | 13 | 66 | $0.13 | $0.65 | +23% |
| 34 | GPT-4.1 | Berkeley | 38.9 | 204 | 818 | $2.00 | $8.00 | +30% |
| 35 | GLM-4.5 | Berkeley | 38.9 | 61 | 225 | $0.60 | $2.20 | +30% |
| 36 | Gemini 2.5 Flash | Berkeley | 36.3 | 20 | 102 | $0.15 | $0.60 | +49% |
| 37 | GLM-4.6 | Berkeley | 36.0 | 26 | 112 | $0.26 | $1.10 | +24% |
| 38 | Qwen3 14B | Berkeley | 34.8 | 5 | 30 | $0.05 | $0.30 | +21% |
| 39 | GPT-5-nano | Berkeley | 34.5 | 102 | 511 | $1.00 | $5.00 | +30% |
| 40 | GPT-4.1-mini | Berkeley | 34.1 | 51 | 204 | $0.50 | $2.00 | +30% |
| 41 | Grok 4 | Berkeley | 33.9 | 122 | 612 | $1.20 | $6.00 | +30% |
| 42 | Gemini 2.5 Pro | Berkeley | 32.0 | 153 | 1022 | $1.50 | $10.00 | +3% |
| 43 | Qwen3 30B A3B | Berkeley | 30.0 | 13 | 51 | $0.13 | $0.50 | +24% |
| 44 | DeepSeek V3.2 | Berkeley | 27.8 | 20 | 102 | $0.20 | $1.00 | +3% |
| 45 | GPT-4.1-nano | Berkeley | 23.6 | 10 | 41 | $0.10 | $0.40 | +29% |
| 46 | Llama 3.3 70B | Berkeley | 21.5 | 6 | 40 | $0.06 | $0.40 | +27% |
| 47 | Mistral Large | Berkeley | 14.1 | 122 | 612 | $1.20 | $6.00 | +30% |
| 48 | Gemini 2.5 Flash Lite | Berkeley | 13.5 | 2 | 10 | $0.01 | $0.07 | +43% |
| 49 | Mistral Small | Berkeley | 11.5 | 10 | 51 | $0.10 | $0.50 | +29% |
| 50 | Nova Pro | Berkeley | 1.9 | 78 | 327 | $0.80 | $3.20 | +30% |
Collected 8 April 2026 · CBR rate: 78.75 ₽/$ · mark.magserv.ru