MCP-Universe Benchmarking Results
Results from the MCP-Universe benchmark evaluating Large Language Models (LLMs) across real-world tasks, compared with our specialized agents.
Our Performance Summary
Initial results from benchmarking our specialized agents - built with our no-code agent builder - in two categories of the MCP-Universe benchmark.
- Location Navigation (LN): 71.19 success rate (+100% vs. best model)
- Repository Management (RM): 60.00 success rate (+230% vs. best model)
Benchmarking Framework Overview
The MCP-Universe benchmark is a comprehensive evaluation framework developed by Salesforce to assess Large Language Models (LLMs) in realistic, real-world scenarios through interaction with actual Model Context Protocol (MCP) servers. The benchmark addresses critical challenges in AI agent evaluation, including:
- Long-horizon reasoning: Testing models' ability to maintain context and plan across extended task sequences
- Unfamiliar tool spaces: Evaluating how models adapt to new and complex tool interfaces
- Real-world complexity: Using actual MCP servers rather than simulated environments
- Multi-domain evaluation: Assessing performance across six core domains including Location Navigation and Repository Management
The benchmark evaluates models on tasks that require understanding complex tool APIs, maintaining state across multiple interactions, and successfully completing multi-step workflows. This provides a more accurate assessment of how models perform in production environments compared to traditional synthetic benchmarks.
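To make the interaction model concrete, the sketch below shows what one multi-step exchange with an MCP server looks like from the client side. It is a minimal illustration assuming the official MCP Python SDK (the `mcp` package); the server command and the `search_place` tool name are placeholders, not the actual MCP-Universe servers or tools.

```python
# Minimal sketch of a multi-step MCP interaction, assuming the official
# MCP Python SDK ("mcp" package). Server command and tool name below are
# placeholders, not the real MCP-Universe servers.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def run_task() -> None:
    # Launch a hypothetical MCP server as a subprocess over stdio.
    server = StdioServerParameters(command="npx", args=["example-mcp-server"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Step 1: discover the tool space the agent must adapt to.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Steps 2..N: the agent (an LLM choosing tool calls) invokes
            # tools, carrying state from earlier results into later calls.
            result = await session.call_tool(
                "search_place",  # placeholder tool name
                {"query": "nearest train station"},
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(run_task())
```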
Initial Benchmarking Results
Two specialized agents for Location Navigation (LN) and Repository Management (RM), built with our agent builder, achieved success rates of 71.19 on LN and 60.00 on RM, with an overall score of 66.37. Both agents use Claude-3.7-Sonnet as the base LLM with plain function calling (no ReAct). Compared with the best-performing model in the benchmark (Gemini-3-Pro-Preview), this corresponds to a +100.20% improvement on LN and +230.03% on RM.
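The improvement figures follow directly from the success rates in the results table below; as a quick check:

```python
# Percentage improvement of our agents over Gemini-3-Pro-Preview,
# computed from the success rates reported in the results table.
def improvement(ours: float, best: float) -> float:
    return (ours - best) / best * 100

print(f"LN: +{improvement(71.19, 35.56):.2f}%")  # -> LN: +100.20%
print(f"RM: +{improvement(60.00, 18.18):.2f}%")  # -> RM: +230.03%
```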
Our orchestration framework takes a more granular approach, making more calls per task (an average of 32.2 steps) than the other models. However, each call receives only the minimal context it needs, which reduces token consumption per call and yields more efficient token usage overall while maintaining higher success rates.
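The sketch below illustrates this trade-off in simplified form: many small calls, each given only a compact summary of prior results rather than the full transcript. It is an illustration of the idea, not our actual orchestration code; `call_llm` and `summarize` are hypothetical helpers.

```python
# Illustrative sketch (not our actual implementation) of granular
# orchestration: many small LLM calls, each with minimal context,
# instead of re-sending the full history on every call.
from typing import Callable


def run_granular(
    steps: list[str],
    call_llm: Callable[[str], str],   # hypothetical LLM call
    summarize: Callable[[str], str],  # hypothetical state compressor
) -> list[str]:
    """Execute steps one at a time, passing only a compact summary of
    prior results instead of the full transcript."""
    carry = ""        # compact state carried between calls
    outputs = []
    for step in steps:
        prompt = f"{carry}\nTask: {step}"            # minimal per-call context
        result = call_llm(prompt)
        outputs.append(result)
        carry = summarize(carry + "\n" + result)     # keep the carry small
    return outputs
```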
Benchmark Results
| Model | Location (SR) | Repository (SR) | Average Steps | Average Success Rate |
|---|---|---|---|---|
| OrkestralAI specialized agents (FC) | 71.19 | 60.00 | 32.2 | 66.37 |
| Gemini-3-Pro-Preview (FC) | 35.56 | 18.18 | 7.8 | 44.59 |
| GPT-5-High (ReAct) | 26.67 | 30.30 | 6.84 | 44.16 |
| GPT-5-Medium (ReAct) | 33.33 | 30.30 | 8.22 | 43.72 |
| Grok-4.1-Fast (FC) | 28.89 | 15.15 | 6.32 | 40.69 |
| Claude-4.5-Sonnet (FC) | 26.67 | 12.12 | 9.54 | 35.06 |
| Grok-4 (ReAct) | 28.89 | 12.12 | 7.75 | 33.33 |
| Claude-4.0-Sonnet (FC) | 22.22 | 6.06 | 9.78 | 32.90 |
| Grok-4-Fast (FC) | 22.22 | 6.06 | 7.25 | 32.47 |
| Claude-4.0-Sonnet-Thinking (FC) | 24.44 | 6.06 | 9.12 | 31.60 |
| Claude-4.1-Opus (ReAct) | 17.78 | 21.21 | 7.04 | 29.44 |
| Claude-4.0-Opus (ReAct) | 15.56 | 15.15 | 7.69 | 28.14 |
| Grok-Code-Fast-1 (ReAct) | 26.67 | 9.09 | 6.87 | 26.41 |
| o3-Medium (ReAct) | 26.67 | 6.06 | 4.82 | 26.41 |
| Kimi-K2-Thinking (FC) | 20.00 | 12.12 | 8.15 | 26.41 |
| Claude-4.5-Haiku (FC) | 22.22 | 12.12 | 8.41 | 26.41 |
| o4-mini-Medium (ReAct) | 26.67 | 18.18 | 7.9 | 25.97 |
| GLM-4.6 (ReAct) | 15.56 | 9.09 | 8.07 | 25.97 |
| GPT-OSS-120B (FC) | 24.44 | 15.15 | 7.53 | 25.54 |
| GLM-4.5 (ReAct) | 17.78 | 9.09 | 7.33 | 24.68 |
| Claude-3.7-Sonnet (ReAct) | 13.33 | 18.18 | 7.16 | 24.24 |
| Qwen3-Coder-480B-A35B-Instruct (ReAct) | 13.33 | 3.03 | 7.77 | 22.94 |
| Gemini-2.5-Pro | 13.33 | 12.12 | 6.98 | 22.08 |
| DeepSeek-V3.1 (ReAct) | 15.56 | 0.00 | 6.31 | 22.08 |
| Gemini-2.5-Flash (ReAct) | 15.56 | 12.12 | 8.26 | 21.65 |
| DeepSeek-V3.1-Terminus (ReAct) | 13.33 | 6.06 | 6.44 | 21.65 |
| GPT-4.1 (FC) | 15.56 | 6.06 | 6.83 | 19.91 |
| DeepSeek-V3.2-Exp (ReAct) | 17.78 | 0.00 | 6.48 | 19.91 |
| Kimi-K2-0905 | 11.11 | 3.03 | 6.96 | 19.91 |
| GLM-4.5-Air (ReAct) | 17.78 | 6.06 | 6.42 | 19.48 |
| Kimi-K2-0711 (ReAct) | 11.11 | 9.09 | 6.07 | 19.05 |
| Qwen3-Max-Preview (Instruct) (ReAct) | 20.00 | 6.06 | 5.5 | 18.18 |
| Qwen3-235B-A22B-Instruct-2507 (ReAct) | 11.11 | 9.09 | 5.74 | 18.18 |
| GPT-4o-2024-12 | 8.89 | 9.09 | 6.03 | 15.58 |
| DeepSeek-V3 (ReAct) | 11.11 | 6.06 | 5.06 | 14.29 |
Source: MCP-Universe, https://mcp-universe.github.io/