🧜♀️ Merbench - LLM Evaluation
Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.
Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.
Each model is tested across three difficulty levels, with a limit of five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool usage.
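The attempt-limited loop described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual harness: `run_test_case`, `generate`, and `validate` are hypothetical names, and the toy validator only stands in for the MCP server's Mermaid validation so that the sketch runs on its own.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # attempt limit per test case, as stated above


def run_test_case(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int]:
    """Retry until the diagram validates or the attempts run out.

    Returns (succeeded, attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(prompt, feedback)  # hypothetical model call
        ok, error = validate(diagram)         # hypothetical validator call
        if ok:
            return True, attempt
        feedback = error  # pass the validator's message to the next attempt
    return False, max_attempts


# Toy stand-ins (assumptions) so the sketch runs without a model or a server.
def toy_validate(diagram: str) -> Tuple[bool, str]:
    if diagram.startswith("graph TD"):
        return True, ""
    return False, "diagram must start with a diagram type such as 'graph TD'"


def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    body = "A --> B"
    # The toy "model" only adds the header once it has seen error feedback.
    return f"graph TD\n{body}" if feedback else body


result = run_test_case("Draw A pointing to B", toy_generate, toy_validate)
print(result)
```

Passing the model and validator in as callables keeps the loop itself independent of any particular provider or validation backend.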
Evaluation Summary
- 1159 Total Evaluation Runs
- 25 Models Evaluated
- 3 Test Cases
- Providers Tested
 Data updated: Oct 17, 2025 
What do these metrics mean?
- Success Rate: The percentage of successful Mermaid diagram generations out of all runs.
- Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
- Price/Success: The effective cost for each successful diagram, calculated as (Avg Cost / Success Rate).
- Avg Duration: The average time in seconds taken to generate a diagram.
- Avg Tokens: The average number of tokens (input + output) used per run.
- Runs: The total number of times this model was run in the evaluation.
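As a rough illustration of how the metrics above relate to each other, the sketch below aggregates a list of per-run records. The `Run` and `Summary` types and the `summarize` function are hypothetical names, not Merbench's code. The success rate is held as a fraction (0 to 1) rather than a percentage, and returning `None` for Price/Success when no run succeeds is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    success: bool
    cost_usd: float
    duration_s: float
    tokens: int  # input + output tokens for this run


@dataclass
class Summary:
    success_rate: float                # fraction of successful runs, 0..1
    avg_cost: float                    # USD per run
    price_per_success: Optional[float]  # None when no run succeeded (assumption)
    avg_duration: float                # seconds per run
    avg_tokens: float
    runs: int


def summarize(runs: List[Run]) -> Summary:
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    success_rate = sum(r.success for r in runs) / n
    avg_cost = sum(r.cost_usd for r in runs) / n
    # Price/Success = Avg Cost / Success Rate; undefined at a 0% success rate.
    price = avg_cost / success_rate if success_rate > 0 else None
    return Summary(
        success_rate=success_rate,
        avg_cost=avg_cost,
        price_per_success=price,
        avg_duration=sum(r.duration_s for r in runs) / n,
        avg_tokens=sum(r.tokens for r in runs) / n,
        runs=n,
    )


# Made-up example data: four runs, two of which succeed.
example = [
    Run(True, 0.02, 10.0, 1000),
    Run(False, 0.02, 20.0, 2000),
    Run(True, 0.02, 30.0, 3000),
    Run(False, 0.02, 40.0, 4000),
]
summary = summarize(example)
print(summary)
```

With half the runs succeeding, the effective price per successful diagram comes out at twice the average cost per run, which is why a cheap model with a low success rate can still be expensive per usable diagram.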
Model Leaderboard
| Rank | Model | Success Rate ↓ | Avg Cost/Run | Price/Success | Avg Duration | Avg Tokens | Runs | Provider |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-2.5-flash-preview-09-2025 | | | $0.0661 | 32.33s | 22,980.822 | 45 | |
| 2 | gemini-2.5-pro-preview-06-05 | | | $0.1302 | 36.84s | 8,111.882 | 51 | |
| 3 | gemini-2.5-pro-preview-05-06 | | | $0.4904 | 49.85s | 19,753.911 | 45 | |
| 4 | gemini-2.5-pro-preview-03-25 | | | $0.4942 | 57.17s | 16,393.313 | 48 | |
| 5 | gemini-2.5-pro | | | $0.2722 | 32.94s | 14,255.511 | 45 | |
| 6 | gemini-2.5-flash | | | $0.0957 | 10.15s | 6,990.467 | 45 | |
| 7 | qwen3-30b-a3b-thinking-2507-mlx | | | $0.0168 | 92.27s | 8,166.795 | 39 | OSS |
| 8 | seed-oss-36b-instruct-mlx | | | $0.0150 | 396.96s | 3,053.438 | 16 | OSS |
| 9 | gemini-2.5-flash-preview-05-20 | | | $0.2014 | 9.75s | 5,771.55 | 60 | |
| 10 | gemini-2.5-flash-lite-preview-06-17 | | | $0.0163 | 4.40s | 4,974.583 | 60 | |
| 11 | gemini-2.5-flash-preview-04-17 | | | $0.5237 | 24.15s | 10,492.711 | 45 | |
| 12 | gemini-2.5-flash-lite | | | $0.0382 | 5.90s | 9,506.689 | 90 | |
| 13 | us.amazon.nova-premier-v1:0 | | | $1.0692 | 63.19s | 9,528.967 | 60 | Amazon |
| 14 | gpt-oss-20b | | | $0.0111 | 47.90s | 3,896.022 | 45 | OSS |
| 15 | us.amazon.nova-pro-v1:0 | | | N/A | 49.53s | 678.15 | 60 | Amazon |
| 16 | us.amazon.nova-micro-v1:0 | | | N/A | 18.83s | 1,783.85 | 60 | Amazon |
| 17 | us.amazon.nova-lite-v1:0 | | | N/A | 24.54s | 2,799.317 | 60 | Amazon |
| 18 | gemini-2.0-flash | | | N/A | 4.21s | 1,325.667 | 60 | |
| 19 | google/gemma-3-27b | | | N/A | 120.44s | 6,954.467 | 45 | OSS |
| 20 | qwen3-coder-30b-a3b-instruct-mlx | | | N/A | 21.61s | 4,383.356 | 45 | OSS |
| 21 | qwen/qwen3-30b-a3b-2507 | | | N/A | 23.04s | 4,277.978 | 45 | OSS |
| 22 | magistral-small-2509-mlx | | | N/A | 582.19s | 4,438.333 | 15 | OSS |
| 23 | llama-xlam-2-70b-fc-r | | | N/A | 238.06s | 8,591.267 | 15 | OSS |
| 24 | gemini-2.5-flash-lite-preview-09-2025 | | | N/A | 5.68s | 5,687.822 | 45 | |
| 25 | xlam-2-32b-fc-r | | | N/A | 171.86s | 8,210.2 | 15 | OSS |
Charts: Performance vs Efficiency Trade-offs, Performance by Difficulty Level, Token Usage Breakdown, Failure Analysis by Reason.
  Last updated: October 25, 2025 at 01:14 AM UTC