Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with a limit of five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool usage.
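The validate-and-retry flow described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual harness: `solve_with_feedback`, `ValidationResult`, `toy_generate`, and `toy_validate` are hypothetical names, the model call and the MCP validator are stand-in callables, and the toy validator only checks a prefix rather than real Mermaid syntax. The only detail taken from the text is the limit of five attempts.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 5  # attempt limit per test case, as stated in the text


@dataclass
class ValidationResult:
    ok: bool
    error: Optional[str] = None  # validator feedback when ok is False


def solve_with_feedback(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # stand-in for the LLM call
    validate: Callable[[str], ValidationResult],    # stand-in for the MCP validator
    max_attempts: int = MAX_ATTEMPTS,
) -> tuple[bool, int]:
    """Return (succeeded, attempts_used) for one test case."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(task, feedback)
        result = validate(diagram)
        if result.ok:
            return True, attempt
        feedback = result.error  # pass the error back into the next attempt
    return False, max_attempts


# Toy stand-ins (assumptions for illustration only).
def toy_validate(diagram: str) -> ValidationResult:
    if diagram.startswith("graph "):
        return ValidationResult(ok=True)
    return ValidationResult(ok=False, error="expected diagram to start with 'graph '")


def toy_generate(task: str, feedback: Optional[str]) -> str:
    # First attempt omits the header; later attempts "fix" it after feedback.
    return "A --> B" if feedback is None else "graph TD\n  A --> B"


outcome = solve_with_feedback("draw a two-node flowchart", toy_generate, toy_validate)
```

With these toy stand-ins the first attempt fails validation and the second succeeds, so the run counts as a success that used two of its five attempts.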

Evaluation Summary

Total Evaluation Runs: 2,057
Models Evaluated: 46
Test Cases: 3

Providers Tested

Amazon, Anthropic, Google, OSS, OpenAI
Data updated: Jan 14, 2026
What do these metrics mean?
Success Rate: The percentage of runs that produced a valid Mermaid diagram, out of all runs.
Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
Price/Success: The effective cost of each successful diagram, calculated as Avg Cost/Run divided by Success Rate. Shown as N/A when the success rate is 0%.
Avg Duration: The average time in seconds taken to generate a diagram.
Avg Tokens: The average number of tokens (input + output) used per run.
Runs: The total number of times this model was run in the evaluation.
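As a worked illustration of these definitions, the sketch below computes the metrics from a list of per-run records. The `Run` record and `summarize` function are hypothetical names, not part of Merbench. The sample data is invented so that the rounded results land on an 80.0% success rate, $0.08 average cost, and $0.10 price per success.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    success: bool
    cost_usd: float
    duration_s: float
    tokens: int  # input + output tokens for the run


def summarize(runs: list[Run]) -> dict:
    """Aggregate per-run records into the leaderboard metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    success_rate = sum(r.success for r in runs) / n
    avg_cost = sum(r.cost_usd for r in runs) / n
    # Price/Success = Avg Cost/Run divided by Success Rate; undefined (N/A) at 0%.
    price_per_success: Optional[float] = (
        avg_cost / success_rate if success_rate > 0 else None
    )
    return {
        "success_rate": success_rate,
        "avg_cost": avg_cost,
        "price_per_success": price_per_success,
        "avg_duration": sum(r.duration_s for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "runs": n,
    }


# Invented sample: 45 runs at $0.08 each, 36 of them successful.
sample = [Run(True, 0.08, 60.0, 47_000)] * 36 + [Run(False, 0.08, 60.0, 47_000)] * 9
stats = summarize(sample)

# A model that never succeeds has no defined price per success.
never = summarize([Run(False, 0.01, 20.0, 1_800)] * 45)
```

Because the divisor is the success rate, a cheap model with a low success rate can end up costing more per usable diagram than a pricier model that succeeds almost every time.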

Model Leaderboard

Rank | Model | Success Rate | Avg Cost/Run | Price/Success | Avg Duration | Avg Tokens | Runs | Provider
1 | gpt-5.2-codex (medium) | 100.0% | $0.14 | $0.14 | 54.06s | 64,101 | 45 | OpenAI
2 | gpt-5.2 (medium) | 100.0% | $0.17 | $0.17 | 43.19s | 69,768 | 45 | OpenAI
3 | claude-opus-4.5 | 100.0% | $0.53 | $0.53 | 58.24s | 93,280 | 45 | Anthropic
4 | gpt-5.1-codex-max (medium) | 97.8% | $0.12 | $0.13 | 54.27s | 78,085 | 45 | OpenAI
5 | gemini-3-pro-preview | 97.8% | $0.25 | $0.25 | 79.49s | 80,741 | 45 | Google
6 | claude-sonnet-4.5 | 95.6% | $0.26 | $0.27 | 58.42s | 75,860 | 45 | Anthropic
7 | gpt-5 (medium) | 93.3% | $0.14 | $0.15 | 76.25s | 74,458 | 45 | OpenAI
8 | gpt-5.1-codex (medium) | 93.3% | $0.14 | $0.15 | 63.58s | 89,230 | 45 | OpenAI
9 | o3 | 93.3% | $0.15 | $0.16 | 118.85s | 64,968 | 45 | OpenAI
10 | gpt-5.1 (medium) | 91.1% | $0.11 | $0.12 | 53.77s | 61,876 | 45 | OpenAI
11 | gpt-5.2 (none) | 86.7% | $0.13 | $0.15 | 21.41s | 57,198 | 45 | OpenAI
12 | gpt-5-codex | 80.0% | $0.08 | $0.10 | 59.10s | 47,452 | 45 | OpenAI
13 | gemini-3-flash-preview | 77.8% | $0.05 | $0.06 | 40.33s | 69,554 | 45 | Google
14 | o4-mini | 62.2% | $0.08 | $0.13 | 60.44s | 59,667 | 45 | OpenAI
15 | gpt-5.1 (none) | 55.6% | $0.03 | $0.06 | 33.31s | 17,784 | 45 | OpenAI
16 | gemini-2.5-flash-preview-09-2025 | 45.7% | $0.03 | $0.07 | 46.23s | 46,035 | 46 | Google
17 | gpt-4.1 | 42.2% | $0.01 | $0.03 | 16.41s | 4,300 | 45 | OpenAI
18 | gpt-5-mini | 35.6% | $0.01 | $0.02 | 51.49s | 8,576 | 45 | OpenAI
19 | gemini-2.5-pro | 35.6% | $0.10 | $0.29 | 52.74s | 31,300 | 45 | Google
20 | gpt-5.1-codex-mini (medium) | 33.3% | $0.01 | $0.04 | 31.13s | 32,398 | 45 | OpenAI
21 | gemini-2.5-pro-preview-06-05 | 27.1% | $0.04 | $0.14 | 36.92s | 8,246 | 48 | Google
22 | gemini-2.5-pro-preview-05-06 | 26.7% | $0.13 | $0.49 | 49.85s | 19,754 | 45 | Google
23 | gemini-2.5-pro-preview-03-25 | 22.9% | $0.11 | $0.49 | 57.17s | 16,394 | 48 | Google
24 | gpt-4.1-mini | 20.0% | $0.00 | $0.01 | 23.43s | 3,733 | 45 | OpenAI
25 | gpt-5-nano | 13.3% | $0.00 | $0.03 | 88.50s | 11,963 | 45 | OpenAI
26 | gemini-2.5-flash | 13.3% | $0.01 | $0.09 | 10.15s | 6,991 | 90 | Google
27 | qwen3-30b-a3b-thinking-2507-mlx | 10.3% | $0.00 | $0.02 | 92.27s | 8,167 | 39 | OSS
28 | seed-oss-36b-instruct-mlx | 6.3% | $0.00 | $0.01 | 396.96s | 3,054 | 16 | OSS
29 | gemini-2.5-flash-lite-preview-06-17 | 4.4% | $0.00 | $0.02 | 4.40s | 5,234 | 45 | Google
30 | gemini-2.5-flash-preview-05-20 | 4.4% | $0.01 | $0.18 | 9.26s | 5,120 | 45 | Google
31 | gemini-2.5-flash-preview-04-17 | 4.4% | $0.02 | $0.52 | 24.15s | 10,493 | 45 | Google
32 | gemini-2.5-flash-lite | 3.3% | $0.00 | $0.04 | 5.90s | 9,507 | 90 | Google
33 | gpt-oss-20b | 2.2% | $0.00 | $0.01 | 47.90s | 3,897 | 45 | OSS
34 | nova-premier-v1:0 | 2.2% | $0.03 | $1.29 | 58.14s | 7,520 | 45 | Amazon
35 | nova-micro-v1:0 | 0.0% | $0.00 | N/A | 19.31s | 1,798 | 45 | Amazon
36 | nova-lite-v1:0 | 0.0% | $0.00 | N/A | 24.21s | 3,091 | 45 | Amazon
37 | gemini-2.0-flash | 0.0% | $0.00 | N/A | 4.21s | 1,326 | 60 | Google
38 | xlam-2-32b-fc-r | 0.0% | $0.00 | N/A | 171.86s | 8,211 | 15 | OSS
39 | qwen3-coder-30b-a3b-instruct-mlx | 0.0% | $0.00 | N/A | 21.61s | 4,384 | 45 | OSS
40 | qwen3-30b-a3b-2507 | 0.0% | $0.00 | N/A | 23.04s | 4,278 | 45 | OSS
41 | gpt-4.1-nano | 0.0% | $0.00 | N/A | 18.99s | 3,920 | 45 | OpenAI
42 | gemma-3-27b | 0.0% | $0.00 | N/A | 120.44s | 6,955 | 45 | OSS
43 | gemini-2.5-flash-lite-preview-09-2025 | 0.0% | $0.00 | N/A | 5.68s | 5,688 | 45 | Google
44 | nova-pro-v1:0 | 0.0% | $0.00 | N/A | 49.14s | 905 | 45 | Amazon
45 | llama-xlam-2-70b-fc-r | 0.0% | $0.00 | N/A | 238.06s | 8,592 | 15 | OSS
46 | magistral-small-2509-mlx | 0.0% | $0.00 | N/A | 582.19s | 4,439 | 15 | OSS
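The page does not state how ties in success rate are ordered. The displayed values are consistent with sorting by success rate (descending) and then by average cost per run (ascending), but that tie-break is an inference from the numbers, not a documented rule. The sketch below applies that assumed ordering to the top five leaderboard entries, using their displayed success rates and costs in shuffled order.

```python
# (model, success rate in %, avg cost per run in USD), copied from the leaderboard.
rows = [
    ("claude-opus-4.5", 100.0, 0.53),
    ("gpt-5.2 (medium)", 100.0, 0.17),
    ("gemini-3-pro-preview", 97.8, 0.25),
    ("gpt-5.2-codex (medium)", 100.0, 0.14),
    ("gpt-5.1-codex-max (medium)", 97.8, 0.12),
]

# Assumed ordering: highest success rate first, cheapest first among ties.
ranked = sorted(rows, key=lambda r: (-r[1], r[2]))
ranked_models = [model for model, _, _ in ranked]
```

Under this assumption the result reproduces ranks 1 to 5 as listed. Rows whose displayed rate and cost are both identical cannot be distinguished this way, so their relative order remains unexplained.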

Performance vs Efficiency Trade-offs


Performance by Difficulty Level


Token Usage Breakdown


Failure Analysis by Reason


Last updated: February 10, 2026 at 12:30 AM UTC