Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with up to five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool use.
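
The generate, validate, and retry loop described above can be sketched roughly as follows. This is not the Merbench harness itself: the real agent talks to an MCP server, whereas here `generate` and `validate` are hypothetical stand-ins passed in as plain functions, and the two `fake_*` helpers exist only to make the sketch runnable. The only detail taken from the description is the limit of five attempts per test case.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # attempt limit per test case, as stated in the description


def solve_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int]:
    """Return (succeeded, attempts_used) for a single test case."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(task, feedback)   # model writes or repairs the diagram
        ok, error = validate(diagram)        # validator reports success or an error
        if ok:
            return True, attempt
        feedback = error                     # error text guides the next attempt
    return False, max_attempts


# Hypothetical stand-ins, for illustration only.
def fake_validate(diagram: str) -> Tuple[bool, str]:
    if diagram.startswith("graph TD"):
        return True, ""
    return False, "expected a diagram type such as 'graph TD'"


def fake_generate(task: str, feedback: Optional[str]) -> str:
    # First try omits the diagram type; the retry fixes it after feedback.
    return "A --> B" if feedback is None else "graph TD\n  A --> B"


if __name__ == "__main__":
    print(solve_with_feedback(fake_generate, fake_validate, "draw A to B"))
```

A test case counts as a success only if some attempt within the limit validates; a model that never produces valid syntax uses all five attempts and is recorded as a failure.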

Evaluation Summary

Total Evaluation Runs: 2012
Models Evaluated: 42
Test Cases: 3

Providers Tested

Amazon, Google, OSS, OpenAI
Data updated: Dec 12, 2025
What do these metrics mean?
Success Rate
The percentage of successful Mermaid diagram generations out of all runs.
Avg Cost/Run
The average cost in USD to generate one diagram, based on provider pricing.
Price/Success
The effective cost for each successful diagram, calculated as (Avg Cost / Success Rate).
Avg Duration
The average time in seconds taken to generate a diagram.
Avg Tokens
The average number of tokens (input + output) used per run.
Runs
The total number of times this model was run in the evaluation.
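
As a rough illustration of how Success Rate and Price/Success relate, the sketch below recomputes both from raw counts. The function names are hypothetical, and the count of 28 successes is an assumption chosen because 28 of 45 runs rounds to the 62.2% shown for o4-mini; the page does not publish success counts. Because the leaderboard displays rounded costs, recomputing Price/Success from displayed values can differ from the listed figure by a cent.

```python
from typing import Optional


def success_rate(successes: int, runs: int) -> float:
    """Fraction of runs that ended with a valid diagram."""
    return successes / runs


def price_per_success(avg_cost: float, rate: float) -> Optional[float]:
    """Avg Cost / Success Rate; None when no run succeeded (shown as N/A)."""
    if rate == 0:
        return None
    return avg_cost / rate


if __name__ == "__main__":
    rate = success_rate(28, 45)            # assumed count, matches 62.2%
    print(f"{rate * 100:.1f}%")
    print(f"${price_per_success(0.08, rate):.2f}")
    print(price_per_success(0.00, 0.0))    # no successes, so no price per success
```

Dividing by the success rate spreads the cost of failed runs over the successful ones, which is why a cheap model with a low success rate can end up with a high Price/Success.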

Model Leaderboard

Rank | Model | Success Rate | Avg Cost/Run | Price/Success | Avg Duration | Avg Tokens | Runs | Provider
1 | gpt-5.2 (medium) | 100.0% | $0.17 | $0.17 | 43.19s | 69,768 | 45 | OpenAI
2 | gpt-5.1-codex-max (medium) | 97.8% | $0.12 | $0.13 | 54.27s | 78,085 | 45 | OpenAI
3 | gemini-3-pro-preview | 97.8% | $0.25 | $0.25 | 79.49s | 80,741 | 45 | Google
4 | gpt-5 (medium) | 93.3% | $0.14 | $0.15 | 76.25s | 74,458 | 45 | OpenAI
5 | gpt-5.1-codex (medium) | 93.3% | $0.14 | $0.15 | 63.58s | 89,230 | 45 | OpenAI
6 | o3 | 93.3% | $0.15 | $0.16 | 118.85s | 64,968 | 45 | OpenAI
7 | gpt-5.1 (medium) | 91.1% | $0.11 | $0.12 | 53.77s | 61,876 | 45 | OpenAI
8 | gpt-5.2 (none) | 86.7% | $0.13 | $0.15 | 21.41s | 57,198 | 45 | OpenAI
9 | gpt-5-codex | 80.0% | $0.08 | $0.10 | 59.10s | 47,452 | 45 | OpenAI
10 | o4-mini | 62.2% | $0.08 | $0.13 | 60.44s | 59,667 | 45 | OpenAI
11 | gpt-5.1 (none) | 55.6% | $0.03 | $0.06 | 33.31s | 17,784 | 45 | OpenAI
12 | gemini-2.5-flash-preview-09-2025 | 45.7% | $0.03 | $0.07 | 46.23s | 46,035 | 46 | Google
13 | gpt-4.1 | 42.2% | $0.01 | $0.03 | 16.41s | 4,300 | 45 | OpenAI
14 | gpt-5-mini | 35.6% | $0.01 | $0.02 | 51.49s | 8,576 | 45 | OpenAI
15 | gemini-2.5-pro | 35.6% | $0.10 | $0.29 | 52.74s | 31,300 | 45 | Google
16 | gpt-5.1-codex-mini (medium) | 33.3% | $0.01 | $0.04 | 31.13s | 32,398 | 45 | OpenAI
17 | gemini-2.5-pro-preview-06-05 | 27.1% | $0.04 | $0.14 | 36.92s | 8,246 | 48 | Google
18 | gemini-2.5-pro-preview-05-06 | 26.7% | $0.13 | $0.49 | 49.85s | 19,754 | 45 | Google
19 | gemini-2.5-pro-preview-03-25 | 22.9% | $0.11 | $0.49 | 57.17s | 16,394 | 48 | Google
20 | gpt-4.1-mini | 20.0% | $0.00 | $0.01 | 23.43s | 3,733 | 45 | OpenAI
21 | gpt-5-nano | 13.3% | $0.00 | $0.03 | 88.50s | 11,963 | 45 | OpenAI
22 | gemini-2.5-flash | 13.3% | $0.01 | $0.09 | 10.15s | 6,991 | 90 | Google
23 | qwen3-30b-a3b-thinking-2507-mlx | 10.3% | $0.00 | $0.02 | 92.27s | 8,167 | 39 | OSS
24 | seed-oss-36b-instruct-mlx | 6.3% | $0.00 | $0.01 | 396.96s | 3,054 | 16 | OSS
25 | gemini-2.5-flash-lite-preview-06-17 | 4.4% | $0.00 | $0.02 | 4.40s | 5,234 | 45 | Google
26 | gemini-2.5-flash-preview-05-20 | 4.4% | $0.01 | $0.18 | 9.26s | 5,120 | 45 | Google
27 | gemini-2.5-flash-preview-04-17 | 4.4% | $0.02 | $0.52 | 24.15s | 10,493 | 45 | Google
28 | gemini-2.5-flash-lite | 3.3% | $0.00 | $0.04 | 5.90s | 9,507 | 90 | Google
29 | gpt-oss-20b | 2.2% | $0.00 | $0.01 | 47.90s | 3,897 | 45 | OSS
30 | nova-premier-v1:0 | 2.2% | $0.03 | $1.29 | 58.14s | 7,520 | 45 | Amazon
31 | nova-micro-v1:0 | 0.0% | $0.00 | N/A | 19.31s | 1,798 | 45 | Amazon
32 | nova-lite-v1:0 | 0.0% | $0.00 | N/A | 24.21s | 3,091 | 45 | Amazon
33 | gemini-2.0-flash | 0.0% | $0.00 | N/A | 4.21s | 1,326 | 60 | Google
34 | xlam-2-32b-fc-r | 0.0% | $0.00 | N/A | 171.86s | 8,211 | 15 | OSS
35 | qwen3-coder-30b-a3b-instruct-mlx | 0.0% | $0.00 | N/A | 21.61s | 4,384 | 45 | OSS
36 | qwen3-30b-a3b-2507 | 0.0% | $0.00 | N/A | 23.04s | 4,278 | 45 | OSS
37 | gpt-4.1-nano | 0.0% | $0.00 | N/A | 18.99s | 3,920 | 45 | OpenAI
38 | google/gemma-3-27b | 0.0% | $0.00 | N/A | 120.44s | 6,955 | 45 | OSS
39 | gemini-2.5-flash-lite-preview-09-2025 | 0.0% | $0.00 | N/A | 5.68s | 5,688 | 45 | Google
40 | nova-pro-v1:0 | 0.0% | $0.00 | N/A | 49.14s | 905 | 45 | Amazon
41 | llama-xlam-2-70b-fc-r | 0.0% | $0.00 | N/A | 238.06s | 8,592 | 15 | OSS
42 | magistral-small-2509-mlx | 0.0% | $0.00 | N/A | 582.19s | 4,439 | 15 | OSS

[Chart: Performance vs Efficiency Trade-offs]

[Chart: Performance by Difficulty Level]

[Chart: Token Usage Breakdown]

[Chart: Failure Analysis by Reason]

Last updated: December 15, 2025 at 01:26 AM UTC