In this dashboard, the 2026 models are evaluated with thinking enabled.
The overall score is benchmark-normalized: on each benchmark, models are scored relative to the other models on that task, and the normalized scores are then averaged. Missing benchmark coverage reduces the final score. For classification tasks the primary score is accuracy; for SweParaphrase it is Pearson correlation.
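A minimal sketch of this computation in Python. The min-max normalization and the handling of missing coverage (uncovered benchmarks contribute 0 to the average) are assumptions; the dashboard may use a different scheme, and all names here are hypothetical.

```python
def overall_scores(results: dict[str, dict[str, float]],
                   benchmarks: list[str]) -> dict[str, float]:
    """results[model][benchmark] -> primary score (accuracy or Pearson r)."""
    normalized: dict[str, dict[str, float]] = {m: {} for m in results}

    # Normalize each benchmark across the models that cover it (assumed min-max).
    for bench in benchmarks:
        scores = {m: r[bench] for m, r in results.items() if bench in r}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        for model, s in scores.items():
            normalized[model][bench] = (s - lo) / span

    # Average over *all* benchmarks, so missing coverage lowers the final score.
    return {
        model: sum(norm.get(b, 0.0) for b in benchmarks) / len(benchmarks)
        for model, norm in normalized.items()
    }
```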
This view plots each model's average across the benchmarks used by the leaderboard. Up and to the left is better (higher score, lower latency).
Each scatter plot shows one benchmark. The x-axis is average latency in seconds and the y-axis is that benchmark's primary score.
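A minimal sketch of one such per-benchmark panel, assuming matplotlib and hypothetical field names (`model`, `latency_s`, `score`) on the result records; the dashboard's actual rendering may differ.

```python
import matplotlib.pyplot as plt

def plot_benchmark(name: str, rows: list[dict]) -> None:
    """rows: one dict per model, e.g. {"model": ..., "latency_s": ..., "score": ...}."""
    fig, ax = plt.subplots()
    ax.scatter([r["latency_s"] for r in rows], [r["score"] for r in rows])
    for r in rows:
        ax.annotate(r["model"], (r["latency_s"], r["score"]))
    ax.set_xlabel("Average latency (s)")    # left (lower latency) is better
    ax.set_ylabel(f"{name} primary score")  # up (higher score) is better
    ax.set_title(name)
    plt.show()
```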
Compact per-benchmark summaries for the currently selected models.
Short descriptions for the benchmarks shown above.