Phase 1 in progress

Leaderboard

Results land here at the end of Phase 1. Until then, here's what the leaderboard will show and how to read it.

Planned shape

| Model | Tier | Task A (cat. F1) | Task B (BT strength) | Task C (JSON F1) | Task D (faith.) | $/1M tok | p95 latency |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | | pending | pending | pending | pending | pending | pending |
| Claude Haiku 4.5 | | pending | pending | pending | pending | pending | pending |
| GPT-4o | | pending | pending | pending | pending | pending | pending |
| GPT-4o mini | | pending | pending | pending | pending | pending | pending |
| Llama 3.1 70B Q4 | | pending | pending | pending | pending | pending | pending |
| Llama 3.1 8B | | pending | pending | pending | pending | pending | pending |
| Qwen 2.5 14B | | pending | pending | pending | pending | pending | pending |
| Qwen 2.5 7B | | pending | pending | pending | pending | pending | pending |
| DeepSeek R1 distill 8B | | pending | pending | pending | pending | pending | pending |
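
The column labels are abbreviations, so here is a minimal sketch of how two of them are commonly computed. It assumes the F1 columns use the standard harmonic mean of precision and recall, and that "p95 latency" is a nearest-rank 95th percentile over per-request latencies. Both are assumptions; the methodology may define them differently (for example, macro-averaged F1 or an interpolated percentile). The function names `f1_score` and `p95` are hypothetical and not taken from the project.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, false negatives.

    Equivalent to 2PR / (P + R). Returns 0.0 when there are no positives
    at all, which is a convention assumed here, not taken from the source.
    """
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # 1-based rank = ceil(0.95 * n), done in integer arithmetic
    # to avoid floating-point rounding at the boundary.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative inputs only; these are not benchmark results.
    print(f1_score(tp=8, fp=2, fn=2))
    print(p95([float(ms) for ms in range(1, 101)]))
```

Under these assumptions, higher is better for the F1 columns and lower is better for p95 latency, which describes the slow tail of requests rather than the typical one.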

How to read this

Why this is empty right now

The methodology is what makes the leaderboard defensible, not the other way around, so Phase 0 (scoping, pre-flight, dataset construction) is finishing first. Phase 1 then produces these numbers across one held-out lock and a fully pinned model set. Read the methodology →