Phase 1 in progress
Leaderboard
Results land here at the end of Phase 1. Until then, here's what the leaderboard will show and how to read it.
Planned shape
| Model | Tier | Task A (categorization F1) | Task B (BT strength) | Task C (JSON F1) | Task D (faithfulness) | $/1M tok | p95 latency |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | — | pending | pending | pending | pending | pending | pending |
| Claude Haiku 4.5 | — | pending | pending | pending | pending | pending | pending |
| GPT-4o | — | pending | pending | pending | pending | pending | pending |
| GPT-4o mini | — | pending | pending | pending | pending | pending | pending |
| Llama 3.1 70B Q4 | — | pending | pending | pending | pending | pending | pending |
| Llama 3.1 8B | — | pending | pending | pending | pending | pending | pending |
| Qwen 2.5 14B | — | pending | pending | pending | pending | pending | pending |
| Qwen 2.5 7B | — | pending | pending | pending | pending | pending | pending |
| DeepSeek R1 distill 8B | — | pending | pending | pending | pending | pending | pending |
How to read this
- Tier splits models into two groups: deployment-feasible and quality-ceiling. Llama 3.1 70B Q4 on a single DGX Spark is a quality reference, not a deployment-ready candidate at the throughput Sift needs.
- Task A is multi-label news categorization. Macro-F1 with bootstrap CI; label noise rate computed from a 100-article re-validation pass.
- Task B is summarization. Bradley-Terry strength across 36 pairwise comparisons, fit with the MM algorithm. Cross-vendor judging (see methodology) controls for self-preference bias.
- Task C is structured entity extraction. Two metrics, not one: JSON schema validity rate AND entity F1 conditional on validity, so “great extractor, dropped a brace” isn't scored the same as “couldn't parse the article.”
- Task D is grounded summarization with citation faithfulness on multi-article topic clusters.
- $/1M tok uses a hardware-amortized cost model for local models and published rates for API models. The dual view (individual-developer / fully-loaded production) lives on the methodology page.
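To make the Task A metric concrete, here is a minimal sketch of macro-F1 with a percentile bootstrap confidence interval for multi-label predictions. The function names, the toy labels, the resample count, and the convention that a label with no support and no predictions scores 0 are all assumptions for illustration; the methodology page defines the actual procedure.

```python
import random

def macro_f1(y_true, y_pred, labels):
    """Macro-F1 for multi-label data: per-label F1, averaged with equal weight."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label that never appears scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def bootstrap_ci(y_true, y_pred, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample articles with replacement, recompute macro-F1."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx], labels))
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Toy data (hypothetical): each article carries a set of category labels.
labels = ["politics", "sports", "economy"]
y_true = [{"politics"}, {"sports"}, {"politics", "economy"}, {"economy"}]
y_pred = [{"politics"}, {"sports"}, {"politics"}, {"economy", "sports"}]

point = macro_f1(y_true, y_pred, labels)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, labels)
print(f"macro-F1 = {point:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling whole articles (rather than individual labels) keeps each article's label set intact, which is the usual choice when labels within an article are correlated.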
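For Task B, the MM algorithm for Bradley-Terry strengths can be sketched in a few lines. This is a generic version of the standard update, not the project's fitting code: the `wins` matrix, the iteration cap, and the tolerance are hypothetical, and the sketch assumes every model has at least one win and one loss so the estimate is well defined.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    k = len(wins)
    p = [1.0 / k] * k
    for _ in range(n_iter):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            # Each pair contributes (comparisons between i and j) / (p_i + p_j).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            new_p.append(total_wins / denom if denom else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p

# Toy judgments (hypothetical): three models, four comparisons per pair.
wins = [
    [0, 3, 4],  # model 0 beat model 1 three times, model 2 four times
    [1, 0, 3],
    [0, 1, 0],
]
strengths = fit_bradley_terry(wins)
print([round(s, 3) for s in strengths])
```

A strength is only meaningful relative to the others: under the model, the probability that model i is preferred over model j is `p[i] / (p[i] + p[j])`.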
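The two Task C metrics can be illustrated as follows. This sketch assumes a toy schema (an object with an `entities` list of strings) and micro-averaged F1 over exact string matches; the real schema and matching rules are defined in the methodology, and the helper names here are hypothetical.

```python
import json

def parse_entities(raw):
    """Return the entity set if `raw` is valid under the toy schema, else None."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    entities = doc.get("entities")
    if not isinstance(entities, list) or not all(isinstance(e, str) for e in entities):
        return None
    return set(entities)

def score_extraction(raw_outputs, gold_sets):
    """Return (validity rate, entity F1 computed over valid outputs only)."""
    valid = tp = fp = fn = 0
    for raw, gold in zip(raw_outputs, gold_sets):
        pred = parse_entities(raw)
        if pred is None:
            continue  # counts against validity, not against entity F1
        valid += 1
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    validity_rate = valid / len(raw_outputs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return validity_rate, f1

# Toy outputs (hypothetical): the second one is missing its closing brace.
raw_outputs = [
    '{"entities": ["Reuters", "Paris"]}',
    '{"entities": ["ECB", "Berlin"]',
    '{"entities": ["NASA"]}',
]
gold_sets = [{"Reuters", "Paris"}, {"ECB", "Frankfurt"}, {"NASA", "ESA"}]

validity_rate, entity_f1 = score_extraction(raw_outputs, gold_sets)
print(f"validity = {validity_rate:.3f}, entity F1 (valid only) = {entity_f1:.3f}")
```

Reporting the pair keeps the two failure modes separate: the malformed second output lowers the validity rate, while the missed `ESA` entity lowers the conditional F1.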
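As a rough illustration of what a hardware-amortized $/1M tok figure involves, the sketch below spreads a hardware purchase over its productive hours, adds power, and divides by throughput. The formula, the parameter names, and every number are assumptions chosen for easy arithmetic, not measurements or the project's actual cost model.

```python
def local_cost_per_million_tokens(hardware_usd, lifetime_years, utilization,
                                  watts, usd_per_kwh, tokens_per_second):
    """Amortized hardware cost plus power cost, per one million tokens."""
    # Hours over the hardware's lifetime in which it is actually serving tokens.
    productive_hours = lifetime_years * 365 * 24 * utilization
    hardware_per_hour = hardware_usd / productive_hours
    # Simplification: power is only counted while the machine is busy.
    power_per_hour = (watts / 1000) * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return (hardware_per_hour + power_per_hour) / tokens_per_hour * 1_000_000

# Placeholder inputs (hypothetical): $1/hour of hardware, $0.50/hour of power.
cost = local_cost_per_million_tokens(
    hardware_usd=8760, lifetime_years=1, utilization=1.0,
    watts=1000, usd_per_kwh=0.5, tokens_per_second=100,
)
print(f"${cost:.2f} per 1M tokens")
```

The result is sensitive to utilization: halving it doubles the hardware share of the cost, which is one reason an individual-developer view and a fully-loaded production view can differ widely.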
Why this is empty right now
The methodology is what makes the leaderboard defensible, not the other way around. Phase 0 (scoping, pre-flight, dataset construction) is finishing first. Phase 1 produces these numbers across one held-out lock and a fully-pinned model set.
Read the methodology →