Phase 1 in progress

Leaderboard

Results land here at the end of Phase 1. Until then, here's what the leaderboard will show and how to read it.

Planned shape

Model	Tier	Task A (cat. F1)	Task B (BT strength)	Task C (JSON F1)	Task D (faith.)	$/1M tok	p95 latency
Claude Sonnet 4.6	—	pending	pending	pending	pending	pending	pending
Claude Haiku 4.5	—	pending	pending	pending	pending	pending	pending
GPT-4o	—	pending	pending	pending	pending	pending	pending
GPT-4o mini	—	pending	pending	pending	pending	pending	pending
Llama 3.1 70B Q4	—	pending	pending	pending	pending	pending	pending
Llama 3.1 8B	—	pending	pending	pending	pending	pending	pending
Qwen 2.5 14B	—	pending	pending	pending	pending	pending	pending
Qwen 2.5 7B	—	pending	pending	pending	pending	pending	pending
DeepSeek V2 Lite	—	pending	pending	pending	pending	pending	pending

How to read this

Tier splits models into deployment-feasible vs quality-ceiling. Llama 3.1 70B Q4 on a single DGX Spark is a quality reference, not a deployment-ready candidate at the throughput Sift needs.
Task A is single-label news categorization. Macro-F1 with bootstrap CI; label noise rate computed from a 100-article re-validation pass.
Task B is summarization. Bradley-Terry strength across 36 pairwise comparisons, fit with the MM algorithm. Cross-vendor judging (see methodology) controls for self-preference bias.
Task C is structured entity extraction. Two metrics, not one: JSON schema validity rate and entity F1 conditional on validity, so “great extractor, dropped a brace” isn't scored the same as “couldn't parse the article.”
Task D is grounded summarization with citation faithfulness on multi-article topic clusters.
$/1M tok uses a hardware-amortized cost model for local models and published rates for API models. Dual view (individual-developer / fully-loaded production) lives in the methodology page.
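
For Task A, here is a minimal sketch of macro-F1 with a percentile bootstrap CI. It is an illustration, not the project's scoring code: the function names, the 1,000 resamples, the 95% level, and the choice to score a class as 0 when it is absent from a resample are all assumptions.

```python
import random

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def bootstrap_ci(y_true, y_pred, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample articles with replacement, rescore."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Resampling whole articles keeps each gold label paired with its prediction, so the interval reflects sampling variation in the test set rather than in the model.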
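
For Task B, the Bradley-Terry fit can be sketched with the standard MM update: each strength becomes its total wins divided by the sum, over opponents, of games played over combined strength, followed by renormalization. The `wins` matrix layout, iteration cap, and tolerance below are assumptions for illustration, not the project's implementation.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the MM algorithm.

    wins[i][j] is the number of comparisons in which item i beat item j
    (hypothetical input layout). Returns strengths normalized to sum to 1.
    Assumes at least one comparison has been recorded.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # An item with no comparisons keeps its current strength.
            new_p.append(total_wins / denom if denom else p[i])
        total = sum(new_p)
        new_p = [x / total for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p
```

One caveat with this sketch: an item that never wins is driven toward strength 0, so a finite fit needs every model to win at least once somewhere in the comparison graph.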
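
For Task C, the two-metric split can be sketched as follows. The schema here (an object with an `entities` list of strings) is a hypothetical stand-in for the real one, and micro-averaged F1 over exact string matches is an assumption; the point is only that invalid outputs lower the validity rate without entering the F1.

```python
import json

def parse_entities(raw):
    """Return the predicted entity set, or None if the output is invalid.

    Hypothetical schema: {"entities": ["...", "..."]}.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    ents = obj.get("entities")
    if not isinstance(ents, list) or not all(isinstance(e, str) for e in ents):
        return None
    return set(ents)

def score_extraction(outputs, gold):
    """outputs: raw model strings; gold: one set of gold entities per article.

    Returns (validity_rate, f1), where f1 is computed only over valid
    outputs and is None when no output is valid.
    """
    tp = fp = fn = valid = 0
    for raw, gold_set in zip(outputs, gold):
        pred = parse_entities(raw)
        if pred is None:
            continue  # counts against validity, excluded from F1
        valid += 1
        tp += len(pred & gold_set)
        fp += len(pred - gold_set)
        fn += len(gold_set - pred)
    validity_rate = valid / len(outputs) if outputs else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else None
    return validity_rate, f1
```

With one well-formed output and one truncated output, this reports a validity rate of 0.5 and an F1 that reflects only the well-formed one.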
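
For the $/1M tok column, one way a hardware-amortized figure for a local model could be computed is sketched below. Every parameter is a hypothetical input, and the formula assumes the machine runs at full utilization over its amortization period; the methodology page defines the actual model.

```python
def local_cost_per_million_tokens(
    hardware_usd, lifetime_hours, power_watts, usd_per_kwh, tokens_per_second
):
    """Amortized hardware cost plus electricity, per 1M generated tokens.

    All arguments are illustrative inputs, not measured values.
    """
    hourly_hardware = hardware_usd / lifetime_hours
    hourly_power = (power_watts / 1000.0) * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600.0
    return (hourly_hardware + hourly_power) / tokens_per_hour * 1_000_000
```

Lower utilization raises the effective cost, because the hardware term is then spread over fewer tokens.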

Why this is empty right now

The methodology is what makes the leaderboard defensible — not the other way around. Phase 0 (scoping, pre-flight, dataset construction) is finishing first. Phase 1 produces these numbers across one held-out lock and a fully-pinned model set. Read the methodology →