Open-weight vs frontier LLMs, evaluated on a real production workload.
A defensible cost/quality leaderboard comparing 5 open-weight models (Llama, Qwen, and DeepSeek, served locally on a DGX Spark via Ollama) against 4 frontier APIs (Anthropic and OpenAI) on Sift's production news pipeline.
Four tasks. Held-out discipline that's verifiable from the commit history. Cross-vendor judging to control for self-preference bias. Hardware-amortized cost methodology.
Sonnet 4.6 judges non-Anthropic pairs; GPT-4o judges Anthropic pairs. Both judges also score a 50-pair calibration overlap, where their agreement must meet a floor of Cohen's κ ≥ 0.6.
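The agreement check on the calibration overlap can be sketched as follows. This is a minimal illustration, not the study's actual code: the function names, the `KAPPA_FLOOR` constant, and the idea that each judge emits one categorical verdict per pair (for example "A", "B", or "tie") are assumptions made for the example.

```python
from collections import Counter

KAPPA_FLOOR = 0.6  # agreement floor quoted above


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    labels_a and labels_b are equal-length sequences of categorical
    verdicts (hypothetical format: "A", "B", "tie").
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where the verdicts match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch treats it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def meets_floor(labels_a, labels_b, floor=KAPPA_FLOOR):
    """True if the two judges agree at or above the floor."""
    return cohens_kappa(labels_a, labels_b) >= floor
```

In this sketch, `labels_a` and `labels_b` would be the two judges' verdicts on the same 50 overlap pairs, in the same order.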
The 20% held-out set is SHA-256 hashed before iteration begins, and the hash is committed to git before any prompt tuning. Anyone can verify from the commit history that the held-out boundary wasn't crossed.
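A hash commitment of this kind can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the project's actual procedure: it assumes the held-out set can be identified by a list of item IDs, and the function names and canonical form (sorted IDs, newline-joined, UTF-8) are hypothetical choices.

```python
import hashlib


def held_out_digest(item_ids):
    """SHA-256 hex digest of a held-out set, independent of item order.

    item_ids is a hypothetical list of string identifiers. Sorting
    before hashing means the same set always gives the same digest.
    """
    canonical = "\n".join(sorted(item_ids)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify_held_out(item_ids, committed_digest):
    """Recompute the digest and compare it with the one committed to git."""
    return held_out_digest(item_ids) == committed_digest
```

A verifier would read the digest from the early commit, recompute it from the held-out set as finally published, and check that the two match. Any item added, removed, or swapped after the commit changes the digest.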
Local cost combines real DGX Spark capex, Florida electricity rates (per kWh), and utilization, and is compared against published API rates. Two views are reported: individual developer cost and fully loaded production cost.
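One way to turn those inputs into a per-token figure is sketched below. The formula, the function name, and every parameter are assumptions made for illustration; the Methodology page defines the actual cost model. In particular, this sketch spreads capex over productive hours only and ignores idle power draw.

```python
def local_cost_per_million_tokens(
    capex_usd,          # hardware purchase price
    lifetime_years,     # assumed amortization period
    utilization,        # fraction of wall-clock time spent serving, in (0, 1]
    power_watts,        # assumed average draw under load
    usd_per_kwh,        # electricity rate
    tokens_per_second,  # measured throughput for the model
):
    """Hardware-amortized cost in USD per one million generated tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if tokens_per_second <= 0 or lifetime_years <= 0:
        raise ValueError("throughput and lifetime must be positive")
    # Capex is spread over the hours the machine is actually serving.
    productive_hours = lifetime_years * 365 * 24 * utilization
    capex_per_hour = capex_usd / productive_hours
    # Energy cost for one hour of serving.
    energy_per_hour = (power_watts / 1000) * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1_000_000
```

The result is in the same unit as published API pricing (USD per million tokens), so the two can be compared directly. Lower utilization raises the local figure, because the same capex is spread over fewer productive hours.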
What's on this site
- Methodology — full study design, scoring, statistical treatment, and cost model. The substantive page.
- Leaderboard — results land at the end of Phase 1. Placeholder for now with the planned shape.
- Executive summary — one-pager for hiring managers and senior reviewers.