Pre-execution scoping · Phase 1 in progress

Open-weight vs frontier LLMs, evaluated on a real production workload.

A defensible cost/quality leaderboard comparing 5 open-weight models (Llama, Qwen, and DeepSeek, run on a local DGX Spark via Ollama) against 4 frontier APIs (Anthropic + OpenAI) on Sift's production news pipeline.

Four tasks. Held-out discipline that's verifiable from the commit history. Cross-vendor judging to control for self-preference bias. Hardware-amortized cost methodology.

Methodology
Cross-vendor judging

Sonnet 4.6 judges non-Anthropic pairs; GPT-4o judges Anthropic pairs. Both judges also score a shared 50-pair calibration set, where their agreement must reach a floor of Cohen's κ ≥ 0.6.
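
A minimal sketch of how the κ floor could be checked on the calibration overlap, assuming each judge emits one categorical verdict per pair. The function name, the label values, and the toy verdict lists are illustrative assumptions, not the study's actual code or data.

```python
from collections import Counter

KAPPA_FLOOR = 0.6  # agreement floor stated in the methodology


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy verdicts for 50 calibration pairs (hypothetical, for illustration only).
judge_a = ["A"] * 25 + ["B"] * 25
judge_b = ["A"] * 20 + ["B"] * 5 + ["B"] * 20 + ["A"] * 5

kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa = {kappa:.3f}")
```

Because κ discounts agreement expected by chance, it is a stricter gate than raw percent agreement when one verdict dominates.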

Reproducibility
Verifiable held-out

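A 20% held-out set is SHA-256 hashed before iteration begins, and the hash is committed to git before any prompt tuning. Anyone can recompute the hash and compare it against the commit history to verify the held-out boundary wasn't crossed.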
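
One way such a hash could be produced is sketched below, assuming the held-out examples are JSON-serializable records. The canonicalization scheme (sorted keys, sorted lines) and the record fields are assumptions made for illustration; the study's own serialization may differ.

```python
import hashlib
import json


def held_out_digest(examples):
    """SHA-256 over a canonical serialization of the held-out examples.

    Each record is dumped with sorted keys, and the resulting lines are
    sorted, so the digest does not depend on key order or record order.
    """
    lines = sorted(
        json.dumps(ex, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for ex in examples
    )
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


# Hypothetical records; real ones would come from the pipeline's dataset.
held_out = [
    {"id": 1, "text": "first example"},
    {"id": 2, "text": "second example"},
]

committed = held_out_digest(held_out)  # the value that would be committed to git
recomputed = held_out_digest(list(reversed(held_out)))
print(committed == recomputed)
```

A matching digest shows the set is byte-for-byte the one that was frozen. Whether the commit predates prompt tuning is read from the git history itself.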

Cost model
Hardware-amortized

Real DGX Spark capex, Florida electricity rates (per kWh), and utilization, compared against published API rates. Two views: individual developer cost and fully-loaded production cost.
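
A rough sketch of what a hardware-amortized cost per million tokens could look like, assuming straight-line amortization over a fixed lifetime and a constant power draw while active. The formula, the parameter names, and every number in the example call are placeholders rather than the study's inputs.

```python
HOURS_PER_YEAR = 8760


def cost_per_million_tokens(
    capex_usd,          # purchase price of the machine
    lifetime_years,     # assumed amortization period
    utilization,        # fraction of wall-clock hours doing useful work, in (0, 1]
    power_watts,        # assumed average draw while active
    usd_per_kwh,        # local electricity rate
    tokens_per_second,  # measured throughput for the model under test
):
    """Amortized hardware plus energy cost to generate one million tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Spread capex over only the hours the machine is actually busy.
    active_hours = lifetime_years * HOURS_PER_YEAR * utilization
    capex_per_hour = capex_usd / active_hours
    energy_per_hour = (power_watts / 1000.0) * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1_000_000


# Placeholder inputs only; substitute measured values before comparing to API rates.
example = cost_per_million_tokens(
    capex_usd=8760.0,
    lifetime_years=1.0,
    utilization=0.5,
    power_watts=1000.0,
    usd_per_kwh=0.10,
    tokens_per_second=1_000_000 / 3600,
)
print(f"${example:.2f} per million tokens")
```

Lower utilization raises the capex share per token, so the same function could plausibly serve both views by varying that one input.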

What's on this site

  • Methodology — full study design, scoring, statistical treatment, and cost model. The substantive page.
  • Leaderboard — results land at the end of Phase 1. Placeholder for now with the planned shape.
  • Executive summary — one-pager for hiring managers and senior reviewers.