Pre-execution scoping · Phase 1 in progress

Open-weight vs frontier LLMs, evaluated on a real production workload.

A defensible cost/quality leaderboard comparing 5 open-weight models (Llama, Qwen, and DeepSeek, served on a local DGX Spark via Ollama) against 4 frontier API models (Anthropic and OpenAI) on Sift's production news pipeline.

Four tasks. Held-out discipline that's verifiable from the commit history. Cross-vendor judging to control for self-preference bias. Hardware-amortized cost methodology.

Read the methodology → · Executive summary

Methodology

Cross-vendor judging

Sonnet 4.6 judges non-Anthropic pairs; GPT-4o judges Anthropic pairs. Both judges also score a 50-pair calibration overlap, and their agreement on it must meet a floor of Cohen's κ ≥ 0.6.
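A minimal sketch of the calibration check, assuming each judge's verdicts on the 50 overlap pairs are stored as a list of categorical labels in the same order. The function names `cohens_kappa` and `passes_calibration` and the label values are illustrative, not taken from the study's code.

```python
from collections import Counter

KAPPA_FLOOR = 0.6  # agreement floor stated above


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def passes_calibration(labels_a, labels_b, floor=KAPPA_FLOOR):
    """True when inter-judge agreement meets the floor."""
    return cohens_kappa(labels_a, labels_b) >= floor
```

Kappa corrects raw agreement for the agreement expected by chance, so two judges that both pick the same side most of the time do not clear the floor on that basis alone.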

Reproducibility

Verifiable held-out

The 20% held-out set is SHA-256 hashed before iteration begins, and the hash is committed to git before any prompt tuning. Anyone can check the commit history to verify the bound wasn't crossed.
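A reviewer's verification step could look roughly like the sketch below, assuming the held-out set is stored as a single file and the committed hash is a hex string. The helper names and the idea of a single file are assumptions for illustration; the study's actual layout may differ.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_committed_hash(path, committed_hex):
    """Compare the file's current digest with the hex digest recorded in git."""
    return sha256_of_file(path) == committed_hex.strip().lower()
```

Because the digest covers raw bytes, any change to the file after the hash was committed, including whitespace or record reordering, produces a mismatch.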

Cost model

Hardware-amortized

Real DGX Spark capex, Florida electricity rates (per kWh), and utilization, compared against published API rates. Two views: individual developer cost and fully-loaded production cost.
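One way to turn those inputs into a per-token figure is sketched below. Every field name is hypothetical, and the simplifications are assumptions rather than the study's method: straight-line amortization over a fixed lifetime, power counted only during busy hours, and a single average throughput.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class LocalHardware:
    capex_usd: float          # purchase price of the machine
    lifetime_years: float     # amortization period (assumed straight-line)
    avg_power_watts: float    # average draw while serving inference
    usd_per_kwh: float        # local electricity rate
    utilization: float        # fraction of wall-clock hours spent on inference, 0 < u <= 1
    tokens_per_second: float  # average generation throughput while busy


def local_cost_per_million_tokens(hw):
    """Amortized capex plus energy, expressed in USD per million generated tokens."""
    if not 0.0 < hw.utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    busy_hours = hw.lifetime_years * HOURS_PER_YEAR * hw.utilization
    capex_per_busy_hour = hw.capex_usd / busy_hours
    # Idle power draw is ignored here; a fuller model would add it.
    energy_per_busy_hour = hw.avg_power_watts / 1000.0 * hw.usd_per_kwh
    tokens_per_busy_hour = hw.tokens_per_second * 3600.0
    return (capex_per_busy_hour + energy_per_busy_hour) / tokens_per_busy_hour * 1_000_000
```

The result is in the same unit that API price lists typically use, so it can be set directly against a published per-million-token rate. Lower utilization spreads the same capex over fewer busy hours, which is why it appears as an explicit input.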

What's on this site

Methodology — full study design, scoring, statistical treatment, and cost model. The substantive page.
Leaderboard — results land at the end of Phase 1. Placeholder for now with the planned shape.
Executive summary — one-pager for hiring managers and senior reviewers.