Open eval suite for the Factlet Protocol.
A reproducible benchmark harness that compares LLM behavior with and without a factbook in context, across Claude, GPT, and Gemini. Pre-registered. MIT-licensed. Raw data published.
What's measured
Three conditions × three frontier models × hand-crafted developer tasks spanning the payments, frontend, and ML pipeline domains. Each response is scored by three judges on five metrics: citation correctness, contradiction count, coverage honesty, quality (1–5), and risk-of-shipping. Inter-judge agreement is reported per metric so readers can discount low-confidence signals.
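To make the size of the design concrete, the grid can be enumerated in a few lines. This is a minimal sketch, not the runner's code: the condition, model, task, and judge identifiers are hypothetical labels, and the task count of six is taken from the N=6 run reported on this page.

```python
from itertools import product

# Hypothetical identifiers for illustration; the real runner may name these differently.
CONDITIONS = ["no_factbook", "naive_markdown", "structured_per_vendor"]
MODELS = ["claude", "gpt", "gemini"]
TASKS = [f"task_{i}" for i in range(1, 7)]  # six tasks, as in the N=6 run
JUDGES = ["judge_a", "judge_b", "judge_c"]
METRICS = [
    "citation_correctness",
    "contradiction_count",
    "coverage_honesty",
    "quality",
    "risk_of_shipping",
]

# One cell = one model response to one task under one condition.
cells = list(product(CONDITIONS, MODELS, TASKS))

# Cells available for any single condition (the N behind per-condition rates).
cells_per_condition = len(MODELS) * len(TASKS)

# Every cell is scored by every judge on every metric.
judgments = len(cells) * len(JUDGES) * len(METRICS)

print(len(cells), cells_per_condition, judgments)
```

Under these assumptions, each condition contributes 18 cells, which is the sample size that per-condition percentages rest on.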
What we found at N=6 (with limits)
Across Claude Sonnet 4.6, GPT-4.1, and Gemini 2.0 Flash on six hand-crafted developer tasks: providing the model with a team-specific factbook (in any reasonable format) reduced harmful shipping recommendations from 61% of cells to 14%, and high-risk shipping recommendations from 61% to 3%. Median quality went from 2.7 to 4.1 on a 1–5 scale. The direction holds when the consensus is recomputed without the same-family Claude judge: the deltas shift by no more than 0.06 points.
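The same-family robustness check can be pictured as a leave-one-judge-out recomputation. The sketch below is an assumption-laden illustration: the scores are made up (not data from the run), the judge names are hypothetical, and the consensus rule is taken to be a plain mean across judges, which the published analysis plan may define differently.

```python
from statistics import mean

# Made-up quality scores for illustration only: judge -> (ungrounded, grounded).
judge_scores = {
    "claude_judge": (2.5, 4.5),
    "gpt_judge": (3.0, 4.0),
    "gemini_judge": (2.5, 4.0),
}


def grounding_delta(scores, exclude=None):
    """Grounded minus ungrounded consensus, optionally dropping one judge.

    Consensus here is the mean across the remaining judges (an assumption).
    """
    kept = [pair for judge, pair in scores.items() if judge != exclude]
    ungrounded = mean(u for u, _ in kept)
    grounded = mean(g for _, g in kept)
    return grounded - ungrounded


full_delta = grounding_delta(judge_scores)
without_claude = grounding_delta(judge_scores, exclude="claude_judge")

# The check passes when the sign is unchanged and the shift is small.
same_direction = (full_delta > 0) == (without_claude > 0)
shift = abs(full_delta - without_claude)
print(same_direction, round(shift, 2))
```

The point of the check is that a judge from the same model family as one of the evaluated models should not be what drives the effect; dropping it and comparing deltas tests that directly.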
Structured per-vendor rendering did not beat naive markdown grounding on outcome metrics in this run. Quality, contradictions, and coverage were comparable; the citation-metric gap is a renderer artifact (the naive renderer in this run did not include factlet IDs).
What's NOT supported by this run
- A single-number aggregate headline.
N=6 with a single task author does not support an "X% better" claim. An aggregate claim is only on the table at Tier 2 (~N=100, externally authored tasks, with vanilla-RAG and vendor-Memory comparators).
- Statistical significance.
Wilson 95% CI on a binary outcome at N=18 cells is ~±25pp. Direction-of-effect signals are visible; magnitudes are not pinned.
- Per-vendor leaderboards.
The eval measures whether grounding affects model behavior, not which model is best.
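The interval width behind the statistical-significance caveat can be checked with the standard Wilson score formula. This is a minimal sketch using only the standard library; the function name is hypothetical and the 9-of-18 input is an illustrative worst case (p = 0.5), not a result from the run.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Illustrative worst case: 9 of 18 cells showing the binary outcome.
low, high = wilson_interval(9, 18)
print(f"{low:.2f} to {high:.2f}")
```

At 18 cells the interval spans roughly 0.29 to 0.71, a half-width of a bit over 20 percentage points and the same order as the figure quoted above, which is why only the direction of an effect is readable at this sample size.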
Pre-registration
The task set, judge prompts, scoring rubric, and analysis plan are pre-registered in tier1/PREREG.md before any runs. Hashes are computed with git ls-tree so that they do not depend on the filesystem or locale. Stopping rules ("no peeking, no early stopping, no rubric iteration in response to numbers") are committed in writing.
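One way to see why git ls-tree gives a stable hash: its output lists the mode, object type, object ID, and path of each tracked file straight from git's object database, so checkout line endings, file timestamps, and locale-dependent sort order never enter the digest. The sketch below is an assumption about how such a hash could be computed, not the procedure recorded in PREREG.md; the function names and the choice of SHA-256 are hypothetical.

```python
import hashlib
import subprocess


def tree_listing(path, rev="HEAD", repo="."):
    """Return the raw `git ls-tree -r -z` listing for `path` at `rev`.

    -r recurses into subdirectories; -z emits NUL-terminated entries so
    path quoting settings cannot change the bytes being hashed.
    """
    result = subprocess.run(
        ["git", "-C", repo, "ls-tree", "-r", "-z", rev, "--", path],
        check=True,
        capture_output=True,
    )
    return result.stdout


def listing_digest(listing: bytes) -> str:
    """SHA-256 hex digest of a listing (hash choice is an assumption)."""
    return hashlib.sha256(listing).hexdigest()


# Example, to be run from the root of a clone of the repo:
#   print(listing_digest(tree_listing("tier1/tasks")))
```

Because the listing already contains the object ID of every file, any change to a task or factbook after pre-registration changes the digest.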
Contributing tasks
Every task in the current scaffold is authored by the protocol author. The next eval run is gated on ≥5 externally-authored tasks landing. Domains especially welcome: security (auth, IAM, secret handling), devops (Terraform, Kubernetes, CI policy), data engineering (schema, pipeline conventions). Quality bar in CONTRIBUTING.md.
Reproducing
Clone the repo, install the runner (~400 lines of Python), set your API keys, run. ~$2.50 and ~50 minutes for the full pipeline at the current scaffold size.
git clone https://github.com/factlet-ai/evals
cd evals/runner && pip install -e .
export ANTHROPIC_API_KEY=… OPENAI_API_KEY=… GOOGLE_API_KEY=…
python validate.py --tasks ../tier1/tasks --factbooks ../tier1/factbooks
python run.py --tasks ../tier1/tasks --factbooks ../tier1/factbooks \
    --output ../results/$(date +%Y-%m-%d)