A finding flagged by a single model proves little. The same finding, reproduced independently by seven, is evidence. Our scoring table rewards reproduction and discounts noise.
The final score begins at 100. Each confirmed finding deducts a penalty determined by two axes: the severity of the finding and the number of independent models that flagged it. A single model reporting a critical vulnerability costs twelve points. Seven models reporting the same critical costs sixty. The gap is the method.
| Severity | 1 model | 2–3 models | 4–6 models | 7+ models |
|---|---|---|---|---|
| Critical | 12 | 24 | 40 | 60 |
| High | 6 | 12 | 20 | 30 |
| Medium | 3 | 6 | 10 | 15 |
| Low | 1 | 2 | 3 | 5 |
| Info | 0 | 0 | 0 | 0 |
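Read as a lookup, the table reduces to a few lines. The following is a minimal sketch, not the scorer's actual implementation: the names `PENALTIES`, `penalty`, and `final_score` are hypothetical, and each finding is assumed to arrive as a `(severity, model_count)` pair. The text does not say whether the score is floored at zero, so the sketch leaves it unclamped.

```python
# Penalty table from above: severity -> (1 model, 2-3, 4-6, 7+).
PENALTIES = {
    "critical": (12, 24, 40, 60),
    "high": (6, 12, 20, 30),
    "medium": (3, 6, 10, 15),
    "low": (1, 2, 3, 5),
    "info": (0, 0, 0, 0),
}


def penalty(severity: str, model_count: int) -> int:
    """Return the deduction for one finding flagged by `model_count` models."""
    if model_count < 1:
        raise ValueError("a finding needs at least one reporting model")
    if model_count == 1:
        band = 0
    elif model_count <= 3:
        band = 1
    elif model_count <= 6:
        band = 2
    else:
        band = 3
    return PENALTIES[severity.lower()][band]


def final_score(findings) -> int:
    """Start at 100 and subtract the penalty of every confirmed finding.

    `findings` is an iterable of (severity, model_count) pairs.
    Assumption: no floor at zero, since the text does not specify one.
    """
    return 100 - sum(penalty(sev, count) for sev, count in findings)
```

For example, one critical flagged by a single model, one high flagged by five, and one low flagged by two deduct 12 + 20 + 2, leaving a score of 66.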
We commission ten contemporary frontier models for each audit, drawn from independent laboratories across multiple jurisdictions. No model audits code produced by its own vendor. The panel is constructed to maximise independence: different training data, different architectures, different institutional incentives.
| Model | Laboratory | Origin | Class | Released |
|---|---|---|---|---|
| Grok 4.3 | xAI | US | Frontier | Mar 2026 |
| Claude Opus 4.7 | Anthropic | US | Frontier | Apr 2026 |
| Gemini 3.1 Pro | Google DeepMind | US | Frontier | Feb 2026 |
| GPT-5.5 Pro | OpenAI | US | Frontier | Apr 2026 |
| Llama 4 Maverick | Meta | US | Open weights | Jan 2026 |
| Qwen 3.6 Plus | Alibaba | China | Frontier | Mar 2026 |
| MiniMax M2.7 | MiniMax | China | Frontier | Feb 2026 |
| Kimi K2.6 | Moonshot AI | China | Frontier | Mar 2026 |
| Codestral 2508 | Mistral | France | Specialist | Aug 2025 |
| DeepSeek V4 Pro | DeepSeek | China | Frontier | Apr 2026 |
The panel is reviewed quarterly. Models are seated when they reach state-of-the-art on independent benchmarks, and retired when superseded.
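The rule that no model audits its own vendor's code can be expressed as a filter over the panel. This is a sketch under assumptions: `PANEL` and `eligible_panel` are hypothetical names, and "vendor" is taken to mean the Laboratory column of the table above. Whether an excluded seat is filled by a substitute is not stated, so the sketch simply drops it.

```python
# (model, laboratory) pairs copied from the panel table.
PANEL = [
    ("Grok 4.3", "xAI"),
    ("Claude Opus 4.7", "Anthropic"),
    ("Gemini 3.1 Pro", "Google DeepMind"),
    ("GPT-5.5 Pro", "OpenAI"),
    ("Llama 4 Maverick", "Meta"),
    ("Qwen 3.6 Plus", "Alibaba"),
    ("MiniMax M2.7", "MiniMax"),
    ("Kimi K2.6", "Moonshot AI"),
    ("Codestral 2508", "Mistral"),
    ("DeepSeek V4 Pro", "DeepSeek"),
]


def eligible_panel(panel, code_vendor: str):
    """Return the panel members whose laboratory is not the code's vendor.

    Assumption: vendor identity is compared by laboratory name,
    ignoring case.
    """
    vendor = code_vendor.casefold()
    return [(model, lab) for model, lab in panel if lab.casefold() != vendor]
```

Calling `eligible_panel(PANEL, "Anthropic")` would return the nine remaining models, while a vendor that is not on the panel leaves all ten seated.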
A separate model, the scorer, adjudicates the ten reports. It reads all individual findings, identifies consensus, resolves conflicts, and computes the final score per the penalty table above. The scorer is never a member of the panel. Its sole function is to weigh evidence, not to produce it.
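The consensus step amounts to counting how many distinct models reported each finding. The sketch below is illustrative only: it assumes every finding has already been normalised to a shared key (how the scorer matches equivalent findings across reports is not described here), and `count_consensus` is a hypothetical name.

```python
from collections import defaultdict


def count_consensus(reports):
    """Map each finding key to the number of distinct models that flagged it.

    `reports` maps a model name to an iterable of finding keys.
    Assumption: equivalent findings already share the same key.
    """
    models_by_finding = defaultdict(set)
    for model, finding_keys in reports.items():
        for key in finding_keys:
            # A set ensures a model repeating a finding counts once.
            models_by_finding[key].add(model)
    return {key: len(models) for key, models in models_by_finding.items()}
```

The resulting counts select the column of the penalty table for each finding. Using sets means a model that repeats the same finding within its report still counts as one independent voice.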
Every report carries the combined SHA-256 of the source as audited. Anyone may re-run the audit against the same hash. The models may be updated and the findings may shift, but the source under examination is pinned and verifiable.
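How the per-file hashes are combined is not specified here. One plausible scheme, shown purely as an assumption, hashes each file, sorts by relative path, and feeds the `digest  path` lines into an outer SHA-256. The names `combine` and `combined_sha256` are hypothetical.

```python
import hashlib
from pathlib import Path


def combine(files: dict[str, bytes]) -> str:
    """Combine per-file SHA-256 digests into one hex digest.

    `files` maps a relative POSIX path to the file's bytes.
    Assumption: entries are ordered by path and serialised as
    "<hex digest>  <path>\\n"; the real scheme may differ.
    """
    outer = hashlib.sha256()
    for rel_path in sorted(files):
        digest = hashlib.sha256(files[rel_path]).hexdigest()
        outer.update(f"{digest}  {rel_path}\n".encode("utf-8"))
    return outer.hexdigest()


def combined_sha256(root: str) -> str:
    """Compute the combined digest for every regular file under `root`."""
    root_path = Path(root)
    files = {
        p.relative_to(root_path).as_posix(): p.read_bytes()
        for p in root_path.rglob("*")
        if p.is_file()
    }
    return combine(files)
```

Sorting by path makes the result independent of filesystem traversal order, and including the path in each line means a renamed file changes the digest even when its contents do not.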