2026-05-15

Why we don't publish competitor benchmark numbers (yet)

The most common question I get is 'how does VeracityAPI compare to GPTZero on accuracy?' The honest answer is that I don't have a number I'm willing to publish — and here's the system we're building to produce one.


The most common question I get on sales calls and Twitter DMs is some version of: 'how does VeracityAPI compare to GPTZero on accuracy?'

The honest answer is that I don't have a number I'm willing to publish. Not because the comparison would be unflattering — I don't actually know what it would show, which is the point — but because the comparison done badly is worse than no comparison at all. This post is the long-form version of why, and what we're building instead.

Most published benchmarks in the AI-detection category fall into one of three problematic patterns.

Pattern 1: The vendor-self-benchmark. A vendor publishes its own accuracy numbers against a hand-picked corpus of its own design. The numbers always look good. They don't replicate when independent researchers try to reproduce them. The benchmark serves as marketing collateral, not as evaluation. Every reader knows this is happening and discounts the numbers accordingly, which is rational but doesn't help anyone make a buying decision.

Pattern 2: The leaderboard-game. A third party publishes a benchmark with results across multiple vendors. The leaderboard becomes the metric vendors optimize for. Within a quarter, every vendor's published numbers cluster within a fraction of a point of each other — not because the vendors converged on quality, but because they all overfit to the public corpus. The leaderboard stops being informative.

Pattern 3: The 'I ran some tests' blog post. A practitioner publishes their own informal benchmark on a small dataset, usually with a specific failure mode the practitioner cares about. The post goes viral. Within a week, half the threads in the category are referencing it as ground truth. The practitioner did good work, but a small informal benchmark gets cited as if it were a comprehensive study.

All three patterns share a root problem: the benchmark isn't structured to survive scrutiny. The corpus isn't licensed and reproducible. The methodology isn't documented in a way that supports replication. The results aren't versioned with the model versions at the time of the run. Six months later, no one can recreate the run or check the work.

The 2026 benchmark program we're building is structured specifically to avoid those patterns. The design constraints I want to share publicly:

First: licensed, frozen corpus. The text corpus is 1,000 samples drawn from a mix of human-written, AI-generated, polished-AI-with-specifics, and adversarial categories. The image corpus is a smaller pilot (120 items). Every item has a documented licensing path, so anyone who obtains the same corpus under the same licenses can reproduce the run.
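To make "frozen" concrete, here is a minimal sketch of one way a corpus could be pinned with content hashes. It is an illustration under stated assumptions, not the actual VeracityAPI tooling: the item fields (`id`, `category`, `license`, `text`), the sample IDs, and the `freeze_corpus` helper are all hypothetical.

```python
import hashlib
import json


def freeze_corpus(items):
    """Return (manifest_entries, corpus_digest) for a list of corpus items.

    Each item is a dict with hypothetical keys: id, category, license, text.
    The digest depends only on item content, not on the order items arrive in.
    """
    entries = []
    for item in sorted(items, key=lambda it: it["id"]):
        entries.append({
            "id": item["id"],
            "category": item["category"],
            "license": item["license"],
            # Hash the text so any later edit to a sample is detectable.
            "sha256": hashlib.sha256(item["text"].encode("utf-8")).hexdigest(),
        })
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    corpus_digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return entries, corpus_digest


# Hypothetical two-item corpus, for illustration only.
sample_items = [
    {"id": "t-002", "category": "adversarial", "license": "CC-BY-4.0",
     "text": "Second sample text."},
    {"id": "t-001", "category": "human-written", "license": "CC-BY-4.0",
     "text": "First sample text."},
]
manifest, digest = freeze_corpus(sample_items)
```

Anyone holding the same items can recompute the digest and confirm they are scoring against the same corpus; a single changed character in any sample produces a different digest.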

Second: vendor terms cleared before any vendor numbers appear. Most detection vendors have terms of service that govern benchmark publication. We're working through those terms vendor by vendor before publishing any competitor-specific numbers. If a vendor's terms preclude the kind of benchmark we want to publish, we'll publish the structural results (corpus composition, methodology, our own numbers) without that vendor's specific numbers and disclose why.

Third: both binary-flagging F1 AND routing-action F1 reported. This is the most important commitment. The 'AI-detection' category is splitting into two product categories (I wrote a longer post about this) and reporting only one metric obscures the comparison. Binary-flagging F1 is the metric Category-1 products (authorship-likelihood detectors) optimize for; routing-action F1 is the metric VeracityAPI is built around. We'll publish both. Some products will look better on one; we expect to look better on the other.
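The difference between the two metrics is easier to see with numbers. The sketch below computes a binary F1 for a single "flagged as AI" label and a macro F1 averaged across routing actions. The action names (`allow`, `review`, `block`), the toy labels, and the exact metric definitions are assumptions for illustration; the published methodology may define them differently.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention: a label with no true or predicted instances scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Binary flagging: was each item flagged as AI? (toy data)
true_flags = ["ai", "ai", "human", "human"]
pred_flags = ["ai", "human", "human", "human"]
binary_f1 = f1_for_label(true_flags, pred_flags, "ai")

# Routing actions: what should the workflow do with each item? (toy data)
true_actions = ["allow", "review", "block", "review"]
pred_actions = ["allow", "review", "review", "review"]
routing_f1 = macro_f1(true_actions, pred_actions)
```

On this toy data the binary F1 is about 0.67 and the routing macro F1 is 0.60. Macro averaging weights every action equally, so missing the single `block` item drags the score down even though three of the four routing decisions were correct.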

Fourth: 'where Veracity loses' stays on the page. Even if the final benchmark is favorable for us, the failure-mode slices stay published. English-first calibration. Image-screenshot-and-recompression weakness. Adversarial-sample weaknesses. The benchmark isn't a sales tool; it's an evaluation artifact. Hiding the weaknesses would defeat the point.

Fifth: frozen run manifest with versioned model identifiers. Every result is tied to the specific model version each vendor was running at run time. Six months from now, when the models have evolved, the results will be a snapshot of that specific moment — useful for trend analysis, not eternally valid.
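One possible shape for a run-manifest entry is sketched below. The field names, the `ExampleVendor` records, and the comparability rule are hypothetical; the point is only that each result carries the model version and corpus digest it was produced with.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """One vendor's result, pinned to a model version and a corpus digest."""
    vendor: str
    model_version: str
    corpus_digest: str
    run_date: str  # ISO 8601 date string
    binary_f1: float
    routing_macro_f1: float


def to_manifest_line(record):
    """Serialize a record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def comparable(a, b):
    """Treat two results as comparable only if they used the same corpus."""
    return a.corpus_digest == b.corpus_digest


# Hypothetical records with made-up values, for illustration only.
spring = RunRecord("ExampleVendor", "v1.0", "abc123", "2026-05-15", 0.50, 0.50)
autumn = RunRecord("ExampleVendor", "v2.0", "abc123", "2026-11-15", 0.50, 0.50)
line = to_manifest_line(spring)
```

Because the dataclass is frozen, a record cannot be edited after the run. A later model version gets its own record, which is what makes the trend analysis mentioned above possible.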

The benchmark status page at /evals/2026-benchmark shows the current state. As of this writing, the corpus design is frozen, the vendor terms review is in progress for four named competitors, and publication is planned for later in 2026, once the legal-clearance gate completes.

I know this is slower than publishing a sales-friendly number this quarter. I think the slower path is the only one that produces a benchmark anyone should believe. The category has had enough cycles of marketing-driven numbers; I'd rather contribute one careful artifact than five fast ones.

If you're evaluating VeracityAPI for production right now and the question 'how does it compare?' is blocking your decision, here's the version of an answer I can stand behind: VeracityAPI publishes its own routing-action F1 (0.871 macro F1 on a 500-item seed corpus) and the underlying JSONL artifacts. We don't publish competitor numbers yet. Use the published numbers, try the free tier, and judge against your actual workflow. That's the honest comparison the eventual benchmark will formalize.

Required caveat: VeracityAPI is a workflow-routing API, not forensic authorship proof. See /methodology for what we claim and don't claim.

About the author

Bernard Huang · Founder, VeracityAPI

Co-founded Clearscope and bootstrapped it to 7-figure ARR over 10 years of working with editorial and content teams at companies like Nvidia, HubSpot, Adobe, IBM, and Condé Nast. Now building VeracityAPI — content trust infrastructure for autonomous agent workflows.
