Benchmark proof · v0.2

We measure routing-action F1, not authorship certainty.

Benchmark v0.2 reports 0.871 macro F1 and 88.0% routing accuracy on a 500-item seed text corpus. VeracityAPI is judged on whether an agent routes content to the right next action: allow, revise, human_review, or reject.


Seed corpus

500

Balanced five-slice text corpus in data/evals/veracityapi_seed_corpus_500.jsonl.

Routing accuracy

88.0%

Agreement between predicted and expected actions on the seed v0.1 allow / revise / human_review labels (440 of 500 items).

Macro F1

0.871

Unweighted mean of per-action F1 across the three supported routing actions (allow, revise, human_review); reject has zero support in this seed set and is excluded.

Confusion matrix

Expected → Predicted	allow	revise	human_review
allow	190	10	0
revise	20	175	5
human_review	0	25	75
reject	0	0	0
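
The 88.0% routing accuracy can be rechecked from the matrix above. The following is a minimal sketch, not the contents of scripts/evals-summary.mjs; the names ACTIONS, MATRIX, and routing_accuracy are illustrative assumptions, and the counts are copied from the table.

```python
# Confusion matrix from the table above: each key is an expected action,
# each list holds predicted counts in the order allow, revise, human_review.
# The reject row is omitted because it has zero support in this seed set.
ACTIONS = ["allow", "revise", "human_review"]
MATRIX = {
    "allow": [190, 10, 0],
    "revise": [20, 175, 5],
    "human_review": [0, 25, 75],
}


def routing_accuracy(matrix, actions):
    """Share of items whose predicted action equals the expected action."""
    total = sum(sum(row) for row in matrix.values())
    # The diagonal cell for each action is the count of correct routings.
    correct = sum(matrix[action][i] for i, action in enumerate(actions))
    return correct / total


accuracy = routing_accuracy(MATRIX, ACTIONS)
print(f"{accuracy:.1%}")  # prints 88.0%
```

The diagonal sums to 190 + 175 + 75 = 440 correct routings out of 500 items.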

Artifacts: data/evals/veracityapi_seed_results_v0_1.json, data/evals/veracityapi_seed_metrics_v0_1.csv, and scripts/evals-summary.mjs. This is a transparent seed calibration asset, not a forensic benchmark certification.

Per-action metrics

Action	Precision	Recall	F1	Support
allow	0.905	0.950	0.927	200
revise	0.833	0.875	0.854	200
human_review	0.938	0.750	0.833	100
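
The per-action figures and the 0.871 macro F1 follow from the confusion matrix. The sketch below shows one way to recompute them; the function and variable names are illustrative assumptions, not the published script. It assumes, consistent with the reported number, that actions with zero support (reject here) are left out of the macro average.

```python
# Rows are expected actions, columns are predicted actions,
# both in the order allow, revise, human_review.
ACTIONS = ["allow", "revise", "human_review"]
MATRIX = [
    [190, 10, 0],
    [20, 175, 5],
    [0, 25, 75],
]


def per_action_metrics(matrix, actions):
    """Return precision, recall, F1, and support for each action."""
    metrics = {}
    for i, action in enumerate(actions):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        support = sum(matrix[i])                   # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[action] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return metrics


def macro_f1(metrics):
    """Unweighted mean of F1 over actions that have at least one expected item."""
    scores = [m["f1"] for m in metrics.values() if m["support"] > 0]
    return sum(scores) / len(scores)


metrics = per_action_metrics(MATRIX, ACTIONS)
macro = macro_f1(metrics)
print(f"{macro:.3f}")  # prints 0.871
```

Counting reject as a fourth action with an F1 of 0 would pull the average down to roughly 0.65, which is why the zero-support exclusion matters when comparing against other reports.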

Dataset slices

100 human firsthand samples
100 dry factual human samples
100 generic AI slop samples
100 polished AI-with-specifics samples
100 edge/mixed/adversarial samples

Benchmark v0.2 completed

Headline metric: 0.871 macro F1 on 500 text routing samples. The benchmark is text-first and intentionally limited: it measures routing-action F1, not detector-score accuracy or AI-authorship proof.

External comparators

GPTZero and Sapling comparator runs remain pending until API credentials, ToS, and artifact freezing are resolved. Pending means not run, not inferred.

What this proves

The API contract is optimized for agents: inspect evidence, follow recommended_action, and measure reviewer agreement. The next iteration should replace seed expected labels with blind human labels and add competitor outputs where keys are available.
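
As a rough illustration of that contract, the sketch below routes on recommended_action and tracks reviewer agreement. The response shape (a dict with recommended_action and evidence keys), the fallback to human_review for unknown values, and all function names are assumptions made for this example, not the documented VeracityAPI schema.

```python
# The four routing actions named in this benchmark.
VALID_ACTIONS = {"allow", "revise", "human_review", "reject"}


def next_step(response):
    """Pick the next workflow step from a (hypothetical) API response dict.

    Assumption: the response exposes a "recommended_action" key. Missing or
    unrecognized values escalate to human_review, a conservative local
    policy chosen for this sketch.
    """
    action = response.get("recommended_action")
    if action not in VALID_ACTIONS:
        return "human_review"
    return action


def reviewer_agreement(pairs):
    """Share of (recommended_action, reviewer_action) pairs that match."""
    if not pairs:
        return 0.0
    matches = sum(1 for recommended, reviewed in pairs if recommended == reviewed)
    return matches / len(pairs)


# Hypothetical response used only to exercise the sketch.
example = {"recommended_action": "revise", "evidence": ["generic phrasing"]}
step = next_step(example)

# Hypothetical log of API recommendations versus human reviewer decisions.
log = [("allow", "allow"), ("revise", "allow"), ("human_review", "human_review")]
agreement = reviewer_agreement(log)
```

Logging each recommendation next to the reviewer's final decision yields the kind of blind human labels that the next iteration calls for.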

Known limits

Seed labels are for workflow routing, not truth or authorship adjudication.
Reject needs a dedicated abuse/spam corpus before reporting a meaningful reject F1.
Image, audio, and video need separate labeled corpora before publishing comparable metrics.
External competitor numbers should not be published until credentials, ToS, and artifact freezes are resolved.
Scores should be paired with local policy and human escalation for high-stakes workflows.
