Benchmark proof · v0.2

We measure routing-action F1, not authorship certainty.

Benchmark v0.2 reports 0.871 macro F1 and 88.0% routing accuracy on a 500-item seed text corpus. VeracityAPI is judged on whether an agent routes content to the right next action: allow, revise, human_review, or reject.
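
Concretely, each seed item carries an expected routing action and a run is scored on whether the agent's chosen action matches it. A minimal sketch of that check, assuming a hypothetical item shape (the field name expected_action is illustrative, not the seed-corpus schema):

    // The four routing actions named by the benchmark.
    type RoutingAction = "allow" | "revise" | "human_review" | "reject";

    // Hypothetical item shape; field names are assumptions, not the seed-corpus schema.
    interface SeedItem {
      text: string;
      expected_action: RoutingAction;
    }

    // Routing accuracy is plain action agreement: the agent's action equals the expected one.
    function routedCorrectly(item: SeedItem, predicted: RoutingAction): boolean {
      return predicted === item.expected_action;
    }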

Seed corpus
500

Balanced five-slice text corpus in data/evals/veracityapi_seed_corpus_500.jsonl.

Routing accuracy
88.0%

Seed v0.1 action agreement on allow / revise / human_review labels.

Macro F1
0.871

Macro F1 across supported routing actions; reject has zero support in this seed set.
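
Given the per-action scores listed further down, the headline number is the unweighted mean over the three supported actions: (0.927 + 0.854 + 0.833) / 3 ≈ 0.871.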

Confusion matrix

Expected → Predicted    allow   revise   human_review   reject
allow                     190       10              0        0
revise                     20      175              5        0
human_review                0       25             75        0
reject                      0        0              0        0

Artifacts: data/evals/veracityapi_seed_results_v0_1.json, data/evals/veracityapi_seed_metrics_v0_1.csv, and scripts/evals-summary.mjs. This is a transparent seed calibration asset, not a forensic benchmark certification.
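
The summary numbers can be recomputed from the confusion matrix above. The sketch below is an illustrative TypeScript reimplementation of that arithmetic, not a transcript of scripts/evals-summary.mjs:

    type RoutingAction = "allow" | "revise" | "human_review" | "reject";
    const actions: RoutingAction[] = ["allow", "revise", "human_review", "reject"];

    // Confusion matrix from the table above: rows are expected labels, columns are predictions.
    const matrix: Record<RoutingAction, Record<RoutingAction, number>> = {
      allow:        { allow: 190, revise: 10,  human_review: 0,  reject: 0 },
      revise:       { allow: 20,  revise: 175, human_review: 5,  reject: 0 },
      human_review: { allow: 0,   revise: 25,  human_review: 75, reject: 0 },
      reject:       { allow: 0,   revise: 0,   human_review: 0,  reject: 0 },
    };

    // Per-action precision, recall, and F1; reject (zero support) is skipped, as in the report.
    const f1Scores: number[] = [];
    for (const action of actions) {
      const support = actions.reduce((sum, predicted) => sum + matrix[action][predicted], 0);
      if (support === 0) continue;
      const truePositives = matrix[action][action];
      const predictedTotal = actions.reduce((sum, expected) => sum + matrix[expected][action], 0);
      const precision = predictedTotal > 0 ? truePositives / predictedTotal : 0;
      const recall = truePositives / support;
      const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
      f1Scores.push(f1);
      console.log(action, precision.toFixed(3), recall.toFixed(3), f1.toFixed(3), support);
    }

    // Macro F1 over supported actions and routing accuracy over all 500 items.
    const macroF1 = f1Scores.reduce((a, b) => a + b, 0) / f1Scores.length;   // 0.871
    const correct = actions.reduce((sum, a) => sum + matrix[a][a], 0);       // 440
    const total = actions
      .flatMap((expected) => actions.map((predicted) => matrix[expected][predicted]))
      .reduce((a, b) => a + b, 0);                                           // 500
    console.log("macro F1", macroF1.toFixed(3), "accuracy", (correct / total).toFixed(3));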

Per-action metrics

Action          Precision   Recall      F1   Support
allow               0.905    0.950   0.927       200
revise              0.833    0.875   0.854       200
human_review        0.938    0.750   0.833       100

Dataset slices

  • 100 human firsthand samples
  • 100 dry factual human samples
  • 100 generic AI slop samples
  • 100 polished AI-with-specifics samples
  • 100 edge/mixed/adversarial samples
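
Each slice contributes 100 items to the seed corpus JSONL. A hypothetical record shape is sketched below; the field names and slice identifiers are illustrative assumptions, not the documented schema of data/evals/veracityapi_seed_corpus_500.jsonl:

    // Illustrative only: field names and slice identifiers are assumed, not taken from the artifact.
    interface SeedCorpusRecord {
      id: string;
      slice:
        | "human_firsthand"
        | "dry_factual_human"
        | "generic_ai_slop"
        | "polished_ai_with_specifics"
        | "edge_mixed_adversarial";
      text: string;
      expected_action: "allow" | "revise" | "human_review" | "reject";
    }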

Benchmark v0.2 completed

Headline metric: 0.871 macro F1 on 500 text routing samples. The benchmark is text-first and intentionally limited: it measures routing-action F1, not detector-score accuracy; routing-action F1, not AI-authorship proof.

External comparators

GPTZero and Sapling comparator runs remain pending until API credentials, ToS, and artifact freezing are resolved. Pending means not run, not inferred.

What this proves

The API contract is optimized for agents: inspect evidence, follow recommended_action, and measure reviewer agreement. The next iteration should replace seed expected labels with blind human labels and add competitor outputs where keys are available.
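
As an illustration of that contract, the sketch below shows an agent branching on a hypothetical response shape. Only the names recommended_action and evidence come from this page; the rest of the shape and the handler functions are assumptions:

    type RoutingAction = "allow" | "revise" | "human_review" | "reject";

    // Hypothetical response shape; only recommended_action and evidence are named on this page.
    interface VeracityVerdict {
      recommended_action: RoutingAction;
      evidence: string[];
    }

    // Stand-in handlers for whatever the agent actually does downstream.
    const publish = (content: string) => console.log("published", content.length);
    const requestRevision = (content: string, evidence: string[]) => console.log("revise", evidence);
    const escalate = (content: string, evidence: string[]) => console.log("human review", evidence);

    // The agent inspects the evidence and follows the recommended action.
    function route(content: string, verdict: VeracityVerdict): RoutingAction {
      switch (verdict.recommended_action) {
        case "allow":
          publish(content);
          break;
        case "revise":
          requestRevision(content, verdict.evidence);
          break;
        case "human_review":
        case "reject":
          escalate(content, verdict.evidence);
          break;
      }
      // The returned action can later be compared against reviewer labels to measure agreement.
      return verdict.recommended_action;
    }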

Known limits