Benchmark proof · v0.1.0

We measure routing-action F1, not authorship certainty.

VeracityAPI is judged on whether an agent routes content to the right next action: allow, revise, human_review, or reject. The first published seed benchmark contains 500 labeled text samples across firsthand human writing, dry factual human writing, generic AI slop, polished AI-with-specifics, and edge/adversarial mixed cases.


Seed corpus
500

Balanced five-slice text corpus in data/evals/veracityapi_seed_corpus_500.jsonl.

Routing accuracy
88.0%

Seed v0.1 action agreement on allow / revise / human_review labels.

Macro F1
0.871

Macro F1 across supported routing actions; reject has zero support in this seed set.

Confusion matrix

Expected → Predicted   allow   revise   human_review   reject
allow                    190       10              0        0
revise                    20      175              5        0
human_review               0       25             75        0
reject                     0        0              0        0

Artifacts: data/evals/veracityapi_seed_results_v0_1.json, data/evals/veracityapi_seed_metrics_v0_1.csv, and scripts/evals-summary.mjs. This is a transparent seed calibration asset, not a forensic benchmark certification.

Per-action metrics

Action         Precision   Recall   F1      Support
allow          0.905       0.950    0.927   200
revise         0.833       0.875    0.854   200
human_review   0.938       0.750    0.833   100
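
The per-action figures follow directly from the confusion matrix above. The sketch below is one way to recompute them in Python; it is an illustration only and is not the project's scripts/evals-summary.mjs. The function names are assumptions chosen for this example, and actions with zero support (here, reject) are excluded from the macro average, matching the note under Macro F1.

```python
# Confusion matrix copied from the seed v0.1 results: rows are expected
# actions, columns are predicted actions, both in ACTIONS order.
ACTIONS = ["allow", "revise", "human_review", "reject"]
MATRIX = [
    [190, 10, 0, 0],   # expected allow
    [20, 175, 5, 0],   # expected revise
    [0, 25, 75, 0],    # expected human_review
    [0, 0, 0, 0],      # expected reject (zero support in this seed set)
]


def per_action_metrics(matrix, actions):
    """Return {action: (precision, recall, f1, support)} for supported actions."""
    metrics = {}
    for i, action in enumerate(actions):
        support = sum(matrix[i])                    # row total: expected count
        if support == 0:
            continue                                # skip actions with no expected samples
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column total: predicted count
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support
        f1 = 2 * true_pos / (predicted + support)   # harmonic mean of precision and recall
        metrics[action] = (precision, recall, f1, support)
    return metrics


def routing_accuracy(matrix):
    """Share of samples whose predicted action equals the expected action."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


metrics = per_action_metrics(MATRIX, ACTIONS)
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
accuracy = routing_accuracy(MATRIX)

for action, (p, r, f1, n) in metrics.items():
    print(f"{action:<13} P={p:.3f} R={r:.3f} F1={f1:.3f} support={n}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Averaging only over supported actions keeps an empty reject row from dragging the macro score toward zero; a future corpus with reject samples would include that action automatically.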

Dataset slices

  • 100 human firsthand samples
  • 100 dry factual human samples
  • 100 generic AI slop samples
  • 100 polished AI-with-specifics samples
  • 100 edge/mixed/adversarial samples
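
A balance check over the corpus file could look like the sketch below. It assumes each JSONL record carries a slice label under a key named `slice`; that key and the label values in the demo are hypothetical, since the record schema is not shown here.

```python
import json
from collections import Counter


def count_slices(lines, field="slice"):
    """Count records per slice from an iterable of JSONL lines.

    `field` is a hypothetical key name; adjust it to the real schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        record = json.loads(line)
        counts[record.get(field, "<missing>")] += 1
    return counts


def is_balanced(counts, expected_slices=5, expected_per_slice=100):
    """True when there are exactly five slices with 100 samples each."""
    return (
        len(counts) == expected_slices
        and all(n == expected_per_slice for n in counts.values())
    )


# Tiny in-memory demo with made-up labels (not real corpus records).
demo_lines = [
    json.dumps({"slice": "human_firsthand", "text": "..."}),
    json.dumps({"slice": "human_firsthand", "text": "..."}),
    json.dumps({"slice": "generic_ai_slop", "text": "..."}),
]
demo_counts = count_slices(demo_lines)
print(demo_counts, is_balanced(demo_counts))
```

To run it against the real corpus, pass an open file handle for data/evals/veracityapi_seed_corpus_500.jsonl as `lines`.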

Comparison scaffold

External comparison pending. Comparator columns are planned for GPTZero, Sapling, and an LLM judge. The metric will be routing-action F1, not detector-score accuracy.

External comparators

GPTZero, Sapling, and GPT-4o judge adapters are documented as planned comparator slots. They were not run for this release because no comparator API keys were available in the local environment. The public claim remains routing-action F1, not proof of AI authorship: VeracityAPI reports workflow routing quality, not binary authorship evidence.

What this proves

The API contract is optimized for agents: inspect evidence, follow recommended_action, and measure reviewer agreement. The next iteration should replace seed expected labels with blind human labels and add competitor outputs where keys are available.
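
The agent loop described above (inspect evidence, follow recommended_action, measure reviewer agreement) can be sketched as follows. This is a minimal illustration, not the VeracityAPI client: the response is modeled as a plain dict, the `evidence` key and its contents are assumed, and falling back to human_review on an unknown action is a design choice made for this example rather than documented API behavior.

```python
# The four routing actions named on this page.
ACTIONS = ("allow", "revise", "human_review", "reject")


def next_action(response, fallback="human_review"):
    """Pick the next action from a response dict.

    Assumption: the response exposes `recommended_action` as a string.
    Unknown or missing values go to `fallback` (an illustrative choice).
    """
    action = response.get("recommended_action")
    return action if action in ACTIONS else fallback


def reviewer_agreement(decisions):
    """Fraction of (api_action, reviewer_action) pairs that match."""
    if not decisions:
        return 0.0
    matches = sum(1 for api, reviewer in decisions if api == reviewer)
    return matches / len(decisions)


# Hypothetical response shape, for illustration only.
sample_response = {
    "recommended_action": "revise",
    "evidence": ["generic phrasing with no concrete details"],
}

print(next_action(sample_response))
print(reviewer_agreement([("allow", "allow"), ("revise", "allow")]))
```

Logging each pair of API action and reviewer action is what would make the planned switch from seed labels to blind human labels measurable.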

ESL / non-native English bias eval status

The next bias slice is a real human-writing benchmark for non-native English writers. It will not use synthetic LLM samples as a stand-in for ESL writing. Public false-positive rates stay unpublished until corpus licensing, speaker/writer metadata, native-language labels, and analysis artifacts are frozen.

Protocol scaffold: data/evals/esl-human-writing/README.md.

Known limits