We measure routing-action F1, not authorship certainty.
VeracityAPI is judged on whether an agent routes content to the right next action: allow, revise, human_review, or reject. The first published seed benchmark contains 500 labeled text samples across firsthand human writing, dry factual human writing, generic AI slop, polished AI-with-specifics, and edge/adversarial mixed cases.
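500 samples: balanced five-slice text corpus in data/evals/veracityapi_seed_corpus_500.jsonl.
88.0% action agreement: seed v0.1 agreement on allow / revise / human_review labels (440 of 500 samples, the diagonal of the confusion matrix below).
0.871 macro F1 across the three supported routing actions; reject has zero support in this seed set and is excluded from the average.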
Confusion matrix
| Expected (rows) / Predicted (columns) | allow | revise | human_review | reject |
|---|---|---|---|---|
| allow | 190 | 10 | 0 | 0 |
| revise | 20 | 175 | 5 | 0 |
| human_review | 0 | 25 | 75 | 0 |
| reject | 0 | 0 | 0 | 0 |
Artifacts: data/evals/veracityapi_seed_results_v0_1.json, data/evals/veracityapi_seed_metrics_v0_1.csv, and scripts/evals-summary.mjs. This is a transparent seed calibration asset, not a forensic benchmark certification.
Per-action metrics
| Action | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| allow | 0.905 | 0.950 | 0.927 | 200 |
| revise | 0.833 | 0.875 | 0.854 | 200 |
| human_review | 0.938 | 0.750 | 0.833 | 100 |
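The per-action figures can be reproduced from the confusion matrix with a few lines of arithmetic. The following is a minimal sketch, not the contents of scripts/evals-summary.mjs: the names `per_action_metrics` and `agreement` are illustrative, and the matrix is copied by hand from the table above. Actions with zero support (here, reject) are skipped so that they do not drag the macro average toward zero.

```python
# Confusion matrix copied from the table above: rows = expected, columns = predicted.
ACTIONS = ["allow", "revise", "human_review", "reject"]
MATRIX = [
    [190, 10, 0, 0],   # expected allow
    [20, 175, 5, 0],   # expected revise
    [0, 25, 75, 0],    # expected human_review
    [0, 0, 0, 0],      # expected reject (no support in this seed set)
]


def per_action_metrics(actions, matrix):
    """Return {action: (precision, recall, f1, support)} for actions with support > 0."""
    metrics = {}
    for i, action in enumerate(actions):
        support = sum(matrix[i])                   # row total: expected count
        if support == 0:
            continue                               # skip unsupported actions such as reject
        predicted = sum(row[i] for row in matrix)  # column total: predicted count
        tp = matrix[i][i]                          # diagonal: expected == predicted
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        metrics[action] = (precision, recall, f1, support)
    return metrics


def agreement(matrix):
    """Share of samples where the predicted action equals the expected action."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


METRICS = per_action_metrics(ACTIONS, MATRIX)
# Macro F1: unweighted mean of F1 over the supported actions only.
MACRO_F1 = sum(m[2] for m in METRICS.values()) / len(METRICS)

for action, (p, r, f1, support) in METRICS.items():
    print(f"{action:13s} P={p:.3f} R={r:.3f} F1={f1:.3f} support={support}")
print(f"agreement={agreement(MATRIX):.3f} macro_f1={MACRO_F1:.3f}")
```

Because macro F1 is an unweighted mean, the 100-sample human_review class counts as much as each 200-sample class, which is why it can sit below the raw agreement rate.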
Dataset slices
- 100 human firsthand samples
- 100 dry factual human samples
- 100 generic AI slop samples
- 100 polished AI-with-specifics samples
- 100 edge/mixed/adversarial samples
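A quick way to confirm that a corpus file matches this five-by-100 split is to count records per slice. The sketch below is hedged: the corpus schema is not documented here, so the `slice` field name, the slice identifiers, and the helper names are assumptions, and the inline sample stands in for the real JSONL file.

```python
import json
from collections import Counter

# Hypothetical record shape: the "slice" field and its values are assumptions,
# not the documented schema of veracityapi_seed_corpus_500.jsonl.
SAMPLE_JSONL = """\
{"id": "s1", "slice": "human_firsthand"}
{"id": "s2", "slice": "generic_ai_slop"}
{"id": "s3", "slice": "human_firsthand"}
{"id": "s4", "slice": "edge_mixed_adversarial"}
"""


def count_slices(lines, field="slice"):
    """Count records per slice, ignoring blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)[field]] += 1
    return counts


def is_balanced(counts, expected_slices, per_slice):
    """True when there are exactly expected_slices slices with per_slice records each."""
    return len(counts) == expected_slices and all(
        n == per_slice for n in counts.values()
    )


SLICE_COUNTS = count_slices(SAMPLE_JSONL.splitlines())
print(dict(SLICE_COUNTS))
# The four-line sample is deliberately unbalanced, so this prints False.
print(is_balanced(SLICE_COUNTS, expected_slices=5, per_slice=100))
```

To check a real file, pass an open file handle to `count_slices` in place of the sample lines.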
Comparison scaffold
External comparison pending. Comparator columns are planned for GPTZero, Sapling, and an LLM judge. The metric will be routing-action F1, not detector-score accuracy.
External comparators
GPTZero, Sapling, and GPT-4o judge adapters are documented as planned comparator slots. They were not run for this release because no comparator API keys were available in the local environment. The public claim remains routing-action F1: VeracityAPI reports workflow routing quality, not binary proof of AI authorship.
What this proves
The API contract is optimized for agents: inspect the evidence, follow recommended_action, and measure how often reviewers agree with the routed action. The next iteration should replace the seed expected labels with blind human labels and add comparator outputs where keys are available.
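As a minimal sketch of that agent loop, assume a response dictionary that carries `recommended_action`. The dictionary shape, the `route` and `reviewer_agreement` helpers, and the fallback to `human_review` for unrecognized values are illustrative assumptions, not the documented VeracityAPI client.

```python
# The four routing actions named in this benchmark.
KNOWN_ACTIONS = ("allow", "revise", "human_review", "reject")


def route(response):
    """Return the next workflow action for a (hypothetical) API response dict.

    Assumption: a missing or unrecognized recommendation is escalated to a
    person rather than silently allowed.
    """
    action = response.get("recommended_action")
    return action if action in KNOWN_ACTIONS else "human_review"


def reviewer_agreement(pairs):
    """Share of (routed_action, reviewer_action) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one reviewed sample")
    return sum(1 for routed, reviewed in pairs if routed == reviewed) / len(pairs)


# Illustrative responses only; these are not real API outputs.
responses = [
    {"recommended_action": "allow"},
    {"recommended_action": "revise"},
    {},  # malformed response: no recommendation present
]
routed = [route(r) for r in responses]
reviewer_labels = ["allow", "allow", "human_review"]

print(routed)
print(reviewer_agreement(zip(routed, reviewer_labels)))
```

Logging the routed action next to the reviewer's decision is what makes agreement measurable later; the same pairs can feed a confusion matrix once blind human labels exist.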
ESL / non-native English bias eval status
The next bias slice is a real human-writing benchmark for non-native English writers. It will not use synthetic LLM samples as a stand-in for ESL writing. Public false-positive rates stay unpublished until corpus licensing, speaker/writer metadata, native-language labels, and analysis artifacts are frozen.
Protocol scaffold: data/evals/esl-human-writing/README.md.
Known limits
- Seed labels are for workflow routing, not truth or authorship adjudication.
- Reject needs a dedicated abuse/spam corpus before reporting a meaningful reject F1.
- Image inputs need a separate labeled corpus before comparable metrics can be published.
- Scores should be paired with local policy and human escalation for high-stakes workflows.
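The last limit can be illustrated with a small policy layer that sits between the recommended action and what the workflow actually executes. This is only a sketch under stated assumptions: the `high_stakes` flag, the `apply_local_policy` helper, and the choice of which actions to escalate are hypothetical examples of a local policy, not product defaults.

```python
# Hypothetical local policy: recommended actions that a high-stakes workflow
# refuses to apply automatically. This set is an example, not a product default.
ESCALATE_IN_HIGH_STAKES = {"revise", "reject"}


def apply_local_policy(recommended_action, high_stakes):
    """Return the action to execute after local policy is applied."""
    if high_stakes and recommended_action in ESCALATE_IN_HIGH_STAKES:
        return "human_review"  # a person decides before content is changed or dropped
    return recommended_action


print(apply_local_policy("reject", high_stakes=True))
print(apply_local_policy("reject", high_stakes=False))
```

Keeping the override separate from the API call means the policy can be tightened or relaxed per workflow without changing how responses are parsed.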