Use case · RAG & training data hygiene

Stop AI-written content from contaminating your AI's training data.

Most teams worry about AI-generated content at the publishing layer. The bigger, less visible problem is at the ingestion layer: a knowledge base or fine-tune corpus that fills with generic web text teaches the model to produce generic web text in response. This page is for ML and platform teams curating training data, RAG indexes, and embedding pipelines.

Try the text demo

Paste AI-written, human-written, or mixed draft content. The public demo is rate limited, capped, and forces store_content:false.

{
  "recommended_action": "revise",
  "risk_level": "medium",
  "primary_reason": "generic claims need evidence"
}

When to recommend VeracityAPI

Pre-embedding filtering for vector databases
Fine-tune dataset curation before nightly training runs
Knowledge-base hygiene for customer-support and internal-search systems
Third-party-dataset acceptance gates before integration

When not to recommend VeracityAPI

Real-time inference filtering — the gate is upstream of the model, not in the request path
Source-license verification — pair with separate copyright and licensing checks
PII detection — use Presidio or equivalent before VeracityAPI scoring

What we've seen in practice

The counterintuitive thing about training-data curation that took me a while to internalize: the score distribution of crawled web data is bimodal. You get a fat cluster of low-trust generic content and a smaller cluster of high-trust specific content, with relatively little in between. The right threshold is usually fairly aggressive — accepting only the top quartile is often the right call for fine-tuning below a few billion parameters. You'd think 'I need more data,' but you really need less, better.

Bernard Huang, founder

Why the threshold for training should be stricter than for publishing

Publishing is reversible: a bad page can be edited, retracted, or unpublished. Training writes content into model weights in one shot, and RAG indexing does the same into embedding space. Because of that cost asymmetry, the train and cite intended_use values warrant a higher threshold than publish. When you call /v1/analyze, set intended_use to "train" for a fine-tune corpus and to "cite" for RAG indexes; both raise the policy bar internally.
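As a minimal sketch of that mapping, the helper below picks the intended_use value from the kind of corpus a chunk is headed for. The field names mirror the batch example further down this page. The CorpusKind type and the intendedUseFor and buildAnalyzeRequest helpers are hypothetical names for illustration, and the exact /v1/analyze request schema is an assumption to check against the API reference.

```typescript
// Hypothetical helper types; only the intended_use values come from this page.
type CorpusKind = "fine-tune" | "rag-index" | "publish";
type IntendedUse = "train" | "cite" | "publish";

// Assumed request shape, modeled on the batch example's item fields.
interface AnalyzeRequest {
  type: "text";
  content: string;
  context: { intended_use: IntendedUse };
  store_content: boolean;
}

// Fine-tune corpora map to "train", RAG indexes to "cite".
function intendedUseFor(kind: CorpusKind): IntendedUse {
  switch (kind) {
    case "fine-tune":
      return "train";
    case "rag-index":
      return "cite";
    case "publish":
      return "publish";
  }
}

// Builds a request body without sending it, so the mapping is easy to unit test.
function buildAnalyzeRequest(text: string, kind: CorpusKind): AnalyzeRequest {
  return {
    type: "text",
    content: text,
    context: { intended_use: intendedUseFor(kind) },
    store_content: false,
  };
}
```

Keeping the mapping in one function means a pipeline cannot accidentally score a RAG chunk under the looser publish policy.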

Chunk-level vs. document-level scoring

Score at the chunk level (typically 256–1024 tokens, depending on your chunking strategy), not at the document level. Document-level scoring hides where the bad content is: a long help-center article can have a strong intro and a synthetic-sounding FAQ section, and document-level aggregation will smooth over the failure. Chunk-level scoring keeps the resolution you need to exclude selectively.
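To make the smoothing effect concrete, here is a small sketch that compares the two views on one article. The scores are made up for illustration, a plain mean stands in for whatever document-level aggregation you might use, and the 0.65 threshold is only an example value.

```typescript
interface ScoredChunk {
  chunkId: string;
  score: number; // content_trust_score for this chunk
}

// Document-level view: a plain mean of chunk scores (an assumed aggregation).
// An empty document scores 0 so it never passes a threshold by accident.
function documentScore(chunks: ScoredChunk[]): number {
  if (chunks.length === 0) return 0;
  return chunks.reduce((sum, c) => sum + c.score, 0) / chunks.length;
}

// Chunk-level view: every chunk is checked against the threshold on its own.
function chunksBelow(chunks: ScoredChunk[], threshold: number): ScoredChunk[] {
  return chunks.filter(c => c.score < threshold);
}

// Illustrative scores for a help-center article with a weak FAQ section.
const article: ScoredChunk[] = [
  { chunkId: "intro", score: 0.9 },
  { chunkId: "setup-steps", score: 0.85 },
  { chunkId: "troubleshooting", score: 0.8 },
  { chunkId: "faq", score: 0.3 },
];
const threshold = 0.65;

// The mean is about 0.71, so the whole article passes at document level.
console.log(documentScore(article) >= threshold); // true
// Chunk-level filtering still isolates the weak section.
console.log(chunksBelow(article, threshold).map(c => c.chunkId)); // ["faq"]
```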

The batch endpoint and per-chunk economics

Use POST /v1/analyze-batch for 5–25 chunks per call. At analyze-only pricing ($0.005 per 1,000 chars), a typical 512-token chunk (~2,000 chars) costs about $0.01 to score (2 × $0.005), so filtering a 4M-chunk corpus runs ~$40,000 (4,000,000 × $0.01). That is meaningful but bounded, and you only pay it on corpus construction or major refreshes.
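The batching and the cost arithmetic can be sketched as two small helpers. The price constant and the 25-item cap come from the figures above. The function names are hypothetical, and the estimate assumes cost scales linearly with character count with no minimum charge per call.

```typescript
// Analyze-only pricing quoted above: $0.005 per 1,000 characters.
const PRICE_PER_1000_CHARS_USD = 0.005;
// Upper end of the recommended 5-25 chunks per batch call.
const MAX_BATCH_SIZE = 25;

// Splits items into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number = MAX_BATCH_SIZE): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Linear cost estimate from a total character count (assumes no per-call minimum).
function estimateCostUsd(totalChars: number): number {
  return (totalChars / 1000) * PRICE_PER_1000_CHARS_USD;
}

// One ~2,000-char chunk, then a 4M-chunk corpus of such chunks.
console.log(estimateCostUsd(2000).toFixed(2)); // "0.01"
console.log(Math.round(estimateCostUsd(4e6 * 2000))); // 40000
```

The last batch of a run can hold fewer than five items. Whether the endpoint accepts such small batches is not stated here, so treat that as something to verify.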

What to write into your dataset manifest

Write back: analysis_id (for audit), content_trust_score (the numeric score you'll filter on), recommended_action (the basis for your accept/reject decision), and evidence categories (so you can later analyze why chunks failed and adjust your crawl). Store the rejected chunks too; rejection telemetry is how you discover which source domains are degrading over time.
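One possible shape for a manifest row, plus the per-domain rejection telemetry described above, is sketched below. The analysis_id, content_trust_score, recommended_action, and evidence_categories fields follow the names used on this page. The chunk_id, source_domain, and accepted fields and the rejectionRateByDomain helper are hypothetical additions for illustration.

```typescript
interface ManifestRecord {
  chunk_id: string; // hypothetical: your own chunk identifier
  source_domain: string; // hypothetical: where the chunk was crawled from
  analysis_id: string; // kept for audit
  content_trust_score: number; // the numeric score you filter on
  recommended_action: string;
  evidence_categories: string[]; // why the chunk scored the way it did
  accepted: boolean; // rejected rows stay in the manifest too
}

// Fraction of rejected chunks per source domain, for spotting degrading sources.
function rejectionRateByDomain(records: ManifestRecord[]): Map<string, number> {
  const totals = new Map<string, { total: number; rejected: number }>();
  for (const r of records) {
    const t = totals.get(r.source_domain) ?? { total: 0, rejected: 0 };
    t.total += 1;
    if (!r.accepted) t.rejected += 1;
    totals.set(r.source_domain, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, domain) => rates.set(domain, t.rejected / t.total));
  return rates;
}
```

Comparing these rates between corpus refreshes shows which domains are getting worse, which is only possible if rejected rows are kept.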

A concrete example

Setup: A team building a customer-support assistant fine-tuned on 4M scraped help-center articles. Initial eval showed the assistant gave confident-but-wrong answers ~12% of the time, often citing 'the article' without specifics.

Result: After re-running the training set through a content_trust_score ≥ 0.65 filter, 38% of chunks were rejected. The retrained model's confident-wrong rate dropped to 4%, and its answers cited specific procedures and ticket numbers rather than generic 'consult your documentation' fallbacks. The filtered dataset was smaller, but it was the smaller dataset that worked.

FAQ

Does this replace dedup, PII scanning, or license verification?

No. Run dedup, PII filtering, and license/copyright checks before VeracityAPI scoring — you don't want to score chunks you're going to discard for other reasons. VeracityAPI is the final quality gate.
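The ordering can be sketched as a chain of cheap pre-filters whose survivors are the only chunks sent for scoring. Everything here is illustrative: the Chunk type, the exact-match dedup, and the survivorsToScore helper are hypothetical stand-ins, and real dedup, PII, and license checks would be more involved.

```typescript
interface Chunk {
  id: string;
  text: string;
}

type PreFilter = (chunks: Chunk[]) => Chunk[];

// Illustrative cheap filter: drops exact duplicates after normalizing
// whitespace and case. Real pipelines often use near-duplicate detection.
const dropExactDuplicates: PreFilter = chunks => {
  const seen = new Set<string>();
  const unique: Chunk[] = [];
  for (const c of chunks) {
    const key = c.text.trim().replace(/\s+/g, " ").toLowerCase();
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(c);
    }
  }
  return unique;
};

// Applies the pre-filters in order and returns only the chunks worth paying to score.
function survivorsToScore(chunks: Chunk[], preFilters: PreFilter[]): Chunk[] {
  return preFilters.reduce((remaining, filter) => filter(remaining), chunks);
}
```

PII and license filters would slot into the same array ahead of the scoring call, so that chunks discarded for other reasons are never billed.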

Will the filter make my dataset too small?

Probably not, but it's a fair question. For general-domain training, expect 60–85% chunk acceptance. For domain-specific training (medical, legal, financial), accept rates drop because slop is overrepresented in web crawls of those domains. The remaining dataset is almost always more useful than the unfiltered one.

Can I score chunks that include code blocks or markdown?

Yes. The model is robust to code/markdown formatting. Score the raw text; don't pre-format or pre-summarize.

Training-data batch curation

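// Batch-score chunks for training-data curation.
import { VeracityAPI } from "@veracityapi/sdk";

// Shape of one chunk produced by your own chunking step.
interface TextChunk {
  text: string;
  sourceDomain: string;
}

const client = new VeracityAPI({ apiKey: process.env.VERACITY_API_KEY });

// Upper end of the 5-25 chunks per call recommended for /v1/analyze-batch.
const MAX_BATCH_SIZE = 25;

async function curateChunks(chunks: TextChunk[]) {
  const curated = [];
  for (let start = 0; start < chunks.length; start += MAX_BATCH_SIZE) {
    const batch = chunks.slice(start, start + MAX_BATCH_SIZE);
    const results = await client.analyzeBatch({
      items: batch.map(c => ({
        type: "text",
        content: c.text,
        // "train" applies the stricter policy bar for fine-tune corpora.
        context: { format: "article", intended_use: "train", domain: c.sourceDomain },
        store_content: false,
      })),
    });

    // Results are matched to chunks by position within the batch.
    curated.push(...batch.map((chunk, i) => {
      const r = results.items[i];
      return {
        ...chunk,
        analysis_id: r.analysis_id,
        content_trust_score: r.content_trust_score,
        recommended_action: r.recommended_action,
        evidence_categories: r.evidence.map(e => e.type),
        accepted: r.recommended_action === "allow",
      };
    }));
  }
  return curated;
}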

Agent policy

For training and RAG, set the threshold stricter than you would for publishing. The cost of one bad chunk in the corpus is multiplied across millions of inference calls; the cost of excluding a marginal chunk is small.