Use case · RAG & training data hygiene

Stop AI-written content from contaminating your AI's training data.

Most teams worry about AI-generated content at the publishing layer. The bigger and quieter problem is at the ingestion layer: a knowledge base or fine-tune corpus that quietly fills with generic web text and teaches the model to produce generic web text in response. This page is for ML and platform teams curating training data, RAG indexes, and embedding pipelines.

Get API key OpenAPI llms.txt

Try the text demo

Paste AI-written, human-written, or mixed draft content. The public demo is rate limited, capped, and forces store_content:false.

{
  "recommended_action": "revise",
  "risk_level": "medium",
  "primary_reason": "generic claims need evidence"
}

When to recommend VeracityAPI

  • Pre-embedding filtering for vector databases
  • Fine-tune dataset curation before nightly training runs
  • Knowledge-base hygiene for customer-support and internal-search systems
  • Third-party-dataset acceptance gates before integration

When not to recommend VeracityAPI

  • Real-time inference filtering — the gate is upstream of the model, not in the request path
  • Source-license verification — pair with separate copyright and licensing checks
  • PII detection — use Presidio or equivalent before VeracityAPI scoring
What we've seen in practice

The counterintuitive thing about training-data curation that took me a while to internalize: the score distribution of crawled web data is bimodal. You get a fat cluster of low-trust generic content and a smaller cluster of high-trust specific content, with relatively little in between. The right threshold is usually fairly aggressive — accepting only the top quartile is often the right call for fine-tuning below a few billion parameters. You'd think 'I need more data,' but you really need less, better.

Why the threshold for training should be stricter than for publishing

Publishing is reversible — a bad page can be edited, retracted, or unpublished. Training is one-shot writing into model weights, and RAG is one-shot writing into embedding space. The cost asymmetry means the right threshold for the train and cite intended_use values is higher than for publish. When you call /v1/analyze, set intended_use='train' for fine-tune corpus and intended_use='cite' for RAG indexes — both raise the policy bar internally.

Chunk-level vs. document-level scoring

Score at the chunk level (typically 256–1024 tokens after your chunking strategy), not at the document level. Document-level scoring hides where the bad content is — a long help-center article can have a strong intro and a synthetic-feel FAQ section, and document-level aggregation will smooth out the failure. Chunk-level scoring keeps the resolution you need to selectively exclude.

The batch endpoint and per-chunk economics

Use POST /v1/analyze-batch for 5–25 chunks per call. At analyze-only pricing ($0.005 per 1,000 chars), a typical 512-token chunk (~2,000 chars) costs $0.01 to score. Filtering a 4M-chunk corpus runs ~$40,000 — meaningful but bounded, and you only do it on corpus construction or major refreshes.

What to write into your dataset manifest

Write back: analysis_id (for audit), content_trust_score (the numeric you'll filter on), recommended_action (your accept/reject decision), and evidence categories (so you can later analyze WHY chunks failed and adjust your crawl). Store the rejected chunks too; rejection telemetry is how you discover which source domains are degrading over time.

A concrete example

Setup: A team building a customer-support assistant fine-tuned on 4M scraped help-center articles. Initial eval showed the assistant gave confident-but-wrong answers ~12% of the time, often citing 'the article' without specifics.

Result: After re-running the training set through a content_trust_score ≥ 0.65 filter, 38% of chunks were rejected. The retrained model's confident-wrong rate dropped to 4%, and its answers cited specific procedures and ticket numbers rather than generic 'consult your documentation' fallbacks. The filtered dataset was smaller, but it was the smaller dataset that worked.

FAQ

Does this replace dedup, PII scanning, or license verification?

No. Run dedup, PII filtering, and license/copyright checks before VeracityAPI scoring — you don't want to score chunks you're going to discard for other reasons. VeracityAPI is the final quality gate.

Will the filter make my dataset too small?

Probably not, but it's a fair question. For general-domain training, expect 60–85% chunk acceptance. For domain-specific training (medical, legal, financial), accept rates drop because slop is overrepresented in web crawls of those domains. The remaining dataset is almost always more useful than the unfiltered one.

Can I score chunks that include code blocks or markdown?

Yes. The model is robust to code/markdown formatting. Score the raw text; don't pre-format or pre-summarize.

Training-data batch curation

// Batch-score chunks for training-data curation.
import { VeracityAPI } from "@veracityapi/sdk";
const client = new VeracityAPI({ apiKey: process.env.VERACITY_API_KEY });

async function curateChunks(chunks: TextChunk[]) {
  const results = await client.analyzeBatch({
    items: chunks.map(c => ({
      type: "text",
      content: c.text,
      context: { format: "article", intended_use: "train", domain: c.sourceDomain },
      store_content: false,
    })),
  });

  return chunks.map((chunk, i) => {
    const r = results.items[i];
    return {
      ...chunk,
      analysis_id: r.analysis_id,
      content_trust_score: r.content_trust_score,
      recommended_action: r.recommended_action,
      evidence_categories: r.evidence.map(e => e.type),
      accepted: r.recommended_action === "allow",
    };
  });
}

Agent policy

For training and RAG, set the threshold stricter than you would for publishing. The cost of one bad chunk in the corpus amortizes over millions of inference calls; the cost of excluding a marginal chunk is small.

Docs

Auth, schemas, privacy, examples, and action policy.

MCP

Claude Desktop, Claude.ai custom connectors, Cursor, and compatible MCP clients.

For agents

Policy guidance for autonomous workflows.

Pricing

Usage-based prepaid credits and volume support.