Garbage-in prevention

Training-data curation

Filter your training corpus or RAG index before embeddings, not after fine-tuning fails. Generic boilerplate teaches your model to write generic boilerplate; specific, well-sourced text teaches it to write specifically and cite. Same compute, very different output.

What we've seen in practice

There's a counterintuitive thing about training-data curation that took me a while to internalize: the score distribution of crawled web data is bimodal. You get a fat cluster of low-trust generic content and a smaller cluster of high-trust specific content, with relatively little in between. Which means the right threshold is usually fairly aggressive — accepting only the top quartile is often the right call for fine-tuning, especially below a few billion parameters. You'd think 'I need more data,' but you really need less, better.
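
One way to turn the "top quartile" rule of thumb into a number is to take the 75th-percentile score from a sample batch and use it as the acceptance cutoff. The sketch below assumes scores are numbers between 0 and 1 where higher means more trustworthy. The function name, the nearest-rank percentile method, and the sample scores are illustrative choices, not part of VeracityAPI.

```javascript
// Nearest-rank cutoff: returns the score at or above which roughly the top
// `keepFraction` of chunks fall. Assumes higher score = higher trust.
function topFractionCutoff(scores, keepFraction = 0.25) {
  if (scores.length === 0) throw new Error("no scores supplied");
  const sorted = [...scores].sort((a, b) => a - b);
  // Index of the first score that belongs to the kept fraction.
  const idx = Math.min(
    sorted.length - 1,
    Math.floor(sorted.length * (1 - keepFraction))
  );
  return sorted[idx];
}

// Made-up bimodal sample: a large low cluster and a small high cluster.
const sampleScores = [0.12, 0.15, 0.18, 0.2, 0.22, 0.25, 0.81, 0.88];
const cutoff = topFractionCutoff(sampleScores);
const kept = sampleScores.filter((s) => s >= cutoff);
console.log(cutoff, kept.length); // 0.81 2
```

With a bimodal distribution the cutoff tends to land in the gap between the two clusters, so small changes to `keepFraction` move few chunks across the line.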

Business value

  • Improves downstream model quality at a fraction of the cost of buying a better dataset. The 80/20 of dataset quality is filtering, not collecting.
  • Prevents site-template boilerplate from contaminating embeddings — the failure mode where every cluster in your vector store ends up about 'modern solutions.'
  • Creates auditable acceptance criteria you can point at when ML safety, legal, or compliance asks what's in your training set.

Agent job to be done

Be a data curator with a quality bar. Keep high-trust examples. Quarantine medium-risk chunks for sample review. Reject generic and weak-provenance text: in training, the influence of one bad chunk is repeated across millions of inference calls.

format: article
intended_use: train
domain: training-data curation / RAG hygiene

Where this fits next to PII, copyright, and source-authorization checks

Slop filtering and PII filtering live at different stages of the pipeline. Run PII detection (Presidio or equivalent) on raw chunks before VeracityAPI scoring — you don't want to score chunks you're going to discard anyway. Copyright/source-rights checks happen earlier, at the crawl manifest level. VeracityAPI's role is the final quality gate before embeddings: 'we're allowed to use this; we've redacted PII; is it actually worth using?'
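
The ordering above can be sketched as a short-circuiting chain, so that chunks dropped by an earlier gate are never sent for scoring. Every gate function here is a hypothetical stand-in supplied by the caller (in a real pipeline the PII and scoring gates would be asynchronous service calls), and the stub rules and URLs are made up for illustration.

```javascript
// Run the gates in pipeline order. `gates` holds caller-supplied functions;
// none of them is a real VeracityAPI or Presidio call.
function runGates(chunk, gates) {
  // 1. Source rights are decided at the crawl-manifest level.
  if (!gates.sourceAuthorized(chunk.sourceUrl)) {
    return { status: "dropped", reason: "source_not_authorized" };
  }
  // 2. PII handling on the raw chunk (this stub returns null to discard).
  const redacted = gates.redactPii(chunk.text);
  if (redacted === null) {
    return { status: "dropped", reason: "pii" };
  }
  // 3. Only surviving text reaches the quality scorer.
  return { status: "scored", result: gates.score(redacted) };
}

// Stub gates that count how many chunks actually get scored.
let scoreCalls = 0;
const stubGates = {
  sourceAuthorized: (url) => url.startsWith("https://docs.example.com/"),
  redactPii: (text) => (text.includes("SSN") ? null : text),
  score: (_text) => {
    scoreCalls += 1;
    return { recommended_action: "allow" };
  },
};

const outcomes = [
  { sourceUrl: "https://docs.example.com/a", text: "Hold the reset button for 10 seconds." },
  { sourceUrl: "https://spam.example.net/b", text: "Modern solutions for modern problems." },
  { sourceUrl: "https://docs.example.com/c", text: "Customer SSN on file." },
].map((chunk) => runGates(chunk, stubGates));

console.log(outcomes.map((o) => o.status), scoreCalls);
```

Only the first chunk passes both earlier gates, so the scorer is called once for three inputs.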

When to call VeracityAPI

During dataset construction, before embeddings, before tokenization for fine-tuning, before nightly RAG index rebuilds.

What text to submit

Document title, body chunk (typically 256–1024 tokens after chunking), source URL or document path, publication date, and any source-quality metadata you have. Store metadata in your pipeline; submit only the text plus context to VeracityAPI.
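
A minimal sketch of that split, assuming a pipeline record with hypothetical field names (`id`, `title`, `body`, `sourceUrl`, `publishedAt`). Joining title and body into a single content string is an assumption made here; the context values are copied from this use case's request template.

```javascript
// Split a pipeline record into (a) the payload sent for scoring and
// (b) the metadata that stays in your own manifest.
function splitForSubmission(record) {
  const payload = {
    type: "text",
    // Assumption: title and body are joined into one content string.
    content: `${record.title}\n\n${record.body}`,
    context: {
      format: "article",
      intended_use: "train",
      domain: "training-data curation / RAG hygiene",
    },
    store_content: false,
  };
  // Kept locally and re-attached to the score by `id` afterwards.
  const localMetadata = {
    id: record.id,
    sourceUrl: record.sourceUrl,
    publishedAt: record.publishedAt,
  };
  return { payload, localMetadata };
}

const split = splitForSubmission({
  id: "doc1#0",
  title: "Resetting your router",
  body: "Hold the reset button for 10 seconds.",
  sourceUrl: "https://docs.example.com/router-reset",
  publishedAt: "2024-01-15",
});
console.log(Object.keys(split.payload), Object.keys(split.localMetadata));
```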

Decision policy

  • allow: low-risk chunks enter the training/RAG corpus. Cost-effective at scale.
  • human_review: medium-risk chunks are quarantined for sample review. Training writes content into the model in one shot, so the threshold should be stricter than it would be for publishing.
  • reject: high risk, OR provenance_weakness ≥ 0.70, OR the chunk is from a source domain you've already flagged as low-trust.
  • Dedup combo: pair scoring with similarity filtering. Repeated generic chunks (site footers, recurring intros) will all score similarly bad — you only need to flag the pattern once.
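
The allow / human_review / reject rules above could be expressed as one function. This is a sketch under stated assumptions: `risk_level` is a hypothetical field name holding "low", "medium", or "high" (the page does not name the field that carries the risk band), `provenance_weakness` is taken to be a number between 0 and 1, and `lowTrustDomains` is a blocklist you maintain yourself.

```javascript
// Map one scoring result onto the three policy outcomes.
function decide(result, sourceUrl, lowTrustDomains) {
  const host = new URL(sourceUrl).hostname;
  if (
    result.risk_level === "high" ||
    result.provenance_weakness >= 0.7 ||
    lowTrustDomains.has(host)
  ) {
    return "reject";
  }
  if (result.risk_level === "medium") return "human_review";
  if (result.risk_level === "low") return "allow";
  // Unknown or missing risk level: fail closed instead of admitting the chunk.
  return "human_review";
}

const lowTrust = new Set(["spam.example.net"]);
console.log(
  decide({ risk_level: "low", provenance_weakness: 0.2 }, "https://docs.example.com/a", lowTrust),
  decide({ risk_level: "low", provenance_weakness: 0.7 }, "https://docs.example.com/a", lowTrust),
  decide({ risk_level: "low", provenance_weakness: 0.1 }, "https://spam.example.net/x", lowTrust)
); // allow reject reject
```

Checking the reject conditions first means a low-risk score cannot rescue a chunk with weak provenance or a flagged source domain.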

Request template

The exact payload shape this use case sends. The sample below uses representative content for this workflow; substitute your own.

curl https://api.veracityapi.com/v1/analyze \
  -H "Authorization: Bearer $VERACITY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"type":"text","content":"Travelers need to be careful because scams can happen in many different places. It is important to research before you go and always use common sense.","context":{"format":"article","intended_use":"train","domain":"training-data curation / RAG hygiene"},"store_content":false}'
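
For pipelines written in JavaScript, the same request can be sketched with the global `fetch` available in Node.js 18 and later. The URL, headers, and payload fields are copied from the curl command above. The function names are illustrative, and the sketch assumes the response body is JSON without assuming anything about its fields.

```javascript
const ANALYZE_URL = "https://api.veracityapi.com/v1/analyze";

// Build the same request the curl command sends, as arguments for fetch().
function buildAnalyzeRequest(content, apiKey) {
  return {
    url: ANALYZE_URL,
    init: {
      method: "POST", // curl's -d flag implies POST
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        type: "text",
        content,
        context: {
          format: "article",
          intended_use: "train",
          domain: "training-data curation / RAG hygiene",
        },
        store_content: false,
      }),
    },
  };
}

// Send one chunk and return the parsed response.
// Assumption: a non-2xx status should stop the pipeline step with an error.
async function analyzeChunk(content, apiKey = process.env.VERACITY_API_KEY) {
  const { url, init } = buildAnalyzeRequest(content, apiKey);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`analyze failed: HTTP ${response.status}`);
  }
  return response.json();
}

// Usage (requires a valid key, so it is left commented out):
// analyzeChunk("Travelers need to be careful...").then(console.log);
```

Keeping the request builder separate from the network call makes the payload shape easy to unit-test without an API key.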

Automation recipe

  • Collect candidate documents from your sources (web crawl, internal docs, third-party datasets).
  • Strip boilerplate (readability extractor) and chunk by paragraph or fixed-token windows with overlap.
  • Score each chunk with intended_use=train. Use the batch endpoint when you have 5+ chunks from the same document.
  • Write content_trust_score, recommended_action, and evidence categories into your dataset manifest (JSONL is fine).
  • Only allow-tagged chunks export to the embeddings job or fine-tune run.
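
The last two steps of the recipe might look like the following sketch. It assumes each result carries `content_trust_score`, `recommended_action`, and an `evidence` array whose items have a `category` string. That evidence shape, the manifest column names, and the sample scores are assumptions for illustration.

```javascript
// Serialize one scored chunk as a JSONL manifest line.
function toManifestLine(chunkId, result) {
  return JSON.stringify({
    chunk_id: chunkId,
    content_trust_score: result.content_trust_score,
    recommended_action: result.recommended_action,
    // Store each evidence category once per chunk.
    evidence_categories: [
      ...new Set((result.evidence ?? []).map((e) => e.category)),
    ],
  });
}

// Read a JSONL manifest and return only the chunk ids tagged "allow".
function exportableIds(manifestJsonl) {
  return manifestJsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((row) => row.recommended_action === "allow")
    .map((row) => row.chunk_id);
}

// Made-up results for two chunks of one document.
const scored = [
  ["doc1#0", { content_trust_score: 0.82, recommended_action: "allow", evidence: [] }],
  ["doc1#1", {
    content_trust_score: 0.31,
    recommended_action: "reject",
    evidence: [{ category: "generic_filler" }, { category: "generic_filler" }],
  }],
];
const manifest = scored.map(([id, result]) => toManifestLine(id, result)).join("\n");
const exportIds = exportableIds(manifest);
console.log(exportIds); // [ 'doc1#0' ]
```

Because the export step reads only the manifest, the same file doubles as the audit record of why each chunk was kept or dropped.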

Evidence spans agents should inspect

  • 'generic_filler' — the educational-blog cadence that's overrepresented in web crawl data
  • 'unsupported_claim' — assertions without citation, which teach the model to make confident unsourced claims
  • 'site_boilerplate' — recurring text patterns that got past your readability extractor
  • 'low_information_density' — paragraphs that have high token count but low semantic content
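
To see which of these categories dominates a crawl batch, an agent could tally them across results. The sketch assumes each result has an `evidence` array of objects with a `category` string; that shape is a guess at the response format, and the batch below is made up.

```javascript
// Count how often each evidence category appears across a batch of results.
function tallyEvidence(results) {
  const counts = new Map();
  for (const result of results) {
    for (const item of result.evidence ?? []) {
      counts.set(item.category, (counts.get(item.category) ?? 0) + 1);
    }
  }
  return counts;
}

const batch = [
  { evidence: [{ category: "site_boilerplate" }, { category: "generic_filler" }] },
  { evidence: [{ category: "site_boilerplate" }] },
  { evidence: [] },
];
// Most frequent category first.
const ranked = [...tallyEvidence(batch).entries()].sort((a, b) => b[1] - a[1]);
console.log(ranked);
```

A batch dominated by 'site_boilerplate' usually points at the extraction step rather than at the source content itself.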

A concrete example

Setup: A team building a customer-support assistant fine-tuned on 4M scraped help-center articles. Initial eval showed the assistant gave confident-but-wrong answers ~12% of the time, often citing 'the article' without specifics.

Result: After re-running the training set through a content_trust_score ≥ 0.65 filter, 38% of chunks were rejected. The retrained model's confident-wrong rate dropped to 4%, and its answers cited specific procedures and ticket numbers rather than generic 'consult your documentation' fallbacks. The filtered dataset was smaller, but it was the smaller dataset that worked.

Policy pseudocode

// Handler functions are placeholders for your own pipeline steps.
const action = result.recommended_action;
if (action === "allow") continueWorkflow();
else if (action === "revise") rewriteWith(result.evidence, result.recommended_fixes);
else if (action === "human_review") queueForHumanReview(result);
else if (action === "reject") discardOrRebuild();
else queueForHumanReview(result); // unrecognized action: fail closed

KPIs to track

  • dataset acceptance rate (typical healthy steady-state: 40–70% depending on source quality)
  • number of weak chunks removed per crawl batch
  • downstream eval improvement after fine-tune (compare allow-only vs. unfiltered baseline)
  • RAG answer specificity on a held-out probe set
  • manual data-review hours saved per million chunks
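
The first two KPIs can be computed directly from a batch's manifest rows. A minimal sketch, assuming each row has a `recommended_action` field as described in the automation recipe; the function name and the sample batch are illustrative.

```javascript
// Summarize one crawl batch from its manifest rows.
function batchKpis(rows) {
  const count = (action) =>
    rows.filter((r) => r.recommended_action === action).length;
  const total = rows.length;
  const accepted = count("allow");
  return {
    total,
    accepted,
    rejected: count("reject"),
    inReview: count("human_review"),
    // Report 0 for an empty batch rather than dividing by zero.
    acceptanceRate: total === 0 ? 0 : accepted / total,
  };
}

// Made-up batch: 6 allow, 3 reject, 1 human_review.
const rows = [
  ...Array(6).fill({ recommended_action: "allow" }),
  ...Array(3).fill({ recommended_action: "reject" }),
  { recommended_action: "human_review" },
];
const kpis = batchKpis(rows);
console.log(kpis.acceptanceRate, kpis.rejected); // 0.6 3
```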

What can go wrong

  • This is not a full data-governance system. It does not check copyright, license rights, or PII, and it does not assess your legal exposure from training on copyrighted data.
  • Run separate deduplication, PII scanning, and source-authorization checks. Slop filtering is one gate of several.
  • For domain-specific training (medical, legal, financial), tune your accept threshold higher than the default — generic web text contaminates these domains faster than general ones.

Cost and latency notes

Analyze only is $0.005 per 1,000 characters; Analyze + revise with auto_revise=true is $0.010 per 1,000 characters. Both are billed in 1,000-character increments, rounded up. Short captions/emails usually cost $0.005; longer pages or chapters scale linearly by length. Current v0.1 latency is LLM-bound, so batch/concurrent orchestration is recommended for high-volume pipelines.
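
As a worked example of the pricing above, the sketch below estimates the cost of a single request. It assumes rounding is applied per request and that an empty submission costs nothing; neither detail is stated on this page. Prices are held as integer thousandths of a dollar to avoid floating-point drift.

```javascript
// Estimate the cost of one request in US dollars.
function estimateCostUsd(charCount, autoRevise = false) {
  // Billing units are 1,000-character blocks, rounded up.
  const units = Math.ceil(charCount / 1000);
  // $0.005 per unit for analyze only, $0.010 with auto_revise=true.
  const milliDollarsPerUnit = autoRevise ? 10 : 5;
  return (units * milliDollarsPerUnit) / 1000;
}

console.log(estimateCostUsd(400));        // 0.005 (short text, one unit)
console.log(estimateCostUsd(2500));       // 0.015 (three units)
console.log(estimateCostUsd(2500, true)); // 0.03  (three units with revise)
```

Under these assumptions, chunk sizes just below a 1,000-character boundary waste the least billing headroom.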

Agent evaluation checklist