Document AI Training Data

Generate realistic synthetic documents at scale.

33 document types. Controllable OCR degradation. Zero real PII. Built for the teams training the next generation of document AI.

Purchase: $199 →
One-time purchase. Free updates. Runs locally; nothing leaves your machine.
court_document.pdf
STATE OF CALIFORNIA
LOS ANGELES COUNTY SUPERIOR COURT
━━━━━━━━━━━━━━━━━━━━━━━
ASHLEY VINCENT, Plaintiff,
Case No.: 2020-CV-3360
invoice.pdf
INVOICE #7291
Date: March 24, 2026
Subtotal: $12,450.00
Tax (8%): $996.00
TOTAL: $13,446.00
receipt_Corrupt_Etext.pdf — OCR: 15%
RÒB1Ñ$OIN BI<
Dsle: QE/7?|2026 7îlna.
Srih7c+a] $3zQ20
Yoy {/%> $7Z91
TQLÂI $8q2·07
OCR DEGRADATION ACTIVE — 15%
  • 33 Document Types
  • 300 Docs / Second (clean)
  • 0 Real PII Generated
  • 100% Local, Nothing Uploaded

See the output before you commit.

Download 50 actual output documents generated by DocSet Generator, spanning multiple types in both clean and corrupted versions. No email required. No strings attached.

  • PDF documents across 10+ types
  • Clean and OCR-degraded versions
  • Corrupt text layer examples
  • Native formats included (.docx, .eml, .csv)
  • 50 documents total, ready to inspect
↓ Download Free Sample Pack

// 50 documents · .7z archive · No signup required

Receipt — OCR Degraded 15%
Receipt — Clean 100%
The Application

Clean interface. No configuration required.

Main Interface: 33 document types organized by category. Select individual types or entire families. Estimated output count updates in real time.
Generation Complete: the run shows COMPLETE status and the generation time, with the output folder open on timestamped files. In this example, 5 documents with Bates stamping, watermarks, and a corrupt text layer were generated in 1.3 seconds.
Performance

Fast enough to not be your bottleneck.

Mode                         Documents   Time    Workers
Clean (100% quality)         330         1.1s    8
OCR Degraded (50%)           330         11s     8
Corrupt Text Layer           330         11s     4
Corrupt Text Layer (scale)   3,300       104s    4

// Worker count scales automatically based on your CPU and workload type. No configuration required.
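
The throughput implied by these runs can be worked out directly. The short Python sketch below uses only the documents and time figures from the table; the variable and function names are illustrative and not part of the product.

```python
# Figures copied from the performance table: (documents, seconds, workers).
runs = {
    "Clean (100% quality)": (330, 1.1, 8),
    "OCR Degraded (50%)": (330, 11.0, 8),
    "Corrupt Text Layer": (330, 11.0, 4),
    "Corrupt Text Layer (scale)": (3300, 104.0, 4),
}


def docs_per_second(documents: int, seconds: float) -> float:
    """Average throughput for one run."""
    return documents / seconds


# Map each mode to its average documents per second.
throughput = {
    mode: docs_per_second(documents, seconds)
    for mode, (documents, seconds, _workers) in runs.items()
}

for mode, rate in throughput.items():
    print(f"{mode}: {rate:.1f} docs/second")
```

Clean output works out to about 300 documents per second, which matches the headline figure, while the degraded and corrupt-text-layer modes land at roughly 30 documents per second.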

01 — DEGRADATION

OCR Corruption Slider

Control exactly how hard your model has to work. From clean ground truth to heavily degraded scans — continuous spectrum, not presets. Realistic character substitutions based on actual OCR failure patterns.
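
To make the idea of a continuous degradation level concrete, here is a minimal Python sketch of rate-controlled character substitution. It is not DocSet Generator's algorithm: the `CONFUSIONS` table, the `degrade` function, and its parameters are hypothetical, chosen only to show how a single number between 0.0 and 1.0 can drive look-alike swaps.

```python
import random
from typing import Optional

# Hypothetical confusion table of look-alike swaps (an assumption for
# illustration, not the product's actual substitution data).
CONFUSIONS = {
    "O": ["0", "Q"],
    "0": ["O"],
    "l": ["1", "I"],
    "1": ["l", "I"],
    "S": ["5", "$"],
    "5": ["S"],
    "B": ["8"],
    "8": ["B"],
    "e": ["c"],
    "a": ["o"],
}


def degrade(text: str, rate: float, seed: Optional[int] = None) -> str:
    """Swap each confusable character with probability `rate` (0.0 to 1.0)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    out = []
    for ch in text:
        options = CONFUSIONS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


clean = "TOTAL: $13,446.00  Balance 8015"
print(degrade(clean, rate=0.0, seed=1))   # rate 0 leaves the text unchanged
print(degrade(clean, rate=0.15, seed=1))  # light degradation
print(degrade(clean, rate=1.0, seed=1))   # every confusable character swapped
```

Because the original string is kept alongside the degraded one, the clean text can serve as ground truth when scoring a model on the corrupted version.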

02 — STEALTH

Corrupt Text Layer

Clean visual page, corrupted hidden text layer. Forces OCR fallback on tools that read embedded text directly. Tests the gap between what a document looks like and what an extractor actually reads.
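
On the pipeline side, handling this case means deciding when embedded text cannot be trusted. The Python sketch below shows one possible heuristic, assuming the embedded text has already been extracted as a string: it measures how many tokens look like ordinary words or amounts and falls back to OCR below a threshold. The pattern, the function names, and the 0.7 threshold are all assumptions for illustration, not behavior of DocSet Generator or of any particular extractor.

```python
import re

# A token is "word-like" if it is plain ASCII letters, or a number or amount
# with an optional leading $ or #, optionally followed by one punctuation mark.
# This pattern is an illustrative assumption and would need tuning on real data.
WORD_LIKE = re.compile(r"(?:[A-Za-z]+|[$#]?\d[\d,./:-]*%?)[.,:;]?")


def word_like_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like words or amounts."""
    tokens = text.split()
    if not tokens:
        return 0.0  # no embedded text at all, as in an image-only PDF
    hits = sum(1 for token in tokens if WORD_LIKE.fullmatch(token))
    return hits / len(tokens)


def should_fall_back_to_ocr(embedded_text: str, threshold: float = 0.7) -> bool:
    """Return True when the embedded text layer looks too noisy to trust."""
    return word_like_ratio(embedded_text) < threshold


clean_layer = "INVOICE #7291 Date: March 24, 2026 Subtotal: $12,450.00 TOTAL: $13,446.00"
corrupt_layer = "RÒB1Ñ$OIN BI< Dsle: QE/7?|2026 7îlna. Srih7c+a] $3zQ20 Yoy {/%> $7Z91 TQLÂI $8q2·07"

print(should_fall_back_to_ocr(clean_layer))
print(should_fall_back_to_ocr(corrupt_layer))
```

A heuristic like this is deliberately crude. Documents with a clean page and a corrupted text layer are a way to find out where such a check breaks.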

03 — LEGAL

Bates Stamping

Sequential review-style identifiers on every page. Essential for eDiscovery and legal document AI pipelines. Generates at scale without manual numbering.
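
As an illustration of sequential page identifiers, the Python sketch below numbers pages continuously across a set of documents. The prefix-plus-zero-padded-counter format, the function names, and the defaults are hypothetical; the exact identifier format DocSet Generator produces is not documented on this page.

```python
from itertools import count, islice
from typing import Iterator, List


def bates_numbers(prefix: str, start: int = 1, width: int = 6) -> Iterator[str]:
    """Yield an endless sequence such as ABC000001, ABC000002, ..."""
    for n in count(start):
        yield f"{prefix}{n:0{width}d}"


def stamp_pages(
    page_counts: List[int], prefix: str = "DOC", start: int = 1, width: int = 6
) -> List[List[str]]:
    """Assign one identifier per page, continuing the sequence across documents."""
    numbers = bates_numbers(prefix, start, width)
    return [list(islice(numbers, pages)) for pages in page_counts]


# Two documents with 2 and 3 pages share one running sequence.
print(stamp_pages([2, 3], prefix="ABC"))
```

Numbering runs across document boundaries rather than restarting per file, which is what lets a reviewer cite any page in a production set by a single identifier.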

04 — PIPELINE

Image-Only PDFs

Flatten documents to pure image — no selectable text layer at all. Tests OCR systems that can't rely on embedded text as a fallback. True visual-only extraction challenge.

05 — PRIVACY

Zero Real PII

All names, addresses, companies, SSNs, account numbers, and financial figures are synthetically generated. Totals and tax amounts compute correctly, but every entity is fabricated. Safe for any environment.
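
The invoice preview at the top of this page shows what arithmetic consistency means in practice: a subtotal of $12,450.00, 8% tax of $996.00, and a total of $13,446.00. The Python sketch below checks that relationship with exact decimal arithmetic. The function names and the half-up rounding to cents are assumptions made for illustration, not a description of how the generator computes its figures.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Tuple

CENT = Decimal("0.01")


def invoice_totals(subtotal: Decimal, tax_rate: Decimal) -> Tuple[Decimal, Decimal]:
    """Return (tax, total), with tax rounded to the nearest cent (assumed half-up)."""
    tax = (subtotal * tax_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return tax, subtotal + tax


def is_consistent(
    subtotal: Decimal, tax_rate: Decimal, tax: Decimal, total: Decimal
) -> bool:
    """Check that the printed tax and total follow from the subtotal and rate."""
    expected_tax, expected_total = invoice_totals(subtotal, tax_rate)
    return tax == expected_tax and total == expected_total


# Figures from the invoice preview on this page.
print(
    is_consistent(
        subtotal=Decimal("12450.00"),
        tax_rate=Decimal("0.08"),
        tax=Decimal("996.00"),
        total=Decimal("13446.00"),
    )
)
```

A check like this is useful as a label sanity test: an extraction model that reads a total which does not follow from the subtotal and tax has probably misread one of the three fields.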

06 — LOCAL

Runs On-Prem

Windows desktop application. No internet required after install. No data sent to any server. Your training data stays on your machine — critical for regulated industries.

33 document types across every domain your pipeline will encounter.

General

  • Letter
  • Memo
  • Report
  • Fax
  • Meeting Notes
  • Scheduler
  • Transmittal

Business

  • Email
  • Mass CC Email
  • Invoice
  • Receipt
  • Check
  • Financial
  • Corporate
  • Presentation
  • Real Estate

Legal & Govt

  • Agreement
  • Court Document
  • Government
  • Patent
  • Certificate
  • Form

Data & Misc

  • Media
  • Documentation
  • Personal Info
  • Publication
  • Table / List
  • Transcript

Native Formats

  • Word (.docx)
  • Excel (.xlsx)
  • Email (.eml)
  • CSV (.csv)
  • Plain Text (.txt)

One price. No subscriptions. No usage limits.

Built for ML engineers and QA teams who need training data now, not after a procurement process. Buy once, generate as many documents as you need, keep every update.

Questions? Contact us at hello@docsetgenerator.com

$199
One-time purchase · Windows
Purchase Now →

Or download the free sample first

Common questions.

Anything else? Email hello@docsetgenerator.com

Is any of the generated data real?
No. All names, addresses, companies, SSNs, EINs, account numbers, and financial figures are synthetically generated. The math is accurate but every entity is entirely fabricated. Safe to use in any environment.
Does it require an internet connection?
Only for the initial download. After that it runs entirely offline. No telemetry, no license servers, no API calls. Your generated data never leaves your machine.
What's the difference between OCR Degradation and Corrupt Text Layer?
OCR Degradation visually corrupts the document — characters are substituted, text becomes hard to read. Corrupt Text Layer keeps the document looking clean but corrupts the hidden embedded text, forcing tools to fall back to visual OCR. They can be used independently or together.
What OS does it run on?
Currently Windows only. The application is a standalone .exe, so no Python installation or other dependencies are required.
What happens when you release updates?
Updates are free for all existing customers. Re-download the latest version from your original purchase link anytime.
Can I use the generated documents for commercial ML training?
Yes. Documents generated by DocSet Generator are yours to use however you need, including as training data for commercial ML products.