Synthetic OCR Training Data

Generate synthetic documents for OCR and document AI.

Desktop software for Windows and Linux. 33 document types. Controllable OCR degradation. Zero real PII. Runs entirely offline.

Purchase: $199 →
One-time purchase. Free updates. Windows .exe & Linux AppImage / Flatpak.
Nothing leaves your machine.
Live corruption demo: receipt_Corrupt_Etext.pdf (OCR degradation 0%)
33 document types · 300 docs / second (clean mode) · 0 real PII generated · Local: nothing leaves your machine

Built for the people who build document AI.

We needed 10k labeled documents in two days for a fine-tuning run. DocSet Generator got us there before lunch. The corrupt text layer mode is genuinely clever.
— Senior ML Engineer, Document Intelligence Startup
Every OCR pipeline I've tested has hidden assumptions about text quality. The degradation slider is the fastest way I've found to expose them.
— NLP Research Lead, Enterprise AI Platform
The zero-PII guarantee matters enormously for our legal clients. We can generate realistic case documents for extraction testing without touching real data.
— QA Automation Engineer, LegalTech Company

See the output before you commit.

Download 50 sample documents generated by DocSet Generator, covering multiple types in both clean and corrupted versions. No email required.

  • PDFs across 10+ document types
  • Clean and OCR-degraded versions side-by-side
  • Corrupt text layer examples
  • Native formats: .docx, .eml, .csv
  • 50 documents total, ready to inspect
↓ Download Free Sample Pack

// 50 documents · .7z archive · No signup required

Receipt — OCR Degraded 15%
Receipt — Clean 100%

Clean interface.
No configuration required.

Select document types, set your options, run. Estimated output count updates in real time. Status is always visible.

Main Interface: 33 document types organized by category. Select individual types or entire families.
Generation Complete: COMPLETE status, generation time, and output folder, ready to open.
300 documents per second (clean mode · 8 workers)
1.1s to generate 330 documents (clean mode · tested locally)
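
For rough planning, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes throughput stays constant at the quoted clean-mode rate, which is an assumption and not a measured guarantee for other modes or hardware. `estimate_seconds` is a hypothetical helper, not part of DocSet Generator.

```python
def estimate_seconds(doc_count: int, docs_per_second: float = 300.0) -> float:
    """Estimate wall-clock seconds, assuming constant throughput (an assumption)."""
    if doc_count < 0 or docs_per_second <= 0:
        raise ValueError("doc_count must be >= 0 and docs_per_second must be > 0")
    return doc_count / docs_per_second


# 330 documents at the quoted 300 documents per second.
print(round(estimate_seconds(330), 2))
# A hypothetical 10,000-document run at the same rate.
print(round(estimate_seconds(10_000), 1))
```

Degraded or image-only output will likely run slower than clean mode, so treat the result as a lower bound.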

See it in action.

DocSet Generator — demo.mp4

Everything you need to stress-test
a document pipeline.

01 Degradation

OCR Corruption Slider

Control exactly how hard your model has to work. From clean ground truth to heavily degraded scans — continuous spectrum, not presets. Realistic character substitutions based on actual OCR failure patterns.
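
To make the idea of a continuous degradation level concrete, here is a minimal sketch of substitution-based corruption. It is not DocSet Generator's implementation: the `CONFUSIONS` table is a small illustrative sample of commonly confused glyphs, and `degrade` and its `level` parameter are hypothetical names chosen for this example.

```python
import random
from typing import Optional

# Illustrative confusion pairs only; the product's real substitution table is not documented here.
CONFUSIONS = {
    "0": "O", "O": "0", "1": "l", "l": "1", "I": "l",
    "5": "S", "S": "5", "8": "B", "B": "8", "m": "rn",
}


def degrade(text: str, level: float, seed: Optional[int] = None) -> str:
    """Swap each confusable character with probability `level` (0.0 to 1.0)."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < level:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


sample = "Invoice 10058 Total: $815.00"
print(degrade(sample, 0.0))            # level 0 leaves the text unchanged
print(degrade(sample, 0.15, seed=7))   # a light degradation pass
print(degrade(sample, 1.0))            # every confusable character is swapped
```

Keeping the clean string alongside each degraded variant gives you ground truth for measuring character error rate at each level.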

02 Stealth

Corrupt Text Layer

Clean visual page, corrupted hidden text layer. Forces OCR fallback on tools that read embedded text directly. Tests the gap between what a document looks like and what an extractor actually reads.
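
One way a test harness might catch this gap is to compare the embedded text against OCR output from the rendered page and distrust the text layer when the two disagree. The sketch below assumes both strings have already been extracted by your own tooling. The function names and the 0.8 threshold are hypothetical choices, not product settings.

```python
from difflib import SequenceMatcher


def text_layer_agreement(embedded_text: str, ocr_text: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between the two word sequences."""
    embedded_words = embedded_text.lower().split()
    ocr_words = ocr_text.lower().split()
    return SequenceMatcher(None, embedded_words, ocr_words).ratio()


def should_fall_back_to_ocr(embedded_text: str, ocr_text: str,
                            threshold: float = 0.8) -> bool:
    """Flag the embedded layer as untrustworthy when agreement is below threshold."""
    return text_layer_agreement(embedded_text, ocr_text) < threshold


visual = "Total due: $41.95 payable within 30 days"
corrupt_layer = "T0tal du3: $4l.9S payab1e w1thin 3O d4ys"

print(should_fall_back_to_ocr(visual, visual))          # layers agree
print(should_fall_back_to_ocr(corrupt_layer, visual))   # layers disagree
```

Word-level comparison is deliberately strict: a single corrupted character makes the whole word count as a mismatch, which suits a check meant to surface tampered or low-quality text layers.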

03 Legal

Bates Stamping

Sequential review-style identifiers on every page. Essential for eDiscovery and legal document AI pipelines. Generates at scale without manual numbering.
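
As an illustration of sequential page identifiers, the sketch below numbers pages continuously across a set of documents. The prefix, zero-padding width, and the `stamp_documents` helper are assumptions made for this example; the identifier format DocSet Generator actually emits is not specified here.

```python
from typing import List


def stamp_documents(page_counts: List[int], prefix: str = "DOC",
                    start: int = 1, width: int = 6) -> List[List[str]]:
    """Assign one zero-padded sequential identifier per page, continuing across documents."""
    next_number = start
    stamped = []
    for count in page_counts:
        labels = [f"{prefix}{n:0{width}d}"
                  for n in range(next_number, next_number + count)]
        stamped.append(labels)
        next_number += count  # the next document continues the sequence
    return stamped


# A 2-page document followed by a 3-page document.
for labels in stamp_documents([2, 3], prefix="ACME"):
    print(labels[0], "to", labels[-1])
```

Because the sequence never resets between documents, each identifier maps to exactly one page in the set, which is the property extraction tests usually need to verify.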

04 Pipeline

Image-Only PDFs

Flatten documents to pure image — no selectable text layer. Tests OCR systems that can't rely on embedded text as a fallback. True visual-only extraction challenge.

05 Privacy

Zero Real PII

All names, addresses, companies, SSNs, account numbers, and financial figures are synthetically generated. Mathematically accurate but entirely fake. Safe for any environment.
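
To show what "mathematically accurate but entirely fake" can look like in practice, here is a minimal sketch that builds an invoice whose totals add up and an SSN-shaped value that cannot be a real one. It is an illustration under stated assumptions, not the product's generator: `fake_ssn`, `fake_invoice`, and the 8% tax rate are hypothetical. The SSN approach relies on the fact that area numbers 000 and 666 are never assigned.

```python
import random
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def fake_ssn(rng: random.Random) -> str:
    """Return an SSN-shaped string using an area number that is never issued."""
    area = rng.choice(["000", "666"])  # never assigned, so no collision with a real SSN
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area}-{group:02d}-{serial:04d}"


def fake_invoice(rng: random.Random, tax_rate: Decimal = Decimal("0.08")) -> dict:
    """Build line items whose amounts, subtotal, tax, and total are consistent."""
    lines = []
    for _ in range(rng.randint(2, 5)):
        qty = rng.randint(1, 10)
        unit_price = Decimal(rng.randint(100, 50_000)) / 100  # exact cents
        lines.append({"qty": qty, "unit_price": unit_price,
                      "amount": qty * unit_price})
    subtotal = sum((line["amount"] for line in lines), Decimal("0"))
    tax = (subtotal * tax_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"lines": lines, "subtotal": subtotal, "tax": tax,
            "total": subtotal + tax}


rng = random.Random(2024)  # seeded for reproducible test fixtures
invoice = fake_invoice(rng)
print(fake_ssn(rng))
print(invoice["subtotal"], invoice["tax"], invoice["total"])
```

Using `Decimal` instead of floats keeps every figure exact to the cent, so an extraction test can assert that the parsed total equals the parsed subtotal plus tax.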

06 Local

Runs On-Prem

Windows .exe or Linux AppImage / Flatpak. Tested on Ubuntu. No internet required after install. No data sent to any server. No telemetry, no license checks.

33 document types.

Every domain your pipeline will encounter — general office, business finance, legal and government, data formats, and native file types.

General 7 types
  • Letter
  • Memo
  • Report
  • Fax
  • Meeting Notes
  • Scheduler
  • Transmittal
Business 9 types
  • Email
  • Mass CC Email
  • Invoice
  • Receipt
  • Check
  • Financial
  • Corporate
  • Presentation
  • Real Estate
Legal & Govt 6 types
  • Agreement
  • Court Document
  • Government
  • Patent
  • Certificate
  • Form
Data & Misc 6 types
  • Media
  • Documentation
  • Personal Info
  • Publication
  • Table / List
  • Transcript
Native Formats 5 types
  • Word (.docx)
  • Excel (.xlsx)
  • Email (.eml)
  • CSV (.csv)
  • Plain Text (.txt)

One price.
No subscriptions.
No usage limits.

Built for ML engineers and QA teams who need training data now, not after a procurement process.

Buy once. Generate as many documents as you need. Keep every update.

Questions? hello@docsetgenerator.com

$199
One-time purchase · Windows & Linux
  • Windows .exe · Linux AppImage / Flatpak
  • 50-document sample pack included
  • 33 document types across 5 categories
  • OCR degradation slider (0–100%)
  • Corrupt text layer mode
  • Bates stamping + watermarks
  • Image-only PDF generation
  • Adaptive parallel processing
  • Free updates — re-download anytime
  • Runs fully offline — no data leaves machine
Purchase Now →

Or download the free sample first

Common questions.

Anything else? hello@docsetgenerator.com

Is any of the generated data real?
No. All names, addresses, companies, SSNs, EINs, account numbers, and financial figures are synthetically generated. The math is accurate but every entity is entirely fabricated. Safe to use in any environment.
Does it require an internet connection?
Only for the initial download. After that it runs entirely offline. No telemetry, no license servers, no API calls. Your generated data never leaves your machine.
What's the difference between OCR Degradation and Corrupt Text Layer?
OCR Degradation corrupts the visible text: characters are substituted, so the page itself becomes harder to read. Corrupt Text Layer keeps the page looking clean but corrupts the hidden embedded text layer, forcing tools to fall back to visual OCR. The two modes can be used independently or together.
What OS does it run on?
Windows and Linux. Windows ships as a standalone .exe; Linux ships as an AppImage or Flatpak. Tested on Ubuntu. No Python installation or dependencies required on either platform.
What happens when you release updates?
Updates are free for all existing customers. Re-download the latest version from your original purchase link anytime.
Can I use the generated documents for commercial ML training?
Yes. Documents generated by DocSet Generator are yours to use however you need, including as training data for commercial ML products.