Generate synthetic documents for OCR and document AI.
Desktop software for Windows and Linux. 33 document types. Controllable OCR degradation. Zero real PII. Runs entirely offline.
Nothing leaves your machine.
Built for the people who build document AI.
See the output before you commit.
Download 50 sample documents generated by DocSet Generator: multiple types, clean and corrupted. No email required.
- PDFs across 10+ document types
- Clean and OCR-degraded versions side-by-side
- Corrupt text layer examples
- Native formats: .docx, .eml, .csv
- 50 documents total, ready to inspect
// 50 documents · .7z archive · No signup required
Clean interface.
No configuration required.
Select document types, set your options, run. Estimated output count updates in real time. Status is always visible.
See it in action.
Everything you need to stress-test a document pipeline.
OCR Corruption Slider
Control exactly how hard your model has to work. From clean ground truth to heavily degraded scans — continuous spectrum, not presets. Realistic character substitutions based on actual OCR failure patterns.
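To make the idea of a continuous degradation level concrete, here is a minimal Python sketch of substitution-based OCR noise. It is an illustration only, not DocSet Generator's implementation: the `OCR_CONFUSIONS` table, the `degrade` function, and the `level` and `seed` parameters are all hypothetical names chosen for this example.

```python
import random

# Hypothetical confusion table of common OCR look-alike errors.
# An illustrative assumption, not the product's actual substitution data.
OCR_CONFUSIONS = {
    "0": ["O"], "O": ["0"],
    "1": ["l", "I"], "l": ["1", "I"], "I": ["l", "1"],
    "5": ["S"], "S": ["5"],
    "8": ["B"], "B": ["8"],
    "m": ["rn"], "e": ["c"],
}


def degrade(text: str, level: float, seed: int = 0) -> str:
    """Swap each confusable character with probability `level` (0.0 to 1.0)."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so a given level is reproducible
    out = []
    for ch in text:
        options = OCR_CONFUSIONS.get(ch)
        if options and rng.random() < level:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for level in (0.0, 0.3, 1.0):
        print(level, degrade("Invoice 1005, Total: $850", level, seed=42))
```

At level 0.0 the text comes back unchanged, which gives you the clean ground truth; at 1.0 every character in the table is substituted. Anything in between is a probability rather than a preset.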
Corrupt Text Layer
Clean visual page, corrupted hidden text layer. Forces OCR fallback on tools that read embedded text directly. Tests the gap between what a document looks like and what an extractor actually reads.
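One way to quantify that gap is the character error rate (CER) between the text a page visibly shows and the text an extractor returns. The sketch below assumes you already have both strings; `edit_distance` and `char_error_rate` are hypothetical helper names, and the extraction step itself is left to whatever tool you are testing.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, extracted: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        return 0.0 if not extracted else 1.0
    return edit_distance(reference, extracted) / len(reference)


if __name__ == "__main__":
    visible = "Total due: 1050.00"   # what the page shows (ground truth)
    embedded = "Total due: l05O.OO"  # example of a corrupted text layer
    print(f"CER: {char_error_rate(visible, embedded):.2f}")
```

A tool that trusts embedded text should score badly on a corrupt-text-layer document, while one that falls back to OCR on the clean page image should score close to zero.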
Bates Stamping
Sequential review-style identifiers on every page. Essential for eDiscovery and legal document AI pipelines. Generates at scale without manual numbering.
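As a rough illustration of sequential page identifiers, the sketch below yields one Bates-style number per page. The prefix, the zero-padded width, and the `bates_numbers` function are assumptions made for this example; the source does not document the exact format DocSet Generator uses.

```python
from typing import Iterator


def bates_numbers(prefix: str, start: int, page_count: int,
                  width: int = 6) -> Iterator[str]:
    """Yield one zero-padded sequential identifier per page."""
    if start < 0 or page_count < 0:
        raise ValueError("start and page_count must be non-negative")
    for n in range(start, start + page_count):
        # Numbers wider than `width` are not truncated, only unpadded.
        yield f"{prefix}{n:0{width}d}"


if __name__ == "__main__":
    print(list(bates_numbers("ABC", 1, 3)))
    # ['ABC000001', 'ABC000002', 'ABC000003']
```

Carrying the counter across documents (the next document starts where the last one ended) is what keeps identifiers unique across a whole production set.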
Image-Only PDFs
Flatten documents to pure image — no selectable text layer. Tests OCR systems that can't rely on embedded text as a fallback. True visual-only extraction challenge.
Zero Real PII
All names, addresses, companies, SSNs, account numbers, and financial figures are synthetically generated. Figures are mathematically consistent but entirely fake. Safe for any environment.
Runs On-Prem
Windows .exe or Linux AppImage / Flatpak. Tested on Ubuntu. No internet required after install. No data sent to any server. No telemetry, no license checks.
33 document types.
Every domain your pipeline will encounter — general office, business finance, legal and government, data formats, and native file types.
- Letter
- Memo
- Report
- Fax
- Meeting Notes
- Scheduler
- Transmittal
- Mass CC Email
- Invoice
- Receipt
- Check
- Financial
- Corporate
- Presentation
- Real Estate
- Agreement
- Court Document
- Government
- Patent
- Certificate
- Form
- Media
- Documentation
- Personal Info
- Publication
- Table / List
- Transcript
- Word (.docx)
- Excel (.xlsx)
- Email (.eml)
- CSV (.csv)
- Plain Text (.txt)
One price.
No subscriptions.
No usage limits.
Built for ML engineers and QA teams who need training data now, not after a procurement process.
Buy once. Generate as many documents as you need. Keep every update.
Questions? hello@docsetgenerator.com
- Windows .exe · Linux AppImage / Flatpak
- 50-document sample pack included
- 33 document types across 5 categories
- OCR degradation slider (0–100%)
- Corrupt text layer mode
- Bates stamping + watermarks
- Image-only PDF generation
- Adaptive parallel processing
- Free updates — re-download anytime
- Runs fully offline — no data leaves machine
Or download the free sample first