
How to Extract Text From Scanned PDFs (OCR)

Scanned PDFs contain images of text, not actual text. Learn how OCR (Optical Character Recognition) can make scanned documents searchable and editable.

Key Takeaways

  • Optical Character Recognition (OCR) analyzes images of text and converts them into machine-readable characters.
  • If you can't select or search text in a PDF, it's likely a scanned image.
  • OCR quality depends on scan resolution, image quality, language, and font style.
  • OCR output often contains minor errors, especially with unusual fonts or low-quality scans.
  • Modern browser-based OCR uses WebAssembly-compiled engines (like Tesseract.js) that process documents entirely on your device.

What Is OCR?

Optical Character Recognition (OCR) analyzes images of text and converts them into machine-readable characters. When applied to a scanned PDF, OCR adds a hidden text layer behind each page image, making the document searchable and allowing copy-paste.

When You Need OCR

If you can't select or search text in a PDF, it's likely a scanned image. This is common with documents from older scanners, photographed pages, and PDFs created from fax transmissions.
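
The "can't select or search" check can also be done programmatically. The sketch below assumes you have already pulled the text of each page with a PDF library (for example, pypdf's `page.extract_text()`); the function name `looks_scanned` and the 20-character threshold are illustrative choices, not a standard.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is a scanned image from its per-page extracted text.

    page_texts: one string (or None) per page, e.g. collected with a PDF
    library's text extraction. The threshold is an assumed heuristic.
    """
    if not page_texts:
        raise ValueError("need at least one page to judge")
    # Count pages with almost no extractable text.
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    # Call it scanned when more than half of the pages are nearly empty.
    return sparse / len(page_texts) > 0.5
```

A document whose pages all return empty strings is flagged as scanned, while one with a single blank page among text pages is not.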

OCR Accuracy Factors

Several factors affect OCR quality:

  • Scan resolution: 300 DPI minimum; 600 DPI for small text.
  • Image quality: Clean, high-contrast scans produce better results.
  • Language: Latin-script languages can reach 99%+ character accuracy on clean printed scans; CJK scripts and handwriting are harder.
  • Font style: Standard printed fonts are recognized well; decorative fonts less so.
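
To check a scan against the resolution guideline above, divide the image's pixel dimensions by the physical page size in inches. This is a minimal sketch; the function names are hypothetical, and the 300 and 600 DPI targets come from the list above.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Return the scan resolution in dots per inch, using the weaker axis."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def resolution_advice(dpi, small_text=False):
    """Compare a resolution with the 300 DPI (or 600 DPI for small text) target."""
    target = 600 if small_text else 300
    if dpi >= target:
        return "ok"
    return f"rescan at {target} DPI or higher"


# A US Letter page (8.5 x 11 in) scanned to 2550 x 3300 pixels:
dpi = effective_dpi(2550, 3300, 8.5, 11)
print(dpi, resolution_advice(dpi))                   # 300.0 ok
print(resolution_advice(dpi, small_text=True))       # rescan at 600 DPI or higher
```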

Post-OCR Cleanup

OCR output often contains minor errors, especially with unusual fonts or low-quality scans. Review the extracted text for common mistakes like confusing '1' with 'l', '0' with 'O', and misread punctuation.
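
Part of this review can be automated. The sketch below handles only the two digit confusions mentioned above ('l' for '1' and 'O' for '0'), and only inside tokens that already contain a digit, so ordinary words are left alone. The function name is hypothetical and the rule is deliberately conservative; it will not catch every case, so a manual pass is still worthwhile.

```python
import re

# Letters that OCR commonly produces in place of digits.
_DIGIT_LOOKALIKES = str.maketrans({"l": "1", "O": "0"})

# Whole tokens made up only of digits and the two lookalike letters.
_TOKEN = re.compile(r"\b[0-9lO]+\b")


def fix_digit_lookalikes(text):
    """Replace 'l' and 'O' with '1' and '0' inside tokens that contain a digit."""
    def repl(match):
        token = match.group(0)
        # Tokens without any real digit (such as "lO") are left untouched.
        if any(ch.isdigit() for ch in token):
            return token.translate(_DIGIT_LOOKALIKES)
        return token
    return _TOKEN.sub(repl, text)


print(fix_digit_lookalikes("Invoice 2O24, total l50"))  # Invoice 2024, total 150
```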

Browser-Based OCR

Modern browser-based OCR uses WebAssembly-compiled engines (like Tesseract.js) that process documents entirely on your device. This means sensitive scanned documents never leave your computer.
