Knowledge of Optical Character Recognition (OCR)
Optical Character Recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text.
Why use OCR?
OCR is widely used to convert different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. In some professional environments (such as libraries and offices), thousands of books and documents are scanned regularly for backup and archiving. A scanner merely takes a photograph of each original page, resulting in image-based scanned documents in PDF format. The major issue with processing and storing such large volumes of scanned documents is that you cannot search for a specific phrase or name inside a file. Nor can any text be highlighted, copied, or modified, because each page is one big image rather than individual text characters.
Before performing OCR, the entire page is selected and highlighted as a single area, and no text can be searched or edited.
After performing OCR, text on the page can be selected with the selection tool, and you can easily search and edit characters, words, and paragraphs.
How do Wondershare PDF OCR tools help you?
Wondershare PDF OCR tools help you recognize text in scanned PDFs quickly and accurately, and save the recognized results in multiple editable formats.
Wondershare PDF Editor Pro for Mac: with outstanding OCR accuracy and format preservation, it enables you to search, correct, and copy text in a scanned or image-based PDF directly on a Mac. It also lets you export scanned PDFs to formatted, text-based Word, Excel, PowerPoint, EPUB, HTML, and Text formats.
Wondershare PDF Converter Pro: recognizes text in scanned PDFs with outstanding OCR accuracy and converts multiple scanned PDFs to text-based Word, Excel, PowerPoint, EPUB, HTML, and Text documents on Windows.
Wondershare PDF Converter Pro for Mac: recognizes text in scanned PDFs with outstanding OCR accuracy and converts multiple scanned PDFs to text-based Word, Excel, PowerPoint, EPUB, HTML, and Text documents on Mac.
How to improve OCR recognition quality?
OCR recognition quality depends largely on the quality of the image, which in turn depends heavily on the settings used during scanning. To get better recognition results for your scanned documents, follow these scanning tips:
Font Is Too Small
For optimal recognition results, scan documents printed in very small fonts at higher resolutions.
You can specify the desired resolution in the Resolution property of the ScanSourceSettings object.
|Source text|Recommended resolution|
|Typical text (printed in fonts of 10 pt or larger)|300 dpi|
|Text printed in smaller fonts (9 pt or smaller)|400-600 dpi|
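The reasoning behind the table can be made concrete with a little arithmetic: one point is 1/72 inch, so the number of pixels available to each character grows with both font size and scan resolution. The sketch below is only an illustration; the helper names `recommended_dpi_range` and `nominal_height_px` are hypothetical, and treating every size below 10 pt as "small" is an assumption, since the table does not cover sizes between 9 pt and 10 pt.

```python
POINTS_PER_INCH = 72  # 1 pt = 1/72 inch


def recommended_dpi_range(font_size_pt: float) -> tuple[int, int]:
    """Return the (min, max) scan resolution suggested by the table.

    Assumption: sizes between 9 pt and 10 pt are treated as small fonts.
    """
    if font_size_pt >= 10:
        return (300, 300)
    return (400, 600)


def nominal_height_px(font_size_pt: float, dpi: int) -> float:
    """Nominal font height in pixels at a given resolution.

    This is the font's point size converted to pixels, not the
    measured height of any particular letter.
    """
    return font_size_pt / POINTS_PER_INCH * dpi


for size in (12, 10, 8, 6):
    low, high = recommended_dpi_range(size)
    print(f"{size:>2} pt: scan at {low}-{high} dpi, "
          f"about {nominal_height_px(size, low):.0f} px tall at {low} dpi")
```

A 10 pt font scanned at 300 dpi gets roughly 42 pixels of nominal height, while a 6 pt font at the same resolution would get only 25, which is why smaller print benefits from 400-600 dpi.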
You may need to adjust the brightness setting when scanning in black-and-white mode. You can specify the desired brightness in the Brightness property of the ScanSourceSettings object. A medium value of around 50% should suffice in most cases.
If the resulting image contains too many "torn" or "stuck together" letters, troubleshoot using the table below.
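|Your image looks like this|Recommendations|
|Characters are clean and fully formed|This image is suitable for recognition|
|Characters are "torn" or very light|Lower the brightness to make the image darker|
|Characters are very distorted, stuck together, or filled out|Raise the brightness to make the image lighter|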
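To see why the brightness setting can produce both defects, the sketch below models black-and-white scanning as a plain cut-off applied to grayscale pixel values. This is an assumed model for illustration only, not a description of how any scanner or the ScanSourceSettings object works; the `binarize` helper and the pixel values in `ROW` are made up.

```python
def binarize(row, brightness_pct):
    """Render one row of grayscale pixels (0 = black, 255 = white) as ink/no ink.

    Assumption: a higher brightness lowers the cut-off, so fewer pixels
    are dark enough to count as ink.
    """
    cutoff = 255 * (1 - brightness_pct / 100)
    return "".join("#" if value < cutoff else "." for value in row)


# Hypothetical pixel row crossing two neighbouring strokes with a grey gap.
ROW = [250, 240, 90, 60, 100, 180, 95, 55, 110, 235, 250]

for pct in (25, 50, 75):
    print(f"{pct:>3}%  {binarize(ROW, pct)}")

# Prints:
#  25%  ..#######..   (too dark: the strokes are stuck together)
#  50%  ..###.###..   (medium: two separate strokes)
#  75%  ...#...#...   (too light: the strokes are thin and "torn")
```

In this model a medium setting keeps the two strokes apart, which matches the advice that a value of around 50% suffices in most cases.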
Poor-quality documents with "noise" (i.e. random black dots or speckles), blurred and uneven letters, or skewed lines and shifted table borders may require specific scanning settings. Faxes and newspapers are typical examples.
Poor-quality documents are best scanned in grayscale. When scanning in grayscale, the program will select the optimal brightness value automatically.
Grayscale mode retains more information about the letters in the scanned text to achieve better recognition results when recognizing documents of medium to poor quality.
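The text does not say how the program picks the optimal brightness for a grayscale scan. One well-known technique for choosing a black-and-white cut-off automatically is Otsu's method, which selects the threshold that best separates dark pixels from light ones. The following is a minimal sketch of that method, offered as an illustration rather than as the program's actual algorithm; the function name and the sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the cut-off t (0-255) that maximises between-class variance.

    Pixels with a value <= t form the dark (ink) class, the rest the
    light (paper) class. Input values must be integers from 0 to 255.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    count_dark = 0
    sum_dark = 0
    for t in range(256):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = count_dark * count_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical grayscale sample: dark ink near 50-60, paper near 200-220.
sample = [50] * 30 + [60] * 10 + [200] * 50 + [220] * 10
print(otsu_threshold(sample))
```

Any cut-off from 60 to 199 separates the two groups in this sample equally well; the sketch returns the lowest of them. A choice like this is only possible because the grayscale scan keeps the full range of pixel values, whereas a black-and-white scan fixes the cut-off at scan time.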