How to Make PDF Image Text Searchable with OCR: 3 Free Ways

If you have trouble copying text from a PDF, the cause is not usually that the document’s creator has disabled the “copy and paste” functionality. So how do you make PDF image text searchable with OCR?

To find out whether this security restriction is in place, open the document in Adobe Reader and select Properties from the File menu. Content copying must be allowed in the security settings.

If copying is allowed and you still cannot select the text, you are most likely dealing with a PDF document made up of scanned images of paper pages. Is there a way to make such a PDF searchable? In other words, how can you use the Edit, Find tool to quickly locate a word or phrase in the document?
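One way to tell whether a PDF consists only of scanned images is to try extracting its text and see whether anything comes out. The sketch below assumes the pdftotext utility (from the poppler-utils package, which this article does not otherwise use) is installed; the function names has_text and check_pdf are hypothetical helpers made up for illustration.

```shell
# Succeeds (exit status 0) when standard input contains at least one
# non-whitespace character.
has_text() {
    [ -n "$(tr -d '[:space:]')" ]
}

# Hypothetical helper: reports whether a PDF already has a text layer.
# Assumption: pdftotext (poppler-utils) is installed and on the PATH.
check_pdf() {
    if pdftotext "$1" - 2>/dev/null | has_text; then
        echo "$1: text layer found, probably already searchable"
    else
        echo "$1: no text layer found, OCR is needed"
    fi
}

# Example call (the file name is a placeholder):
#   check_pdf scanned-document.pdf
```

A PDF made only of page images typically yields nothing but whitespace, so the check reports that OCR is needed.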

Making a PDF searchable requires optical character recognition (OCR): the scanned page images are analyzed and the characters in them are recognized automatically. Adobe Acrobat is among the most effective tools for this, since it can convert a scanned PDF into a searchable PDF. Acrobat users can follow the guidelines provided by Adobe.

Is there an alternative for those who do not have access to Adobe Acrobat Standard? In a word, yes. For those on Windows, PDF-XChange Viewer is an excellent option: its built-in OCR tool converts non-searchable PDF files into searchable ones.

PDF-XChange Viewer

If you’re using Windows, you can download PDF-XChange Viewer, which includes an OCR tool that works with any PDF file. Although it relies on a proprietary (“secret”) OCR engine that is not publicly available, the program is effective at making PDFs searchable.

A PDF file processed with PDF-XChange Viewer does not become noticeably “heavier”. To try PDF-XChange Viewer, download and install the software. During installation, you may want to deselect the browser plugins and the update options.

After installing PDF-XChange Viewer and adding the files for OCR, you can turn it into a “portable” application by copying the contents of the installation folder (typically %ProgramFiles%\Tracker Software\PDF Viewer) and pasting them somewhere else.

To convert a PDF file, launch PDF-XChange Viewer and choose Scan text pages with OCR from the Document menu.

When PDF-XChange has completed its OCR process, you can save the converted document by selecting File > Save As and giving it a new name.

When you open a PDF that has been processed with PDF-XChange Viewer in Adobe Reader (or another PDF viewer), you’ll find that the text is selectable: you can copy and paste it, and you can search for any term or phrase inside the PDF.

Before conversion, the file contains no selectable text when opened in Adobe Reader. After it has been OCR-processed with PDF-XChange Viewer, copying and pasting from the document in Adobe Reader works as expected.

If, however, Adobe Reader cannot find terms in a document converted with PDF-XChange Viewer and instead reports that it has finished searching the document and that no matches were found, the conversion may have confused Reader’s search function. You can fix the problem by deleting everything in the folder C:\Users\USERNAME\AppData\LocalLow\Adobe\Acrobat\11.0\Search (it may contain numerous files with the .idx extension).

The issue originates in that folder, where Adobe Reader keeps a database-like index of your most frequently used PDFs. Reader may become confused if the PDF converted by PDF-XChange Viewer has the same ID as the original PDF file. If Adobe Reader’s Find function (Edit, Find, or CTRL+F) stops working, deleting the contents of the Search folder restores it.

Tesseract OCR, Google, and Google Drive

Google revived the Tesseract OCR project in 2006 and has since used it in several of its web-based services, including Google Drive (see the related articles “Google revives Tesseract OCR”, “Google Docs: the improvements applied to the OCR functionality”, and “Google Drive on Android: scanning of documents and OCR”).

The problem is that, in our view, Google’s free OCR technology is not quite ready for prime time. You can upload PDFs to Google Drive, but the “Convert text from PDF file or image file to Google Docs format” option only adds a new page containing the recognized text, which does not preserve the layout of the original document.

Moreover, as of this article’s publication, Google Drive cannot perform character recognition and then add the text layer required to turn a PDF into a searchable PDF.

Scanning and recognizing text with Linux

pdfocr is a script written by programmer Geza Kovacs (and presented in a discussion thread) that converts a PDF into a searchable PDF. We tested the Ruby-based script on the Linux Mint operating system.

To install the script on your Linux machine, open a terminal and enter the following commands:

command:

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

The pdfocr script should now be installed in the /usr/bin/ folder. You can confirm this by listing the files of the package with the following command:

command:

dpkg -L pdfocr

Then back up the installed script and replace it with the version downloaded from techportal.it:

command:

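# Keep a backup copy of the original script
sudo mv /usr/bin/pdfocr.rb /usr/bin/pdfocr.ori
# Download the replacement script and move it into place
sudo wget http://www.techportal.it/dl/pdfocr.txt
sudo mv ./pdfocr.txt /usr/bin/pdfocr.rb
# Make the script executable
sudo chmod +x /usr/bin/pdfocr.rb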

pdfocr uses the Tesseract OCR engine by default.

To recognize the characters with the Cuneiform or Ocropus engine instead, add the -c or -p switch, respectively.

The following terminal commands install any packages that may still be missing:

Commands:

sudo apt-get install tesseract-ocr
sudo apt-get install cuneiform
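Before running a conversion, it can help to confirm that the required programs are actually on the PATH. The following is a minimal sketch of such a check; the function name missing_tools is a hypothetical helper, and the command names in the example call are assumptions based on the packages and the script path mentioned above.

```shell
# Prints the name of every command given as an argument that cannot be
# found on the PATH; prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (command names are assumptions, adjust as needed):
#   missing_tools tesseract cuneiform pdfocr.rb
```

If the call prints nothing, every listed command was found; otherwise, install the package that provides each name it prints.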

While the pdfocr-based technique isn’t flawless, it does convert a PDF into a searchable PDF, making it easier to find specific information.
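As a rough illustration, a batch conversion with pdfocr could be sketched as follows. It assumes that the script accepts -i for the input file and -o for the output file and that it is installed as pdfocr.rb, as in the steps above; please check the script’s own help output before relying on this. The function names and the “-ocr” suffix are hypothetical choices made for this example.

```shell
# Command used to run pdfocr (assumption: installed as pdfocr.rb).
PDFOCR="${PDFOCR:-pdfocr.rb}"

# Derives the output name for a given input: scan.pdf -> scan-ocr.pdf
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Converts every PDF in the current directory, skipping files that are
# already conversion results. Extra arguments (for example -c to select
# the Cuneiform engine) are passed through to pdfocr unchanged.
ocr_all() {
    for pdf in ./*.pdf; do
        [ -e "$pdf" ] || continue
        case "$pdf" in *-ocr.pdf) continue ;; esac
        "$PDFOCR" -i "$pdf" -o "$(ocr_output_name "$pdf")" "$@"
    done
}

# Example calls:
#   ocr_all        # default engine
#   ocr_all -c     # Cuneiform engine
```

Writing each result to a new file leaves the original scans untouched, so a page that is recognized poorly can be converted again with a different engine.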

To test the procedure, you can set up a virtual machine running Linux Mint or another Ubuntu-based distribution (created, for example, with VirtualBox). To move PDF files between your Windows and Linux systems, follow the steps outlined in the article “Accessing Linux partitions from Windows: sharing folders and file systems”.