Getting Optimal OCR Results from Sub-optimal Image-based Documents
One of the biggest challenges with document automation is the document itself. Organizations are often asked to process image-based documents that are generations removed from when the ink was first put to paper. These documents may have been printed, faxed, scanned, then re-printed and scanned again. The result of this convoluted lifecycle is a noisy, often unreadable image.
There’s a saying in the OCR industry, “garbage in, garbage out”. In short, it is very challenging to get accurate OCR output from documents that even humans struggle to read. Thankfully, due to advancements in computer vision, there are now solutions for these challenges.
Adaptive noise removal algorithms are the more intelligent grandchildren of the “despeckling” algorithms used previously. These algorithms no longer blindly remove “specks” over a given size from a document, but rather automatically detect the “speckiness” of a page and adjust on the fly. The result, as seen below, is the ability to remove extreme noise from some parts of a page without destroying content elsewhere.
In the example below, the original document likely had a background color behind the column of text on the left before it was crudely converted to black and white. This image, in its current form, is not suitable for consumption by an OCR process.
The image below shows what an adaptive noise removal algorithm can do with this input. The background shading was completely removed from the left column, while the right column was processed separately with an erosion technique that makes the text easier to read. Now that the document has been cleaned, OCR operations will be more successful.
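To make the idea concrete, here is a minimal sketch of region-adaptive despeckling in plain Python. It is an illustration only, not the algorithm used by any particular product: the tile size, the "tiny blob" ratio used to judge how noisy a tile is, and the two size limits are all assumptions chosen for the example. The image is a list of rows where 1 means ink and 0 means background.

```python
from collections import deque

def connected_components(img):
    """Return every 8-connected blob of ink as a list of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            queue, comp = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(comp)
    return comps

def adaptive_despeckle(img, tile=8, tiny=2, noisy_ratio=0.5,
                       gentle=1, aggressive=4):
    """Remove small blobs, using a larger size limit only in noisy tiles.

    All parameters are illustrative assumptions:
      tile        - side length of the square regions judged independently
      tiny        - blobs of this area or less count as likely specks
      noisy_ratio - share of tiny blobs that marks a tile as noisy
      gentle      - largest blob removed in a clean tile
      aggressive  - largest blob removed in a noisy tile
    """
    # Assign each blob to the tile that contains its first-found pixel.
    by_tile = {}
    for comp in connected_components(img):
        y, x = comp[0]
        by_tile.setdefault((y // tile, x // tile), []).append(comp)

    out = [row[:] for row in img]  # leave the input untouched
    for tile_comps in by_tile.values():
        tiny_count = sum(1 for comp in tile_comps if len(comp) <= tiny)
        ratio = tiny_count / len(tile_comps)
        limit = aggressive if ratio >= noisy_ratio else gentle
        for comp in tile_comps:
            if len(comp) <= limit:
                for y, x in comp:
                    out[y][x] = 0
    return out
```

The design point is that the size limit is chosen per tile. A fixed limit large enough to clean a noisy region would also erase small but legitimate marks, such as periods and the dots on letters, in a clean region of the same page.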
Another challenge facing the document processing industry today is sitting in your pocket. Mobile devices are quickly becoming the capture device of choice for many people. These smartphone photographs present a whole new set of problems that weren’t faced with traditional scanned images. Often, the documents being photographed aren’t flat or the camera isn’t in the correct position above the page to capture a good image. There are also issues with shadows being cast on the page from the phone or photographer.
Once again, computer vision comes to the rescue. With the use of 3D deskewing algorithms, software can correct the issues listed above. 3D deskewing can straighten out the lines of text in an image. This is a bit more advanced than traditional 2D deskewing, as it can deal with curved content as well as simple rotations. These algorithms can also correct for the perspective distortion (sometimes loosely called parallax) often found in camera images. Finally, 3D deskewing is often combined with advanced binarization algorithms that can deal with the localized shading caused by shadows. Below is an example of 3D deskewing in action.
The paper in the photograph below was not laid on a flat surface when the image was captured. You can see how the text bows across the page. Since OCR engines look for properly oriented characters, images like this will lead to sub-optimal results. You can also see a shadow cast across the page by the photographer. These shadows often confuse the algorithms used to convert color images to black and white for recognition.
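Why shadows cause trouble is easy to show with a small sketch. A single global threshold treats dark paper in a shadow as ink, while a threshold computed from each pixel's neighborhood does not. The code below is a simplified local-mean threshold in the spirit of well-known adaptive methods; it is not the binarization used by any specific product, and the window size and the 15% margin `t` are assumed values for illustration. The input is a grayscale image as a list of rows with values from 0 (black) to 255 (white).

```python
def global_binarize(gray, threshold=128):
    """Mark a pixel as ink (1) when it is darker than one fixed threshold."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

def adaptive_binarize(gray, window=5, t=0.15):
    """Mark a pixel as ink when it is clearly darker than its local mean.

    window and t are illustrative assumptions: the neighborhood is a
    window x window square (clipped at the page edges), and a pixel must be
    more than t (15%) below the neighborhood mean to count as ink.
    """
    h, w = len(gray), len(gray[0])

    # Integral image: integral[y][x] holds the sum of gray[0:y][0:x],
    # so any rectangle sum costs four lookups.
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            count = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # Same test as gray[y][x] < mean * (1 - t), without dividing.
            if gray[y][x] * count < total * (1 - t):
                out[y][x] = 1
    return out
```

On a page whose brightness falls off gradually toward one side, as it does under a soft shadow, `global_binarize` turns the darkest strip of paper solid black, whereas `adaptive_binarize` keeps only the pixels that are dark relative to their surroundings.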
After applying 3D deskewing, you can see that the distortion has been removed. All the text lines are now straight, which will significantly improve OCR accuracy. The binarization process used with the 3D deskewing was also able to remove the shadows without any negative effect on the text.
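The perspective part of this correction can be sketched with a four-point homography: given the four corners of the page as they appear in the photograph, compute the 3x3 matrix that maps them onto an upright rectangle. This is only the flat-page case and is an assumption-laden illustration. It does not model page curl, which full 3D deskewing also handles, and the function names here are hypothetical rather than part of any product API.

```python
def solve_linear(a, b):
    """Solve the square system a * x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def find_homography(src, dst):
    """Return the 3x3 matrix (bottom-right entry fixed at 1) that maps
    four source points onto four destination points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), same form for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def warp_point(h, x, y):
    """Apply the homography to one point."""
    d = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / d,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / d)

# Example: page corners as seen in a photo taken at an angle (top edge
# appears narrower), mapped to an upright 100 x 150 rectangle.
photo_corners = [(20, 10), (80, 10), (100, 90), (0, 90)]
page_corners = [(0, 0), (100, 0), (100, 150), (0, 150)]
H = find_homography(photo_corners, page_corners)
```

In practice the inverse mapping is applied to every output pixel, with interpolation, to resample the corrected page. Finding the four corners automatically is a separate detection problem that this sketch leaves out.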
Getting exceptional OCR accuracy is both an art and a science. Many developers are not, and do not want to be, OCR or document preprocessing experts. Organizations should look to companies that provide not only the tools required to handle the types of documents described above, but also the expertise to use them to their maximum effect. Recently, one of our partners struggled to extract the data from a very challenging identification card. Using the algorithms described here, we were able to help them reduce the error rate by 75%.
For more information on OmniPage and, more importantly, how the OmniPage presales team can assist in addressing your OCR needs, please visit OCR.com.