Can a Good OCR Engine Save the Day for the DOJ?
Recently, the U.S. Department of Justice (DOJ) released the 400+ page Mueller Report (140MB PDF document). If you are interested in the workings of the U.S. government, you may have downloaded it, expecting to be able to educate yourself. Unfortunately, your attempts at searching for specific topics with the trusty Control-F key sequence ended in utter frustration. You may have just given up and looked for a summary write-up, but others may not have that luxury.
In releasing this document, some critical requirements were missed. Ideally, the report should have been released as a native text PDF document, making it easily searchable and efficiently sized. The DOJ’s PDF, however, is an oversized image-based PDF that’s not searchable and not easily converted into a searchable document using most off-the-shelf OCR engines. In fact, the New York Times reported that it took 22 staffers hours to convert the pages into a searchable format.
So why would the DOJ redact sensitive information from the report in a document editor, print it, and then scan the whole thing back in as RGB color images to finally save in the PDF format? Perhaps no one knew any better, but we suspect that they were overly cautious due to some recent data breaches involving document redactions that weren’t true redactions.
One high-profile incident involved the Transportation Security Administration (TSA), where sensitive information was visually redacted, but not actually removed from the document. While the information appeared to be gone because it could not be seen, the sensitive data lived on in the document’s text layer. This report on the incident by the Department of Homeland Security provides more detail, and most likely led the DOJ to “take no chances” with the Mueller report by converting it into an image-based PDF.
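The difference between a visual redaction and a true one can be illustrated with a toy model. The sketch below is not a real PDF library: `Box`, `TextSpan`, `Page`, `visual_redact`, and `true_redact` are hypothetical names, and a page is modeled simply as a list of positioned text spans plus a list of black boxes drawn over them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical model)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap unless one lies entirely beside,
        # above, or below the other.
        return not (self.x1 <= other.x0 or other.x1 <= self.x0
                    or self.y1 <= other.y0 or other.y1 <= self.y0)


@dataclass(frozen=True)
class TextSpan:
    text: str
    box: Box


@dataclass
class Page:
    spans: list = field(default_factory=list)        # the text layer
    black_boxes: list = field(default_factory=list)  # drawn over the text

    def extract_text(self) -> str:
        # Search and copy/paste read the text layer only; they ignore
        # whatever is drawn on top of it.
        return " ".join(s.text for s in self.spans)


def visual_redact(page: Page, area: Box) -> None:
    """Draw a black box over the area; the text layer is left untouched."""
    page.black_boxes.append(area)


def true_redact(page: Page, area: Box) -> None:
    """Draw the box and also delete every text span it covers."""
    page.black_boxes.append(area)
    page.spans = [s for s in page.spans if not s.box.overlaps(area)]


def sample_page() -> Page:
    # Invented content, for illustration only.
    return Page(spans=[
        TextSpan("Screening", Box(0, 0, 60, 10)),
        TextSpan("procedure:", Box(65, 0, 125, 10)),
        TextSpan("SECRET-DETAIL", Box(130, 0, 220, 10)),
    ])


leaky = sample_page()
visual_redact(leaky, Box(128, 0, 222, 10))
print(leaky.extract_text())  # the hidden words are still extractable

safe = sample_page()
true_redact(safe, Box(128, 0, 222, 10))
print(safe.extract_text())   # only the unredacted words remain
```

Rasterizing every page, as the DOJ appears to have done, sidesteps the problem by discarding the text layer altogether, at the cost of searchability and file size.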
Banning the Bloat
A quick word about bloat: this image-based PDF file weighs in at a hefty 140 megabytes (MB). You might be surprised to learn that there were no attempts to optimize the file, which by all rights should have been a much smaller, properly redacted PDF. While storage is becoming cheaper by the day, network and internet bandwidth haven’t become as cost-effective.
OCR to the Rescue
Following the release of the report, multiple bloggers explained how to convert the image-based PDF into a searchable PDF. With a high-quality OCR engine, this conversion process is accurate, even with redactions. Plus, there are PDF conversion solutions that allow users to not only convert, but also compress and optimize image-based PDFs to unlock their true potential.
Here are the results we got when we gave it a shot:
- We converted the 440-page image-only report to a searchable PDF in under 10 minutes, compared to the hours spent by an army of staffers at the NY Times.
- Next, we reduced the file size from 140 MB to 35 MB, one quarter of its original size. Redaction mark color-coding and photo image quality were retained (see image below).
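To put those file sizes in terms of bandwidth, here is a back-of-the-envelope calculation. The sizes and page count come from the figures above; the 10 Mbit/s link speed is purely an assumption for illustration, and the transfer time ignores protocol overhead.

```python
ORIGINAL_MB = 140    # released PDF, as stated in the article
OPTIMIZED_MB = 35    # after conversion and compression, as stated
PAGES = 440          # page count used in the article
ASSUMED_LINK_MBIT_S = 10.0  # hypothetical connection speed


def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Ideal transfer time: megabytes to megabits, divided by link speed."""
    return size_mb * 8 / link_mbit_per_s


ratio = OPTIMIZED_MB / ORIGINAL_MB
kb_per_page_before = ORIGINAL_MB * 1000 / PAGES
kb_per_page_after = OPTIMIZED_MB * 1000 / PAGES

print(f"size ratio: {ratio:.2f}")
print(f"per page: {kb_per_page_before:.0f} KB -> {kb_per_page_after:.0f} KB")
print(f"download: {transfer_seconds(ORIGINAL_MB, ASSUMED_LINK_MBIT_S):.0f} s"
      f" -> {transfer_seconds(OPTIMIZED_MB, ASSUMED_LINK_MBIT_S):.0f} s")
```

Under that assumed link speed, each download drops from roughly two minutes to about half a minute, a saving that is multiplied by every reader who fetches the file.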
Our PDF solution allows users to create true redactions, even utilizing U.S. Government FOIA and Privacy Act codes (or your own code set), by searching the document for sensitive keywords.
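As a rough illustration of keyword-driven redaction with code labels, consider the sketch below. It operates on plain text rather than a PDF and is not the OmniPage API: `REDACTION_RULES` and `redact_keywords` are hypothetical names, and the keywords are invented. The labels follow the style of FOIA exemption codes such as (b)(6).

```python
import re

# Hypothetical keyword-to-code mapping; the keywords are invented examples.
REDACTION_RULES = {
    "Jane Doe": "(b)(6)",
    "ongoing matter": "(b)(7)(A)",
}


def redact_keywords(text: str, rules: dict) -> tuple:
    """Replace each keyword with its bracketed code label.

    Returns the redacted text and a log of (keyword, code, count) entries.
    The matched words are removed from the text, not merely covered.
    """
    log = []
    for keyword, code in rules.items():
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        label = f"[{code}]"
        # A function replacement avoids any escape processing of the label.
        text, count = pattern.subn(lambda _match: label, text)
        if count:
            log.append((keyword, code, count))
    return text, log


sample = "Jane Doe was interviewed about an ongoing matter. JANE DOE declined."
redacted, log = redact_keywords(sample, REDACTION_RULES)
print(redacted)
print(log)
```

Keeping a log of what was replaced, and under which code, gives reviewers an audit trail without the sensitive words ever reaching the published file.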
The moral of the story: there are elegant and easily accessible solutions for organizations creating and sharing sensitive information.
Avoid the missteps of the DOJ:
- If you are building a solution that would benefit from OCR, redaction and compression, learn more about our OmniPage SDK.
- Looking for desktop solutions? Learn more.