I am a productivity and automation freak, and in this article, I will show you how to encrypt a PDF and redact sensitive information using a combination of Automator, AppleScript, and PDFpenPro. I will also show you how to apply Optical Character Recognition (OCR) and how to detect if a PDF already has an OCR layer using AppleScript.
My usual data processing workflow starts with scanning every document that lands on my desk. If I receive a document electronically, I skip the scanning and go straight to applying OCR, provided the PDF doesn’t already have an OCR layer. I use an automation tool called Hazel to help with many of the tasks I describe below. OCR is necessary because it makes the PDF searchable, and a searchable PDF is a prerequisite for redacting sensitive information.
Using PDFpen to apply OCR
Most of my Hazel rules look for certain keywords inside the scanned PDF documents to identify what each document is about. For Hazel to do that, PDFpen first has to apply OCR to the document, which makes its contents readable and searchable.
The cool thing about PDFpen is that you don’t need Hazel or AppleScript to do that. All it takes for PDFpen to do an OCR analysis is to open the PDF document and double-click on any word inside the document. Alternatively, you can use the Edit –> OCR Page… menu command. PDFpen then immediately analyzes the PDF and adds a so-called OCR layer.
Tip: When you save an email or file as a PDF on a Mac (via the File –> Print –> PDF command), the resulting PDF already contains searchable text, so you don’t need to apply OCR.
In the screenshot below, I applied OCR to an image file (JPG) and asked PDFpen to show me the OCR layer, or the text it recognized, via View –> OCR Layer (or Command + Shift + O). Once a document has an OCR layer, you can search for its contents using Spotlight, which makes it much easier to find stuff when you don’t recall the file name or its storage location.
AppleScript to detect OCR layer
For a while, I have struggled to find a simple solution that would allow me to detect if OCR has already been applied to a given PDF document. I finally figured it out and wanted to share the solution with you.
I have a Hazel rule that processes newly scanned PDF documents and applies Optical Character Recognition (OCR) via AppleScript using PDFpenPro. Unfortunately, I noticed that some documents were blank after OCR was applied. After some trial and error, I found out that applying OCR twice may corrupt the PDF. That can happen, for example, if the source PDF already has an OCR layer and PDFpenPro then applies OCR again. So I looked for a way to detect whether a document already has an OCR layer before calling PDFpenPro from within my Hazel rule.
My original AppleScript to apply OCR via Hazel and PDFpenPro looked like this:
You can download the script here.
A few extra lines of code check whether the raw document contains the string “BaseFont,” which can typically only be found if the document has an OCR layer. I’m using a shell script and grep to search for the string. grep exits with return code 1 when it can’t find the string, and because AppleScript treats any non-zero return code from a shell script as an error, I had to wrap the call inside a “try” and “on error” block.
The resulting script looks like this:
You can download the script here.
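If you don’t use AppleScript, the same “BaseFont” check can be sketched in a few lines of Python. This is a minimal sketch, not the script from this article: the function names has_text_layer and needs_ocr are my own, and the check shares the limitation of the grep approach, because it only looks at the raw bytes and can miss a marker that sits inside a compressed part of the PDF.

```python
from pathlib import Path

# Marker searched for in the raw file, as described in the article.
MARKER = b"BaseFont"


def has_text_layer(pdf_path):
    """Return True if the raw PDF bytes contain the BaseFont marker."""
    return MARKER in Path(pdf_path).read_bytes()


def needs_ocr(pdf_path):
    """Return True if OCR should still be applied (hypothetical helper)."""
    return not has_text_layer(pdf_path)
```

Because Python’s “in” operator simply returns True or False, there is no error return code to catch, unlike the grep call.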
Even if you don’t use Hazel, you can easily change the script to fit your needs. The full Hazel rule that processes scanned PDF documents looks like this:
How to redact sensitive information in a PDF
Encrypting sensitive data is one of the best ways to make sure your information stays secure. But encryption comes at a cost! Encrypted PDF files are no longer searchable via Spotlight, and opening them requires a password. I would have to type in a password every time I opened an encrypted file because Keychain in macOS cannot be used to retrieve the password.
Smart Folders use search to automatically gather files by type and subject matter. Smart Folders are updated as you change, add, and remove files on your Mac.
So I decided to redact documents that contained very sensitive information such as my Social Security Number (SSN). I created a Smart Folder that listed all files containing the following numeric patterns, matching my actual SSN:
Then I let an Automator workflow redact those patterns from all files in the Smart Folder using AppleScript and PDFpenPro. In the workflow below, I just dragged and dropped all files listed in the Smart Folder into the “Get Specified Finder Items” action.
You can download the script here.
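To illustrate what matching several notations of the same number can look like, here is a hypothetical sketch in Python. It is not the set of patterns I used in the Smart Folder or in the workflow. It assumes the number may be written with dashes, with spaces, or without separators, and it uses the placeholder 123-45-6789 rather than a real SSN. The function names ssn_pattern and find_ssn are made up for this example.

```python
import re


def ssn_pattern(ssn_digits):
    """Build a regex for one 9-digit number written with '-', ' ', or no separator."""
    if not re.fullmatch(r"\d{9}", ssn_digits):
        raise ValueError("expected exactly nine digits")
    area, group, serial = ssn_digits[:3], ssn_digits[3:5], ssn_digits[5:]
    # The lookarounds stop a match from starting or ending inside a longer number.
    return re.compile(rf"(?<!\d){area}[- ]?{group}[- ]?{serial}(?!\d)")


def find_ssn(text, ssn_digits):
    """Return every notation of the number found in the text."""
    return [match.group(0) for match in ssn_pattern(ssn_digits).finditer(text)]
```

For example, find_ssn("SSN: 123-45-6789", "123456789") returns the dashed notation as its only match.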
The problem with redacting text is that it relies on Optical Character Recognition (OCR) and a perfect OCR layer. For handwritten or otherwise hard-to-read documents, OCR has its limits. After looking at the results, I noticed that my workflow had missed a couple of occurrences of my Social Security Number in some of the files. Either OCR had not recognized my SSN correctly, or an offset between the OCR layer and the scanned document had caused the wrong part of the document to be redacted.
The need for data classification
As a result, I reconsidered just encrypting all files. But before I did that, I took the time to go through all files I had tagged as “Sensitive” and re-classify them as either “Strictly Confidential” or “Confidential.” I did that because I wanted to make sure that I only encrypted what needed the highest level of protection (files tagged as Strictly Confidential). The remaining files I would leave tagged (classified) but unencrypted. Ultimately, I tagged as Strictly Confidential only those files containing information that could lead to identity theft or that would completely expose my financial situation. That included files such as:
- Tax returns
- Mortgage contracts
- Anything else containing my full SSN
Re-classifying my data took a couple of hours, but it gave me the opportunity to start with a clean slate. To make sure it stayed that way, I updated my Hazel rules to apply classification tags to scanned data automatically.
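The classification logic itself is simple enough to sketch in Python. This is only an illustration and not one of my Hazel rules: the keyword list, the function name classify, and the assumption that the document has already been judged sensitive are all hypothetical.

```python
import re

# Hypothetical keywords, based on the examples listed above.
STRICT_KEYWORDS = ("tax return", "mortgage")


def classify(text, ssn_digits=None):
    """Return a classification tag for a document already considered sensitive."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in STRICT_KEYWORDS):
        return "Strictly Confidential"
    if ssn_digits:
        # Drop dashes and spaces between digits so "123-45-6789" matches "123456789".
        normalized = re.sub(r"(?<=\d)[- ](?=\d)", "", text)
        if ssn_digits in normalized:
            return "Strictly Confidential"
    return "Confidential"
```

A rule like this only sees what OCR recognized, so it has the same blind spots as the redaction workflow.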
How to encrypt a PDF
To encrypt PDF files without using a hard-coded password set inside the workflow, I decided to use sample code I found on scrubbs.me. This approach would use a password that I can store in Keychain to encrypt the data. For this solution to work, you have to open the Keychain Access app and create a new “Password Item.” Call it whatever you want to but make sure you use the same name in the Automator service (or workflow). I used “SuperSensitivePDFs” as the name of the Password Item in the example below.
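To give you an idea of how a script can read such a Password Item, here is a minimal Python sketch that calls the security command-line tool that ships with macOS. This is not the sample code from scrubbs.me. It assumes that the name you entered in Keychain Access is stored as the item’s label (hence the -l option), and the function names are made up for this example. macOS may ask for permission before it hands over the password.

```python
import subprocess


def keychain_lookup_command(item_name):
    """Build the 'security' call that prints only the password (-w) of an item."""
    return ["security", "find-generic-password", "-l", item_name, "-w"]


def read_keychain_password(item_name, run=subprocess.run):
    """Return the password stored under item_name; raises if the lookup fails."""
    result = run(
        keychain_lookup_command(item_name),
        capture_output=True,
        text=True,
        check=True,
    )
    # The tool prints the password followed by a newline.
    return result.stdout.rstrip("\n")
```

On a Mac, read_keychain_password("SuperSensitivePDFs") would return the stored password without it ever appearing in the workflow itself.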
I first created an Automator workflow to batch-encrypt all files tagged as Strictly Confidential, and then I converted the workflow into a Service for future use, so I can just right-click a PDF file in Finder to encrypt it.
As a result, my most sensitive information is now encrypted using a strong encryption algorithm (AES 256) and a strong, randomly generated password. To make sure I don’t lose my encryption password, I also saved it in 1Password.
How to encrypt a PDF and redact sensitive information automatically
Redacting or encrypting confidential information is important to protect your personal information, especially if you store your data with cloud storage providers that may get hacked or compromised. I like to automate as many manual tasks as I can. As a result, I am using tools such as Hazel, PDFpenPro, and Automator as part of my data processing workflow.
How do you protect your information and what automation tools do you use? Let me know by leaving a comment below!
By Michael Kummer