Converting an HTML document to PDF doesn't have to be difficult. In fact, it's pretty easy if you're building that functionality into a Python app.
Invoking Headless Chrome to Print to PDF
By far the easiest method for printing HTML to PDF programmatically is to use a headless Chrome browser.
If you haven't played with headless Chrome before, it's basically the Chrome you're used to, but launched without a GUI. Typically the browser is controlled programmatically. This is commonly done to perform automated testing, or to scrape websites. In our case, we'll instruct Chrome to print a document to PDF and then immediately exit.
There are other tools for HTML conversion, such as wkhtmltopdf, but in my experience I've found Chrome to be the easiest to work with.
Because Chrome will do the heavy lifting, we only need to worry about invoking it properly, and making sure files are where the browser expects them. The following is a simple utility function for invoking the Chrome browser in headless mode.
import subprocess
import tempfile
def html_to_pdf(html_file_path):
pdf_file = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
pdf_file.close()
command = [
"google-chrome-stable",
"--headless",
"--no-sandbox",
"--disable-gpu",
"--print-to-pdf=" + pdf_file.name,
html_file_path,
]
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
process.communicate()
return pdf_file_path
Dedicated CSS for the Printed Version
Making sure your content fits onto the PDF and isn't clipped off is one potential stumbling block.
We can target CSS specifically for the printed page using the @page rule. You might have to experiment here, but the size property can be adjusted to control the page width and height.
Here is an example of using the @page rule.
@page {
size: 13in 6.5in;
}
How content wraps from one page to the next is also an issue unique to the printed PDF version.
As an example, if you have a list of products and you'd like to make sure each page has a single product, you can use the break-before property to accomplish this.
.product {
break-before: page
}
You can also do the opposite if you want to ensure that some content is never broken across pages. You might have a block quote, for example, that should always be displayed on one page. The break-inside property controls this behavior.
.blockquote {
break-inside: avoid;
}
Bonus Points: Dockerizing the Process
What we've done so far works fine for a one-off script. But what if we want to incorporate this into a deployed application? The biggest problem is ensuring that Chrome is available in the application environment.
Assuming we're working with Docker, we can write a Dockerfile that will take care of installing Chrome and its dependencies.
FROM python:3.9
ENV DEBIAN_FRONTEND=noninteractive
# Install Chrome and its dependencies
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
wget \
ca-certificates \
apt-transport-https \
gnupg \
&& wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& echo "deb [arch=amd64] https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google.list \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
google-chrome-stable \
# Cleanup
&& apt-get purge --autoremove -y wget \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /etc/apt/sources.list.d/google.list
You can combine this Dockerfile with whatever extra dependencies you have, such as Python packages installed through a requirements.txt file. We explicitly leave out the CMD argument so that the container can be launched with whatever command is necessary. Chrome will be available within the container, and the html_to_pdf function should work properly.
If you're incorporating this conversion into a Django app, you might want to consider doing the conversion as part of a Celery task.
One or two page documents can be converted very quickly, but long documents may cause too much burden for your web server and are better off-loaded to a Celery worker.