In this post I am going to document the steps I took to implement a fully
automated deployment of my blog using GitHub Actions and GitHub Pages.

As always, I started my journey with the definition of what I really wanted to
get at the end:

The website is published on GitHub pages

Since the website is static and all of its content can be easily downloaded
using a web crawler (like wget --mirror https://website.tld) I was OK
with exposing the structure in the public repository, which is what GitHub
offers on a free plan.

The code to generate the website should be private

I do a lot of work on the SSG (which is Pelican in my case) itself: extend
it with plug-ins that may contain API tokens to reach out to some third
party APIs, hack the core code when I want to quickly test stuff, etc. –
so I really did not have any desire to publish publicly all the commotion
I create in the background (sometimes I make more than a hundred commits per
day just to experiment with different ideas).

There should be a valid history of changes in both repositories

Well, I get the history in my private repository for free, since that is
the core value of maintaining a repository in a VCS, but I also wanted a
clean history of changes to the content I publish publicly.

It would be a pleasant bonus if the changes in the public repository could
refer back to the corresponding commit in the private repository.

One may say that to do what I set out to do I would need to subscribe to a
paid plan with GitHub, since according to their help page GitHub Pages for
private repositories is only available on the paid plans.

However, as I pointed out above, it does not make sense to hide the content
of the actual static website, hence all I needed was a way to “publish” the
resulting artefact to the GitHub Pages repository, and, preferably, that
“publishing” should happen on GitHub’s side.

Luckily for me, GitHub started supporting GitHub Actions on the free plan
some time ago, and as long as the feature is not abused according to their
terms and conditions, it is, in my opinion, a perfect vehicle for what I am
trying to do.

Setting up GitHub Pages

There are multiple howtos and tutorials on the Internet about setting GitHub
Pages up, including the official help section on this topic, so I will only
elaborate on the details where I did something specific for the purposes of
achieving my goals.

There are different types of GitHub Pages:

user or organisation

per-project

The difference between the two is subtle (the former requires a dedicated
repository for your website, while the latter allows you to keep it in a
branch of an existing repository), but for the purposes of this article I am
assuming that we are working with the user-level GitHub Pages which reside
in the repository named “<username>.github.io” (where <username> is your
GitHub user name), as per the official documentation.

A few caveats I found (and spent some time solving) after following the
official documentation are listed below:

GitHub’s documentation assumes use of Jekyll for site generation.

It is not obvious how to use a different SSG (like Pelican). As far as I
understand, there are multiple triggers for GitHub to consider the website
to be in the “published” state, so just ignore any references to Jekyll in
the documentation: you will trip one of the triggers sooner or later, for
example by pushing HTML files into your repository.

Configure your DNS before setting the custom domain name in GitHub Pages.

Pushing the CNAME file containing the name of your custom domain will
trigger a DNS check from GitHub to verify that your custom domain name is
pointing back to GitHub Pages.

DNS heavily relies on caching, so depending on the TTL settings in your
zone, if a negative check is performed (that is, when GitHub fails to
retrieve the corresponding record) you will likely need to wait quite a
while before GitHub retries.

Setting up the CNAME record in advance and then verifying it with a query
ensures that you will get the quickest validation response from GitHub.
For example, I set up my CNAME record and then verified it from the command
line before submitting the request to GitHub:

[user@localhost ~]$ host -t cname dmitry.khlebnikov.net 8.8.8.8
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:

dmitry.khlebnikov.net is an alias for galaxy4public.github.io.

There are some shenanigans with the “Enforce HTTPS” option.

It is not obvious from the documentation, but the enforcement of HTTPS for
custom domains on GitHub’s side depends on several things:

before the checkbox can be enabled, your custom domain name should be
confirmed by GitHub (your CNAME file is in place and the repository
settings show that the name was recognised);

the CNAME record should point to your “<username>.github.io.”
DNS record (or, you can point it directly to the GitHub Pages IP addresses
if you want to conceal the repository name in the DNS output);

if GitHub did not like something and you then adjusted anything in the
dot points above, the only way to trigger the enforcement of HTTPS is to
re-submit the CNAME file to the repository (yes, you read that right:
you need to delete the file and push it to the repository again);

Removing the CNAME file from the repository is a disruptive action – the
site will not be accessible for as long as the file is missing.

OK, you have your public repository configured the way you want, so let’s look
at the settings we need to be able to publish our code to this public
repository.

When I try to automate something, I usually start with writing down manual
steps I would do to achieve the results. This helps me to see patterns and to
understand what I can easily automate and what will require some brain-storming
to resolve.

In the case of updating the repository it is quite trivial: if I were to
push updates manually, all I would need is a private SSH key whose public
counterpart is configured with write privileges for the repository, and I
could push with git push from my local copy of the repository.

… private keys are called “private” for a reason – they are not supposed
to leave the device under any circumstances. […] please pay attention when
you read or hear somebody advising you to upload your private keys
somewhere, it is usually bad advice.

My private keys are called “private” for a reason – they are not supposed to
leave my device(s) under any circumstances (except for backup purposes such as
storing them in a safe). So, please pay attention when you read or hear
somebody advising you to upload your private keys somewhere, it is usually
bad advice.

For integration purposes, GitHub provides so-called “Deploy keys” and
“Personal access tokens”. The former is just an SSH key pair associated
with a particular repository (you can configure it in the repository’s
settings) while the latter is an OAuth access token associated with your
account.

While you can successfully use both, I would recommend using “Deploy keys”
only: even though you can try to scope a personal token down, the scoping
is not granular enough, and the actions performed using that token will
look as if you executed them yourself.

Here, I chose the ed25519 key type since it is the shortest of the key
types GitHub supports at the moment, yet it is strong enough.

I also made the key pair passphrase-less (-N '') since the purpose of the
key pair is to automate things in an unattended fashion and there will be
no one around to type in the passphrase.

The key pair comment just makes it easier to maintain your keys, but it is
optional.

Finally, the -f ~/gh-action option specifies where the generated private
key is going to be stored. The public counterpart will use the same path with
the .pub suffix appended to it.

Set the newly generated public key up as the “Deploy key”:

All you need to do is to go to the repository settings for the public
repository you created for GitHub Pages, click on “Deploy keys” in the
left side menu, then click on the “Add deploy key” button in the upper
right corner.

On the next page, provide a sensible description for the deploy key (I
used the same text as I put into the key’s comment, i.e. “Updating the blog
from GH Action”) and copy and paste the recently generated public key.
GitHub does not allow you to upload files there, so you need to copy the
content of the public key file and paste it into the form.

NOTE: You need to ensure that you tick the “Allow write access” checkbox,
otherwise we would not be able to push to the repository with the
corresponding private key.

This actually concludes the configuration of the GitHub Pages repository
for now – in later articles I will document how one could leverage the
repository Issues for managing comments on the website and maintaining the
counters for likes on pages, but that will be a completely separate post :).

Setting up the private code repository

A typical Pelican repository layout is quite simple and comprises one
mandatory directory and one semi-mandatory file; everything else is
optional, but can be used to enhance your experience.

The mandatory directory is the so-called “content” directory (in Pelican’s
terms). The name of the directory can be anything you want, but it better
be reflected in the PATH = directive of the settings file.

I am saying “better be” since Pelican can operate without any configuration
file, but the result will be limited; hence I call the pelicanconf.py file
(which is the default name for the configuration file) “semi-mandatory”.
The name of the configuration file can also be anything you like; however,
I suggest sticking with the default for now.

Basically, you can quickly start by following the Pelican documentation and
doing something like the following:

The output from elinks was truncated on purpose since I just wanted to
showcase that Pelican has indeed generated the structure of a static
website from the single article file we created.

Before we push our local repository to GitHub we may want to do some
housekeeping first, such as creating the .gitignore file listing the
temporary things we do not want Git to track. A good enough version of the
.gitignore file I am using for my code repository is the following:

*~
*.pyc
.*.swp
**/__pycache__
/output

Do not forget to actually commit that .gitignore file to your local
repository using the git add .gitignore && git commit -m 'Added .gitignore',
by the way.

Now, we need to create a private repository on GitHub, so jump into your
browser, go to your GitHub account, press the “+” icon in the upper right
corner (right next to your profile icon), and select “New repository”.

On the “Create repository” page put whatever you desire as the name and the
description of the repository you are about to create. Ensure that the
“Private” radio button is selected and uncheck the “Initialize this repository
with a README” if it was checked.

Once the repository is created, you will be presented with a page that
enumerates your options for the next step, but I will just go ahead and show a
session dump of what you will need to do. In the following session snippet
blog is the repository name I chose for my private code repository and you
will need to replace it with your private repository name (the working
directory is our newly created local repository):
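In the same session-dump style as earlier, the whole thing boils down to
something like this (with <username> standing in for your GitHub user name
and master assumed as the branch name):

```shell
[user@localhost blog]$ git remote add origin git@github.com:<username>/blog.git
[user@localhost blog]$ git push -u origin master
```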

Do you remember how we generated a deploy key pair earlier and installed the
public key part into the public blog repository, so GitHub would allow the
bearer of the private key to authenticate and deploy changes to the public
blog repository? Well, since the purpose of this article is to introduce
full automation, the bearer of the key will be the GitHub Action associated
with the private repository, hence we need to provide the action with the
private key somehow.

GitHub has a feature called “repository secrets” and it is a perfect
candidate for passing the private key to the GitHub Action. We need to
follow the official documentation for the feature and create a secret
called “DEPLOY_KEY” with the content of the private part of the deploy key.
It will be used in the last step of the GitHub Action we are about to
define.

Configuring the GitHub Action for publishing

Everything is well and good, but “where is the automation?”, you may ask.
After all, I suspect this was the primary reason you are reading this post.
Well, we are about to look into the automation part, and it is rather short
in comparison to all the steps we took to set the repositories up.

Our automation relies on the GitHub Actions feature of GitHub. In plain
terms, a GitHub Action is a free compute resource provided by GitHub (there
are some limits, but for the purposes of a personal blog it is unlikely
that you will ever hit them).

Each GitHub Action is associated with a specific repository and is defined
using quite a simple YAML configuration file that instructs GitHub on how to
provision a required compute environment and what to run inside that
environment. The YAML file can be arbitrarily named and resides in the
.github/workflows/ subdirectory (starting from the root of the corresponding
repository).

The GitHub Action I am using for my blog web site is stored in
.github/workflows/pelican.yml and contains the following (we will dissect it
further down the post):
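A sketch of what such a workflow can look like, reconstructed from the step
descriptions that follow – the branch name, the requirements.txt file, the
commit identity, and the <username> placeholders are my assumptions, and
details such as error handling are simplified:

```yaml
name: Publish the blog
on:
  push:
    branches: [ master ]

jobs:
  publish:
    runs-on: ubuntu-latest
    env:
      LANG: en_AU.UTF-8
    steps:
      - name: Initialise locale
        run: |
          if [ -n "$LANG" ]; then
            echo "$LANG UTF-8" | sudo tee -a /etc/locale.gen
            sudo locale-gen
            sudo update-locale LANG="$LANG"
          fi

      - name: Checkout the primary repo
        uses: actions/checkout@v2
        with:
          fetch-depth: 0        # full history is needed to restore timestamps
          submodules: recursive

      - name: Restore modification times for content
        run: |
          # "<epoch> <file>" pairs, newest first, one entry per file
          git log --pretty='format:%ct' --name-only -- content themes/mind-drops/content \
            | awk 'NF { if ($0 ~ /^[0-9]+$/) ts = $0; else print ts, $0 }' \
            | sort -r | awk '!seen[$2]++' \
            | while read -r ts f; do [ -f "$f" ] && touch -d "@$ts" "$f"; done

      - name: Checkout Pages repo
        uses: actions/checkout@v2
        with:
          repository: <username>/<username>.github.io
          path: output

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Generate the website
        run: |
          # drop everything except hidden entries (.git) so deletions propagate
          find output -mindepth 1 -maxdepth 1 ! -name '.*' -exec rm -rf {} +
          TZ="$(sed -n "s/^TIMEZONE = '\(.*\)'.*/\1/p" pelicanconf.py)" pelican content

      - name: Push the changes to GitHub Pages
        env:
          DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}
        run: |
          cd output
          git add -A
          git status --porcelain | grep -q . || exit 0   # nothing to publish
          mkdir -p ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
          eval "$(ssh-agent -s -t 300)"                  # keys expire in 5 minutes
          printf '%s\n' "$DEPLOY_KEY" > ~/.ssh/deploy_key
          chmod 600 ~/.ssh/deploy_key && ssh-add ~/.ssh/deploy_key
          git remote set-url origin git@github.com:<username>/<username>.github.io.git
          git -c user.name='GitHub Action' -c user.email='action@invalid' \
            commit -m "Site update from ${GITHUB_REPOSITORY}@${GITHUB_SHA}"
          git push origin master
          ssh-add -D; ssh-agent -k
```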

This is a copy of my live GitHub Action for deploying the blog that you are
most likely reading right now. I decided not to edit anything, so if you
just want to re-use it you will need to replace a few things, namely:

en_AU.UTF-8 => to a locale you are using (you can run locale -a if you
are running Linux to see the list of locales available on your system);

content => you may need to change that to the name of your content
directory (if you did not use the default name);

themes/mind-drops/content => you will need to drop this line since it is
my theme’s content directory and you will not have it;

With that sorted, let’s look a bit more closely at how this GitHub Action is
structured and what each step does.

It all starts with the definition of the action itself and the conditions of
how it is triggered and how it runs; you can find a formal description of
the YAML structure of this configuration file in the official GitHub
documentation on Workflows.

Here, we are only going to focus on the steps defined under the “jobs:”
section of the file, since these steps define the logic we are after.

The “Initialise locale” step is quite important for Pelican since with a
misconfigured locale Pelican tends to produce incorrect output (which is
kind of expected). So in this step we try to determine whether the user (us
:) ) has supplied the LANG variable; if so, we update the /etc/locale.gen
file, run the locale-gen command to update the corresponding files, and
finally set the locale of the container to the requested locale.

The “Checkout the primary repo” step leverages the official “Checkout V2”
Action and checks out a full copy of the source code repository of our blog
and all the linked submodules. Initially, I was using a shallow copy via
fetch-depth: 1, but the next step requires the full repository history to
do its job reliably, so I changed it to a full-history clone.

Since git does not store timestamps for the files and directories under its
control, yet Pelican relies on timestamps to populate the modification time
of the artefacts, we need to find a way to reconstruct at least the file
timestamps after the tree is checked out. One possible approach would be to
create a plugin that could determine whether we are inside a git working
tree and, depending on that, apply different timestamp extraction policies,
but I thought that a much easier way would be to prepare the checked-out
tree, hence making it compatible with the way Pelican expects things to be.

The “Restore modification times for content” step is my variant of how one
could reconstruct the timestamps for files close enough to make it possible to
use with Pelican. The approach relies on the fact that git records the
timestamp of each commit including adding, updating, and deleting files. We
create a list of all these file events using git log for file trees under
“content” (where our blog content lives) and “themes/mind-drops/content” (where
my custom theme injects some content such as the Web service worker script),
then we use sed to filter and to re-arrange the output a bit, followed by
reverse sorting to help to remove the entries that were introduced and later
deleted. In the end, we have a list of file names with timestamps, so we go
through the list in a loop and set the timestamps to files using touch.
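The idea can be demonstrated on a tiny throwaway repository. This is a
sketch, not my exact pipeline: it filters with awk rather than sed, and it
assumes file names contain no spaces.

```shell
# Build a one-commit demo repository with a known commit timestamp
mkdir -p mtime-demo/content
git -C mtime-demo init -q
echo 'hello' > mtime-demo/content/a.txt
git -C mtime-demo add content/a.txt
GIT_COMMITTER_DATE='2020-01-02 03:04:05 +0000' \
  git -C mtime-demo -c user.name=demo -c user.email=demo@example.com \
  commit -q -m 'add a.txt' --date='2020-01-02 03:04:05 +0000'
touch mtime-demo/content/a.txt   # a fresh checkout would also leave "now" here

# "<epoch> <file>" pairs, newest first, one entry per file, applied via touch
git -C mtime-demo log --pretty='format:%ct' --name-only -- content \
  | awk 'NF { if ($0 ~ /^[0-9]+$/) ts = $0; else print ts, $0 }' \
  | sort -r | awk '!seen[$2]++' \
  | while read -r ts f; do
      [ -f "mtime-demo/$f" ] && touch -d "@$ts" "mtime-demo/$f"
    done

date -u -r mtime-demo/content/a.txt '+%Y-%m-%d %H:%M:%S'   # 2020-01-02 03:04:05
```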

The “Checkout Pages repo” step clones the public blog repository into the
“output” directory where Pelican will put the generated files. This is
needed so that we can track the changes to the public repository, since
Pelican is careful enough (if not instructed otherwise) to only update the
files it generates and leave everything else in place. We use this later to
determine whether any new content has been generated.

The “Set up Python” and the “Install dependencies” steps are pretty generic:
the former uses the official GitHub Action to install and configure the
latest available version of Python 3.x, and the latter leverages pip to
install all of the blog’s dependencies (including Pelican itself).

The “Generate the website” step is running pelican to process our articles
and pages and to generate the result in the “output” directory. There are a couple of tricks with this step, though.

The first trick, which is not that obvious, is that we remove the content of
the “output” directory. It seems a bit weird since we just checked it out
several steps before, does it not? Well, we remove everything except hidden
files and directories, which happen to include the “.git” subdirectory with
all the actual data about the repository.

Why do we do it? Simple: this helps us detect when some file or directory
was removed, so we can propagate this knowledge to the public blog
repository. If we did not clean up the content of the “output” directory we
would only append new changes and would never remove anything – which is how
it behaved before I stumbled upon the problem, by the way. :)
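The cleanup can be expressed in several ways; here is a sketch of the idea
on a miniature stand-in tree (the exact command in my Action may differ):

```shell
# Recreate a tiny "output" tree: a .git directory plus some generated files
mkdir -p output/.git output/posts
touch output/index.html output/posts/one.html

# Remove every top-level entry except hidden ones, which keeps .git intact
find output -mindepth 1 -maxdepth 1 ! -name '.*' -exec rm -rf {} +

ls -A output    # only .git is left
```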

The second trick of the “Generate the website” step is that it extracts the
time zone information from the configuration files and sets the TZ variable
correctly just before we call pelican. Without this, Pelican may fail or,
if it does not, it will produce UTC-based dates and times, which would be
undesirable (at least for me, since my time zone is in Australia).
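The extraction might look something like the following sketch (the
demo-pelicanconf.py file name is made up, the sed pattern is illustrative
rather than my exact one, and it assumes TIMEZONE is declared with single
quotes):

```shell
# A made-up settings file with a TIMEZONE declaration
printf "TIMEZONE = 'Australia/Melbourne'\n" > demo-pelicanconf.py

# Pull the value out and export it as TZ before running pelican
TZ="$(sed -n "s/^TIMEZONE = '\(.*\)'.*/\1/p" demo-pelicanconf.py)"
export TZ
echo "$TZ"    # Australia/Melbourne
```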

The final step is to push the updated content to the public blog repository,
which will make it visible via GitHub Pages. Several things to notice there
are:

In the env: section we set up the DEPLOY_KEY variable – this syntax is used
to retrieve a named value from the secrets associated with a repository. We
stored the private part of the deploy key we generated specifically for
this purpose at the beginning of this article in the private repository’s
secrets.

git status is used to determine whether there are any changes between
what we have in the working tree and the repository index. If no changes
were detected we just exit gracefully.

If any change to the generated content was detected, we temporarily load
the private part of the deploy key into the ssh-agent (for 5 minutes),
push changes to the public blog repository, then clean up the key from
the agent and kill the agent itself.
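The key handling in that last step can be sketched as follows – here a
freshly generated throwaway key stands in for the DEPLOY_KEY secret, and
the git push itself is omitted:

```shell
# A throwaway key in place of the real deploy key
ssh-keygen -q -t ed25519 -N '' -C 'throwaway' -f ./throwaway

eval "$(ssh-agent -s -t 300)" > /dev/null   # keys loaded now expire in 5 minutes
ssh-add ./throwaway 2> /dev/null
ssh-add -l > loaded-keys.txt                # record what the agent holds

# ... this is where "git push" to the public repository would happen ...

ssh-add -D 2> /dev/null                     # remove the key from the agent
ssh-agent -k > /dev/null                    # and kill the agent itself
```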

From this point on, any push to the private code repository will trigger the
GitHub Action, and if the change results in any updated content, that
content will be published to GitHub Pages!

There are quite a few things we could improve, such as introducing a
broken-links check, doing some sanity checks, etc. – but that would be for
another article, I guess. :)