How to generate a blog-wide word count in Jekyll

February 13, 2013

One of the “minor” tasks left on my to-do list since making the transition to Jekyll was to come up with a quick way to generate a blog-wide word count. This metric is just something I like to have handy, and I may end up putting it on the About page. (Some of you may remember that years ago I wrote a plugin for WordPress to do this very thing.)

Initially, I tried to tackle the problem from just the shell, and it is doable, but inaccurate. All of Jekyll’s blog posts exist in a single directory, and so the following does work:

wc -w * | tail -n 1 | awk '{print $1}'

This simply hands every blog post to the wc command and keeps only the grand total from its last line of output. The problem is that it doesn’t ignore the YAML front matter present in every post, so words that shouldn’t be included get added to the count. Over a very large site, these extra words can really skew your word count.
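To make the skew concrete, here is a minimal sketch in Python. The sample post and its front matter fields are made up for illustration, and splitting on whitespace is only assumed to be close to what wc -w counts:

```python
# Hypothetical sample post; the front matter fields are illustrative only.
front_matter = """---
layout: post
title: Hello World
tags: [jekyll, blogging]
---
"""
body = "This is the actual post body.\n"

# Whitespace-delimited words, roughly what `wc -w` reports for the whole file.
total_words = len((front_matter + body).split())

# The words we actually want to count: the body alone.
body_words = len(body.split())

print(total_words, body_words)  # prints: 16 6
```

In this toy post the front matter contributes 10 of the 16 counted words. Real posts are much longer, so the ratio is far smaller, but the overhead repeats in every single file.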

After that idea crashed and burned, I thought I could just come up with a regex that would grab the YAML headers, use grep or egrep to do the matching, and then pipe the inverse of the result into the wc command. I ran into a snag after coming up with a regex, though: grep doesn’t recognize regex modifiers, and by default it matches one line at a time. I needed “single-line” mode, in which the “.” operator matches any character, including newlines, so that one pattern could span the whole multi-line header.
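For readers unfamiliar with single-line mode, the sketch below shows the difference using Python's re module, where the modifier is called re.DOTALL. The pattern and the sample post are my own assumptions for illustration (front matter opening on the first line and delimited by lines containing only ---), not the regex I originally tried with grep:

```python
import re

# Hypothetical post: front matter between two "---" lines, then the body.
post = """---
layout: post
title: Hello World
---
This is the actual post body.
"""

# Assumed pattern: opening "---" at the very start, lazily match up to the closing "---".
pattern = r"\A---\n.*?\n---\n"

# Without DOTALL, "." stops at every newline, so the multi-line header never matches.
assert re.search(pattern, post) is None

# With DOTALL ("single-line" mode), "." also matches newlines and the header is found.
front_matter_re = re.compile(pattern, re.DOTALL)
body = front_matter_re.sub("", post, count=1)

print(len(body.split()))  # prints: 6
```

The lazy quantifier (.*?) matters here: a greedy .* in DOTALL mode would run to the last --- in the file, which could swallow part of a post that uses --- as a horizontal rule.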

After banging my head against the wall with that for a while, I just decided to tackle the problem in Python, and was able to whip up a solution pretty quickly, despite my inexperience with the language. The following is what I came up with: