(Reverse) Literate Programming on Jekyll/GitHub

In this post I show two ways to document a piece of code on Jekyll, the static site generator powering GitHub Pages (including this blog).

The Story Behind the Story

I tried to explain how this piece of code found that (1+2)!! + 3!^4 - 5 = 2011, but C++ is not famous for being concise and transparent. So I tried to give the general idea, and gave the full source in one block, but it was not very satisfying. Full of remorse, I’m now writing this post about how I should have written the previous post, so I can sleep at night again.

I’ll get some inspiration from literate programming, which I’ll quickly present. Then I’ll show how it could have been used on the previous post. This method won’t work for every language, so I will present another one, based on a pending addition to Jekyll.

Literate Programming

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

Basically, the documentation and the code are written in the same file, but in the order of natural speech rather than the order expected by the compiler/interpreter, unlike Javadoc-like systems. A preprocessor then extracts the code and reassembles it in the order expected by the compiler/interpreter (hereafter: the processor).

In order to show the processor that we are serious about not caring about him, literate programming can be done with LaTeX or a word processor. Yes, I spent an internship coding (Spec#) within Word, and yes, I actually enjoyed it. More recently I discovered Sweave, which makes statistical exploration with R enjoyable (for a programmer).

Of course, the rule that any idea has already been implemented (better) and published also applies to literate programming with Markdown (Markdown being the notation used to format this article). In this case, a script comments out all the text that is not marked as code. I guess the author calls this “lightweight literate programming” because the code still has to be written in the order expected by the processor.
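To make the "comment out everything that is not code" idea concrete, here is a minimal sketch in Ruby. It is not the script mentioned above: the function name `comment_out_prose` and the `# ` comment prefix are assumptions of mine, and only Markdown's four-space indented code blocks are recognised.

```ruby
# Turn a Markdown document into source code: lines indented by four
# spaces (Markdown's indented code-block convention) are kept as code,
# every other non-blank line is commented out.
# The "# " prefix is an assumption that suits Ruby, Python or shell.
def comment_out_prose(markdown, prefix = "# ")
  markdown.lines.map do |line|
    text = line.chomp
    if text.start_with?("    ")
      text[4..-1]        # code line: drop the Markdown indentation
    elsif text.strip.empty?
      ""                 # blank line: keep it blank
    else
      prefix + text      # prose line: hide it from the processor
    end
  end.join("\n") + "\n"
end

doc = <<~'MARKDOWN'
  First we greet the reader.

      puts "hello"

  Then we leave.
MARKDOWN

puts comment_out_prose(doc)
```

The code comes out in exactly the order it was written in, which is why this approach stays "lightweight": nothing gets reordered for the processor.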

Manual Literate Programming On GitHub’s Jekyll

The idea here is to use the #include directive of the C/C++ preprocessor along with the {% include %} tag of the Liquid Extensions of Jekyll to build the “interconnected ‘webs’ of macros” of Knuth’s vision of literate programming. The extraction step is manual: each macro will be written in a separate file.

This Markdown document is describing how to write a Markdown document describing other files by including them… Way too meta for me. If you too need to lower down the abstraction, you can have a look at the files I’m talking about:

The whole point of all this is that the code remains compilable (e.g. with g++ -o search search.cpp) without any transformation, just like the monolithic main.cpp, but the explanation can be a lot clearer. It also means that there is no need to propagate modifications into the description should the code change, or vice versa.
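To see why the fragments still add up to one compilable program, here is a small Ruby simulation of what the C/C++ preprocessor does with quoted #include directives. Everything in it is hypothetical: the fragment names and contents are made up for the example (they are not the files of the previous post), and a hash stands in for the file system.

```ruby
# Hypothetical fragments: one "macro" per file, as in the manual approach.
FRAGMENTS = {
  "search.cpp" => <<~'CPP',
    #include <iostream>
    #include "answer.cpp"
    int main() {
    #include "report.cpp"
    }
  CPP
  "answer.cpp" => <<~'CPP',
    int answer() { return 2011; }
  CPP
  "report.cpp" => <<~'CPP',
    std::cout << answer() << std::endl;
  CPP
}

# Recursively replace each `#include "name"` line by the content of the
# fragment called `name`. Lines such as `#include <iostream>` do not match
# the quoted form and are left untouched.
def expand_includes(name, fragments)
  fragments.fetch(name).lines.map do |line|
    match = line.match(/^\s*#include\s+"([^"]+)"/)
    if match && fragments.key?(match[1])
      expand_includes(match[1], fragments)
    else
      line
    end
  end.join
end

puts expand_includes("search.cpp", FRAGMENTS)
```

The post includes the same fragments one by one, in whatever order suits the explanation, while the compiler sees the reassembled whole.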

Reverse Literate Programming on GitHub’s Jekyll

Now the problem is that a lot of programming languages are too elegant to dirty their parsing hands with the include of a preprocessor. Of course, most languages allow the import of some notion of modules, but that’s just the processor trying to dictate to us the order in which to describe a program again.

Without heavy weaponry like a real literate programming preprocessor, we will have to call a truce. We’ll still write the code in the order of the processor, but explain it in a human order. Anyway, no one actually believed that I wrote the code of the previous section in the presented order.

Instead of writing a story that will be transformed into a program, we will write a program that will be reassembled into a story – hence the reverse literate programming. The feature needed here is to be able to extract parts of the source code. Jekyll’s {% include %} can only extract the whole, but if this patch makes it through, we will be able to use blocks like:

    module Jekyll

      class ExtractBlock < Liquid::Block

        def unknown_tag(name, content, tokens)
          case name
          when "after"
            @after = content.strip
          when "before"
            @before = content.strip
          else
            super
          end
        end

        def initialize(tag_name, file, tokens)
          super
          @file = file.strip
        end

        def render(context)
          includes_dir = File.join(context.registers[:site].source, '_includes')

          if File.symlink?(includes_dir)
            return "Includes directory '#{includes_dir}' cannot be a symlink"
          end

          if @file !~ /^[a-zA-Z0-9_\/\.-]+$/ || @file =~ /\.\// || @file =~ /\/\./
            return "Include file '#{@file}' contains invalid characters or sequences"
          end

          Dir.chdir(includes_dir) do
            choices = Dir['**/*'].reject { |x| File.symlink?(x) }
            if choices.include?(@file)
              source = File.read(@file)
              #Preceding code is the same as IncludeTag.render
              matchdata = source.match /#{Regexp.escape(@after)}[^\n]*\n(.*)\n.*#{Regexp.escape(@before)}/m
              if matchdata.nil? or matchdata.size < 2
                return "Unable to determine which lines of '#{@file}' " +
                  " are between '#{@after}' and '#{@before}'"
              end
              source = matchdata[1]
              #Following code is the same as IncludeTag.render
              partial = Liquid::Template.parse(source)
              context.stack do
                partial.render(context)
              end
            else
              "Included file '#{@file}' not found in _includes directory"
            end
          end
        end
      end

    end

    Liquid::Template.register_tag('extract', Jekyll::ExtractBlock)

I could simply write a Markdown file containing:


Blablabla, I just copy-pasted `include.rb` and added this processing to the
content read from the file:
{% extract extract.rb %}
{% after #Preceding code is the same as IncludeTag.render %}
{% before #Following code is the same as IncludeTag.render %}
{% endextract %}

Which would give:

Blablabla, I just copy-pasted include.rb and added this processing to the
content read from the file:

    matchdata = source.match /#{Regexp.escape(@after)}[^\n]*\n(.*)\n.*#{Regexp.escape(@before)}/m
    if matchdata.nil? or matchdata.size < 2
      return "Unable to determine which lines of '#{@file}' " +
        " are between '#{@after}' and '#{@before}'"
    end
    source = matchdata[1]
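The heart of the tag is that one regular expression, so it can be tried outside Jekyll. The sketch below isolates it in a plain Ruby function; the name `extract_between` and the sample source are mine, only the pattern comes from the plugin.

```ruby
# Return the lines of `source` that sit strictly between the line
# containing `after` and the line containing `before`, or nil when the
# two markers cannot be found in that order.
def extract_between(source, after, before)
  # Same pattern as in the plugin: skip the rest of the `after` line,
  # capture everything up to the newline preceding the `before` line.
  # The /m flag lets `.` match newlines.
  pattern = /#{Regexp.escape(after)}[^\n]*\n(.*)\n.*#{Regexp.escape(before)}/m
  match = source.match(pattern)
  match && match[1]
end

source = <<~'RUBY'
  def area(radius)
    # begin formula
    pi = 3.14159
    pi * radius * radius
    # end formula
  end
RUBY

puts extract_between(source, "# begin formula", "# end formula")
```

Because the capture group is greedy, the excerpt runs up to the last line containing the `before` marker if that marker appears several times, so the marker comments are best kept unique within a file.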

Alternatives

I can think of two other ways to specify the part of the file to be included. First, specifying the numbers of the first and last line to keep would be the easiest, but it would mean that the documentation has to be updated each time the file changes.
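A quick sketch of the line-number variant shows how fragile it is. The function `extract_lines` and the two versions of the file are hypothetical, made up for the illustration.

```ruby
# Keep lines `first` to `last` (1-based, inclusive) of `source`.
# An out-of-range request yields an empty string.
def extract_lines(source, first, last)
  (source.lines[(first - 1)..(last - 1)] || []).join
end

version1 = "a = 1\nb = 2\nputs a + b\n"
# The same file after someone adds a comment at the top.
version2 = "# computes a sum\n" + version1

puts extract_lines(version1, 2, 2)  # the line we meant to show
puts extract_lines(version2, 2, 2)  # same numbers, different line
```

After the edit, the same pair of numbers silently points at a different line, and nothing warns the author of the post.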

Second, a from/until pair could specify border lines similarly to the after/before pair, except that the content would include these two lines. The problem here is that premature truncations could happen if we want to stop the inclusion on lines such as } or end.
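Here is a sketch of such an inclusive from/until extraction, assuming it stops at the first line containing the until marker (that stopping rule is my assumption; the name `extract_from_until` and the sample are hypothetical too).

```ruby
# Keep the lines from the first one containing `from` up to and including
# the next line containing `upto`. The `from` line itself never ends the
# region. Returns an empty string if `from` is not found.
def extract_from_until(source, from, upto)
  kept = []
  inside = false
  source.each_line do |line|
    inside = true if !inside && line.include?(from)
    next unless inside
    kept << line
    break if kept.size > 1 && line.include?(upto)
  end
  kept.join
end

source = <<~'RUBY'
  def sign(x)
    if x < 0
      return -1
    end
    1
  end
RUBY

puts extract_from_until(source, "def sign", "end")
```

The excerpt stops at the `end` closing the `if`, not at the one closing the method: that is the premature truncation. A dedicated comment line, as in the after/before pair, avoids the ambiguity.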

In the end, the only assumption in the after/before pair is that one can add lines only for the sake of documentation, but I have not seen a programming language that does not allow comments, so it is always possible to delimit regions.