In praise of Commonmark: wrangle (R)Markdown files without regex

You might have read my blog post analyzing the social weather of
rOpenSci
onboarding,
based on a text analysis of GitHub issues. I extracted text out of
Markdown-formatted threads with regular expressions. I basically
hammered away at the issues using tools I was familiar with until it
worked! Now I know there is a much better and cleaner way, which I'll
present in this note. Read on if you want to extract insights about
text, code, links, etc. from R Markdown reports, Hugo website sources,
GitHub issues… without writing messy and smelly code!

Introduction to Markdown rendering and parsing

This note will appear to you, dear reader, as an html page, either here
on ropensci.org or on R-Bloggers, but I’m writing it as an R Markdown
document, using Markdown syntax. I'll knit it to Markdown, and then
Hugo's Markdown processor, Blackfriday, will transform it to html.
Elements such as `# blabla` thus get transformed to `<h1>blabla</h1>`.
Awesome!
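
Blackfriday is a Go library, but you can see the same kind of transformation from R with the commonmark package (a different Markdown processor, used here purely for illustration):

```r
# commonmark is not the processor Hugo uses, but the
# Markdown-to-html transformation is the same in spirit
commonmark::markdown_html("# blabla")
#> [1] "<h1>blabla</h1>\n"
```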

The rendering of Markdown to html or XML can also be used as a way to
parse it, which is what the spelling package does in order to identify
the text segments of R Markdown files, so that it spell checks only the
text, not the code. I had an aha moment when seeing this spelling
strategy: why did I ever use regex to parse Markdown for text analysis?!
Transforming it to XML first, and then using XPath, would be much
cleaner!
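
Here is a minimal sketch of that idea, using commonmark to convert a made-up snippet of Markdown to XML, and xml2 to query the result with XPath:

```r
library(commonmark)
library(xml2)

md <- "# A title

Some text with a [link](https://ropensci.org) and `inline code`.

    a_code_block <- 42
"

# convert the Markdown to XML instead of html
doc <- read_xml(markdown_xml(md))

# xml2 maps the default CommonMark namespace to the "d1" prefix,
# so XPath queries need it
xml_text(xml_find_all(doc, "//d1:text"))                 # the text segments
xml_attr(xml_find_all(doc, "//d1:link"), "destination")  # the link URLs
xml_text(xml_find_all(doc, "//d1:code_block"))           # the code, kept out of text analysis
```

No regular expression in sight: each kind of element (text, link, code block) is one XPath query away.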

As a side-note, realizing how to simplify my old code made me think of
Jenny Bryan's inspiring useR! keynote talk about code smells. I asked
her whether code full of regular expressions instead of dedicated
parsing tools was a code smell; sadly it doesn't have a specific name,
but she confirmed my feeling that not using dedicated, purpose-built
tools might mean you'll end up "re-inventing all of that logic yourself,
in a hacky way." If you have code falling under the definition below,
maybe try to re-factor and, if needed, get help.

It’s that feeling when you want to do something that sounds simple but
instead your code is like 10 stack overflow snippets slapped together
that you could never explain to another human what they do 😰

This beats my old code! There’s really something to be said for
purpose-built tools.

Conclusion

I hope this note will inspire you to use commonmark and xml2 when
analyzing Markdown files. As mentioned earlier, Hugo or Jekyll website
sources are Markdown files, and so are GitHub issue threads, so this
should open up quite a lot of data! If you're new to XPath, I'd
recommend reading this introduction. The results of XML parsing are also
better processed without (you writing) regular expressions: I have shown
urltools for URL parsing, that base R has code parsing tools (parse,
getParseData), and I've used tidytext.
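
For instance, once URLs and code blocks have been pulled out of the XML, those dedicated tools take over (a small sketch with made-up inputs):

```r
library(urltools)

# parse a URL, e.g. one taken from a link node's destination attribute
url_parse("https://ropensci.org/blog/")

# parse R code, e.g. the contents of a code_block node,
# then inspect its individual tokens
parsed <- parse(text = "x <- 1 + 1", keep.source = TRUE)
utils::getParseData(parsed)
```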

Note that if you’re into blog analysis, don’t forget you can also get
information out of the YAML header using… the yaml
package, not regular expressions!
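
A minimal sketch, assuming a post stored in a hypothetical post.Rmd file whose YAML header is fenced by `---` lines:

```r
library(yaml)

lines <- readLines("post.Rmd")  # hypothetical file name
fences <- which(lines == "---")
header <- paste(lines[(fences[1] + 1):(fences[2] - 1)], collapse = "\n")
yaml.load(header)  # a named list with e.g. title, date, tags
```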

As a bonus, maybe seeing this wrangling has inspired you to modify
Markdown files programmatically? E.g. what if I wanted to automatically
replace all level 1 headers with level 2 headers? We're working on that,
so stay tuned, and if you want, follow this GitHub thread!