Something cool about Perl 6 every day

Primary Menu

Day 2 – Bioinformatics with Perl 6

Summer of 2016 and I’m preparing to help teach a class on metagenomics with my boss, Dr. Bonnie Hurwitz, at The University of Arizona. She and I are both alumi of Dr. Lincoln Stein’s lab when he was at Cold Spring Harbor where we worked on Gramene, a comparative genomics platform for plants. Way back in 1999, Lincoln created an intensive two-week course at CSHL called “Programming for Biologists” to teach everything from Unix to Perl to BLAST and CGI. I was fortunate to be a teaching assistant several times over the years and came to enjoy helping bench scientists learn computational tools and techniques. In the fall 2015 debut class of Bonnie’s class, we used Lincoln’s material to teach Perl 5, but I was ready to push into a new language.

Over the years, I’ve played with Python, Lisp, Ruby, Haskell, Prolog, and others. Python would have been the obvious choice to teach our students, but I felt like I already knew an interpreted, dynamically typed language. While I admire the type safety of Haskell, I can’t imagine being able to teach command-line Unix, HPC, metagenomic analysis, and Haskell to absolute beginners in one semester. I decided to spend a month over the summer writing in Perl 6, and it didn’t take long until I was hooked.

There were no print books available and only a handful of online resources like https://docs.perl6.org/, http://perl6intro.com/, and https://learnxinyminutes.com/docs/perl6/. None of these were tailored to beginning scientists, so I started writing my own. My idea was to teach solutions to common problems in biology using the various styles and strengths of the language. For each task, I present maybe 2 or sometimes 6 versions, usually getting shorter as I teach more powerful techniques gained by functional programming techniques or object-oriented programming.

As an example, our students are writing their final program for the semester, wrapping up a re-analysis of the skin microbiome (Grice et al 2011). Over these last few months, they’ve followed a series of detailed protocols to download the raw reads, push them through quality control measures, assemble into “contigs” (contiguous bits of sequence), call putative genes, estimate taxonomy, and annotate gene to determine the functions of the organisms at the various body sites (e.g., armpit vs toe web). Their last assignment is to annotate the contigs with KEGG categories.

Since KEGG supports itself with yearly subscriptions, among other funding, they don’t make it exactly easy to get all the data our students need. We used uproc to attach KEGG identifiers, and we want to group those identifiers into categories to find the top 5 functions happening in each student’s subset of samples. I found a handy shell script to use the KEGG API to download KEGG-IDs-to-pathways and pathways-to-categories. So now we need to link KEGG IDs (e.g., “K01425”) to the pathway category (e.g., “GABAergic synapse”) via the “path:map04727”:

I’d like to break this down to explain all the lovely goodies. First and thing-that-completely-sold-me-on-Perl-6 is the special MAIN subroutine and all the goodness packed into signatures. Named arguments, e.g., “–pathway-list,” can defined by creating Pairs, which is as simple as putting a “:” in front of the variable that will hold the argument’s value. I can assign a default value with “=”, constrain the value with a Type like “Str” or “Bool,” and even add arbitrary conditions like checking if the argument is an existing file (“*.IO.f”) or directory (“*.IO.d”). If I see myself reusing those same checks (e.g., I take multiple “file/dir” arguments), I can easily create my own “subset”:

I like to use reasonable defaults for my arguments, and here I wanted to show how I could check if the output file already exists and how to ask the user via “prompt” if they wish to overwrite it. In Perl 5, I would have written:

exit unless lc($overwrite) =~ /^y/;

A direct translation to Perl 6 would use “.lc” (lowercase) as a Str method and using the ~~ “smart match” operator:

exit unless $overwrite.lc ~~ /^y/;

But here I wanted to show a more Python-ish use of “starts-with” so as to avoid freaking out beginners with regular expressions. (I like that Perl has a long history of borrowing/stealing the best features of other languages!)

I am very happy with the new “open” routine as the Perl 5 way always seems backwards:

File I/O is very easy (IMHO) — I can coerce/cast a variable with “.IO” and then call “lines” (or “words” or even “comb” if I wanted letters). Here I want to process each line, using our trusty old “map” (from functional programming) to apply the “split” function to the “*” (Whatever) to get a list which, for visual purposes, I join on “=”. The result of the lines/map is itself a list which I joined on commas so you can see the two lists together.

Something that is completely different between Perl 5 and 6 is how they handle lists of lists. Perl 5 would flatten them:

DB x ((1,2), (3,4))
0 1
1 2
2 3
3 4
DB x scalar ((1,2), (3,4))
0 4

But Perl 6 allows nesting:

> ((1,2), (3,4))
((1 2) (3 4))
> ((1,2), (3,4)).elems
2

But I can’t create a hash from a list of lists, only from a list of pairs:

The “dir” routine gives you a directory listing, and, since my type constraint already checked that the variable is an existing directory, I can rest assured that this will work. The result of “dir” is a list of IO::Path objects, and I can “grep” (“filter” in some other languages) for those that smart-match to the pattern of the string “.ko” anchored to the end (“$”). (I could have also used the Python-like “ends-with” method.)

The result of the dir/grep call is a list, and as I wrote in another post, you can call call “kv” (as well as “pairs“) on any list to get both the position and value of each member:

The other bit of awesomeness is that Perl 6 allows you to consume a variable number of elements of a list with optional defaults if the list is not evenly divisible by the number of elements you’ve requested:

Heck, you don’t even have to declare them as we have implicit variables:

> for 1..4 { put "$^a, $^b" }
1, 2
3, 4

So then I have a (zero-offset which is why I add 1) file number and name which I can use to print my progress using “printf.” In Perl 5, I would have to establish my counter $i before the loop (which means it would continue to be visible after the loop) and be sure to increment the counter while understanding the nuances of ++$i vs $i++:

As in Perl 5, “next” and “last” are still part of control flow, and I skip over mapping IDs that I didn’t find in the mapping file:

my $cat = %map{ $map-id } or next;

The KO ID looks like “ko:K00001”, so I need to remove the leading “ko:” which is easily accomplished with the “subst” (substitute) operation:

my $kegg-id = $ko-id.subst(/^'ko:'/, '');

In Perl 5 I would have done this, which I feel is unquestionably more opaque:

(my $kegg_id = $ko_id) =~ s/^ko://;

It’s worth noting that in Perl 5 I might also have made the decision to reuse and mutate the $ko_id like so:

$ko_id =~ s/^ko://;

But in Perl 6, the my $ko-id variable is not actually variable — by default it is read-only. In the following example, you see that “subst” returns a new string with the substitutions requested, but the string itself remains unchanged:

All this is a nod to the very good influence of purely functional programming languages where data is immutable. It’s not nice to always be constrained by this idea, but here the application, I believe, enforces a very good programming practice.

Some of the KEGG IDs might be “KO” (KEGG Orthology), so I want to ensure that I’m only dealing with strings that look like “K00001” — a capital “K” followed by five digits:

if $kegg-id.match(/^ K \d ** 5 $/) {

It’s also possible to write that with the smart-match operator. Both will return Match if successful:

Like Perl 5, version 6 still (IMO) makes easy things easy and hard things possible, and I hope this article makes you want to use Perl 6 today. I will continue to add chapters and examples to my book and Github repo in the hopes that people will steal as much as they need to get started.