How Not To Write A Subroutine

It's been pointed out to me that many programmers have issues with my blog because I'm writing about allomorphism, exceptions as flow control and other topics that are, frankly, not something a beginner programmer is going to warm to. I think that's a fair point and I'm going to try to start including information aimed at those new to Perl and those new to programming.

I'll start with subroutines. Specifically, an old, horrible example.

Back in the days of CGI being king, you often saw a variant of the following routine to parse CGI data:

There are, frankly, so many things wrong with this (I won't touch on all of them because I could write for hours), but I'll touch on some of the highlights.

Note: this is very, very old code and we generally avoid CGI today, but I still see this and variants cropping up (fortunately, less than I used to).

Black Boxes

The first thing to remember about a subroutine is that it should be a "black box". You pass data in and get data out. Here you would actually want a second routine which reads the form data and and then hands it to the parse_cgi code. Why? See the POST handling? When you read from STDIN, you empty STDIN. That data cannot be reread without jumping through serious hoops (and who wants to explain how to tie STDIN?). What happens if you want to debug that? If you try to print the STDIN data after this subroutine, the data have been read and you have nothing to print. Print the data before this subroutine and the subroutine has no data to read. Oops. You lose.

That allows you to capture the form data before this call and do other things with it, if needed. Of course, it also allows you to do this, if you want:

my %formdata = parse_cgi(read_form_data_from_file($filename));

You might not need to do that, but by separating these behaviors and treating your subroutines as black boxes, you get a lot of flexibility.

Cohesion

Subroutines and methods should generally do one thing and do it well (we call this cohesion). The code in question both reads form data and attempts to parse it. In "Black Boxes", I showed how passing arguments to a subroutine makes it more flexible. This was achieved by separating the reading of data from the parsing of it. However, there's still another classic mistake here:

29 $value =~ s/<!--(.|\n)*-->//g;

I used to see this a lot in code. Aside from the fact that this regular expression is broken badly, what's it doing there? I've never seen anyone explain it, but my guess is that this was some strange attempt to strip server side includes (SSIs) from the data. If so, this is a terribly misguided attempt to provide some security for incoming data (there are much better ways of dealing with this, but I won't touch on that here).

If that's the case, what do you do if you want HTML comments and SSIs? Maybe you want to log the raw data, but since the substitution is poor, it has a good chance of mangling said data and you won't know because you tried to do too much in one routine (separating the reading and parsing can work around this). I can't imagine why you would want to keep them, but if you do, this code won't let you.

If you really do want to properly scrub your incoming data, though, there's a lot more than SSIs you want to eliminate. Thus, you should have dedicated code written for scrubbing the data, not trying to hack it into parsing code.

Understand Your Data

There are two serious flaws here. First, let's look at a valid query string on a URL (and note that ';' is a valid separator, but the ampersand is all the code split on):

http://www.example.com/?sport=football;sport=baseball

It's perfectly OK to have multiple values for a single key. The venerable CGI.pm would allow you to fetch them with this:

my @sports = $cgi->param('sport');

This code, however, would return this (complete with extra space):

%formdata = (
sport => "football, baseball",
);

Presumably this is to allow you to split on commas (plus that extra space tucked in there). But what if the value contains a comma? Oops. We have multiple values. That should be an array reference, not a string.

Now let's look at another valid URL:

http://www.example.com/?sport=football;sport=sport=baseball

Guess what your data structure has?

%formdata = (
sport => "baseball",
);

Why it contains this is an exercise for the reader.

If you don't really understand the nature of your data, you're going to write bad code like this.

Reuse Code

Even for this relatively simple task (and there are plenty of other issues I haven't touched on), it's easy to misunderstand what's going on. That's why, when CGI was so popular, the Perl community enjoined everyone to use the CGI.pm module. We don't do that any more -- there are far better ways of approaching this problem -- but we still don't recommend that you write your own code.

Summary

Subroutines should be black boxes that do one thing well. Don't rely on globals and don't do too much. They'll be more flexible that way.

These "new user" friendly blog posts will be an "on again/off again" sort of thing, but if they turn out to be popular, I'll write more.

12 Comments

There is another great benefit to splitting that subroutine into the read & parse pieces as shown in "Black Boxes":

You can then test the parser.

A novice can figure out how to call parse_cgi on a bunch of test strings and confirm the result is as expected; doing it on the combined form is much harder (tie or kluges involving pipe come to mind).

To be fair, if you're saying this because of our discussion in Vienna, it's not that many programmers have issues with your blog or that the people who said something similar in that conversation meant that you should do anything different.

I said that you had a limited audience (in a non-bad way) because your readers had to actually know quite a bit of craft to follow you. That's not a bad thing. The Perl world needs that too. There has to be something for people to follow once they learn how the basics of programming or testing.

@brian: our discussion in Vienna was the catalyst, but not the initial cause. This has been kicking around in my brain for a while. I just keep remembering how hard it was when I was leading Portland.pm to get material aimed at new people. We had (have) a group of very strong Perl programmers, some of the best known in the world, and trying to get a new programmer to stand in front of them was hard. Likewise, the Portland pros tended to have talks on OO programming, Parrot, prototyped-based programming, hacking Perl internals and other topics which a new programmer just won't get.

I want to try and encourage conversation at all levels, so maybe my nifty debugger hacks aren't helping newer programmers :) I also considered starting a second blog here, one aimed at developers coming to Perl. That would make it easier for some folks to filter which content they wanted to read.

CGI::Simple is a fork of CGI.pm that is written in a clearer style, removing some legacy cruft. But note the entries that remain in the bug tracker for it. CGI::Simple has far fewer users and there are sometimes in bugs in CGI::Simple that have been fixed in CGI.pm.

Today I'm thinking of a fix multi-line header parsing which will appear in the next release of CGI::Simple:

Completely agree with returning the reference. Just recently I've been bitten by returning list -- I returned list of arrayrefs and sent that to Template Toolkit. Worked well until there was only one item in the list (TT traversed the arrayref as top-level list). Took me quite some time to figure out.

Btw, thanks for your testing series. I started using test just about half year ago and it really helping me find better ways how to test various scenarios.

given that this is supposedly a 'beginner's post' I wonder if you could explain why you tend to use => instead of , in the push/split invocations; is it for visibility? some other reason I haven't yet divined in my quest towards perl monkdom?

Thanks for the post and the blog in general, it's been very useful in stretching my brain towards a greater perl understanding.

@Marco: I can't say it's a particularly good practice (particularly since it's going to break in Perl 6), but early on in my Perl programming, I found I preferred the visual distinction between the subject and predicate in certain constructions for which this is appropriate:

my $result = join "\n" => @subject;
@subject = split /\n/ => $result;

I simply found that clearer than the alternative:

my $result = join "\n", @subject;
@subject = split /\n/, $result;

Separating the subject and predicate this way made sense to me, but it's something I should probably stop since it gives people pause and confusing newer programmers is not a good thing. (It's not a bad thing, either, but only in the sense that there's a difference between confusing code being a side-effect or a result of deliberate practice).

Leave a comment

Name

Email Address

URL

Remember personal info?

Comments
(You may use HTML tags for style)

About Ovid

Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant.
Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.