<update mode=confession> I know that I posted this years ago, and other bugs in the post are not being fixed, but this one is a big one. When I wrote the node I didn't realize that the term functional programming was already in use for something different than what I describe below. As a result I seem to have introduced a bad meme into the Perl world. Read on, but do understand that what I describe as functional programming isn't quite what is normally meant by the term...</update>

One of my favorite quotes of all time on programming comes
from Tom Christiansen:

A programmer who hasn't been exposed to all four of the imperative,
functional, objective, and logical programming styles has one or
more conceptual blindspots. It's like knowing how to boil but not
fry. Programming is not a skill one develops in five easy lessons.

Absolutely. Perl does the first three very naturally,
but most Perl programmers only encounter imperative and
objective programming. Many never even encounter
objective, though that is not a problem for many here.

However I am a fan of functional programming. I mention
this from time to time but say that it takes a real code
example to show what I mean. Well, I decided to take my
answer at RE (tilly) 1: Regexes vs. Maintainability to Big, bad, ugly regex problem and turn it into
a real program. What I chose to do is implement
essentially the check specified at
Perl Monks Approved HTML tags, with an escape mode,
checks to make sure tags balance, and some basic error
reporting when it looks like people were trying to do
stuff but it isn't quite right.

If you have ever wondered what possible use someone
could find for anonymous functions, closures, and that
kind of thing, here is your chance to find out.

For those who do not know what Tom was talking about, here
are some quick definitions:

Imperative programming is when you spend your
time telling the computer what to do.

Functional programming is when you build some functions,
put them together into bigger ones, and when you are done
run your masterpiece and call it a day.

Objective programming is when you access properties of
things and tell them to accomplish tasks for you without
caring what happens under the hood.

Logical programming starts with a series of
rules and actions. When no more rules require acting on,
you are done.
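
To make the functional definition a little more concrete, here is a
minimal sketch of "build some functions, put them together into bigger
ones". The names compose, make_wrapper, $trim and $lower are made up
for this illustration and are not part of the program discussed below.

```perl
use strict;
use warnings;

# compose(f, g, ...) returns a new function that applies the given
# functions left to right, feeding each result to the next one.
sub compose {
    my @funcs = @_;
    return sub {
        my @args = @_;
        @args = $_->(@args) for @funcs;
        return wantarray ? @args : $args[0];
    };
}

# Small building blocks, each an anonymous function.
my $trim  = sub { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s };
my $lower = sub { my $s = shift; return lc $s };

# make_wrapper() is a closure factory: every function it returns
# remembers its own $tag.
sub make_wrapper {
    my $tag = shift;
    return sub { my $s = shift; return "<$tag>$s</$tag>" };
}

# Put the pieces together, then run the result.
my $tidy_bold = compose($trim, $lower, make_wrapper('b'));
my $result    = $tidy_bold->("  Hello World  ");
print "$result\n";    # prints "<b>hello world</b>"
```

Nothing happens until the last two lines. Everything before that only
sets up functions for later use, which is the pattern the rest of this
post follows.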

We encounter two of these a lot in Perl. Imperative
programming is just plain old procedural. It is what most
of us do most of the time. Its biggest benefit is that it
is close to what the computer does and how we think. Also
we see a lot of objective programming - that is usually
called Object-Oriented Programming. It offers a layer of
abstraction between what you want done and how it is done.
This hides details and makes the end code easier to
maintain. In the long run it also
has benefits in terms of code-reuse.

We don't encounter the other two as often. Logical
programming is familiar to some here from writing makefiles.
The win is that you can handle very complex sets of
dependencies without having to stop and think about what
exactly will happen. Since that is exactly the kind of
problem that make tries to solve, it is a good
fit there.

The most famous representative of functional programming is
Lisp. Functional programming offers similar benefits to
object-oriented programming. Where they differ for me is
that it lets you think about problems differently, and I
frequently find that I can very easily get all of my
configuration information into one place.

What follows is fairly complex, so I will present it in
pieces. If you have not seen functional programming this
will likely feel uncomfortable to read. I certainly found
the first examples I dealt with to be frustrating since I
could not figure out where anything happened. But
just remember that virtually everything here is just
setting up functions for later use, skip to the end if you
need to, and you should be fine.

Basic sanity

Import stuff, be careful, etc.

use strict;
use vars qw(@open_tags %handlers %bal_tag_attribs);
use HTML::Entities qw(encode_entities);
use Carp;

Configuration information

This has the configuration rules for all of the special
stuff we will allow. Tags and attributes, characters
that matter to html that we will let pass through, and
the [code] escape sequence. Note that I did
not use <code> because I wanted this to be easy to
post.

For a real site you might want several of these with
different rules for what is allowed depending on the
section and the user.
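
To give a feel for what per-section rules might look like, here is a
minimal sketch. The section names, the tag and attribute lists, and
the helpers tag_allowed and attrib_allowed are all assumptions made up
for illustration. They are not the configuration this code actually
uses.

```perl
use strict;
use warnings;

# Hypothetical rule sets, one per section of a site. Each maps a
# lower-case tag name to the attributes that tag may carry.
my %rules_for = (
    comment => {
        a => [qw(href name)],
        b => [],
        i => [],
        p => [],
    },
    signature => {
        b => [],
        i => [],
    },
);

# True if $tag may appear in $section at all.
sub tag_allowed {
    my ($section, $tag) = @_;
    my $rules = $rules_for{$section} or return 0;
    return exists $rules->{ lc $tag } ? 1 : 0;
}

# True if $attr is allowed on $tag in $section.
sub attrib_allowed {
    my ($section, $tag, $attr) = @_;
    my $rules   = $rules_for{$section} or return 0;
    my $allowed = $rules->{ lc $tag }  or return 0;
    return scalar grep { $_ eq lc $attr } @$allowed;
}

print tag_allowed('comment', 'A') ? "ok\n" : "denied\n";
```

Keeping the rules in plain data like this is what lets all of the
configuration live in one place, with the functions built from it
elsewhere.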

Implementation of the handlers

This is the most complicated section by far. This sets up
functions for all of the tasks we will want done. If you
were to rewrite this program in a procedural fashion, you
would find that the logic in this section either would be
repeated many, many times, or else you would get into a lot
of very complex case statements.

That we don't need to do that shows code reuse in action!

A complex site would probably need more in this section
(for instance the linking logic we use), but when all is
said and done, not much.
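
As a rough idea of what one of these handler generators could look
like, here is a minimal sketch for tags that take no attributes. It
follows the handler contract described in the scrub_input documentation
(the handler gets a reference to the raw text with pos() just past its
name, returns the text to insert, and sets pos() to 0 if it handled
nothing). The name ret_simple_tag_handlers and the exact matching rules
are assumptions for illustration, not the handlers from this post.

```perl
use strict;
use warnings;

our @open_tags;    # stack of currently open tags, shared with the caller

# Hypothetical generator: returns (name => handler) pairs for the
# opening and closing form of each attribute-free tag.
sub ret_simple_tag_handlers {
    my @pairs;
    for my $tag (map { lc $_ } @_) {
        push @pairs, "<$tag" => sub {
            my $t = shift;                 # reference to the raw text
            if ($$t =~ /\G\s*>/gc) {       # pos() is just past "<tag"
                push @open_tags, $tag;
                return "<$tag>";
            }
            pos($$t) = 0;                  # signal "nothing handled"
            return '';
        };
        push @pairs, "</$tag" => sub {
            my $t = shift;
            # Only close the tag that was opened most recently.
            if (@open_tags and $open_tags[-1] eq $tag
                and $$t =~ /\G\s*>/gc) {
                pop @open_tags;
                return "</$tag>";
            }
            pos($$t) = 0;
            return '';
        };
    }
    return @pairs;
}

# Drive one handler by hand, the way the main loop would.
my %demo = ret_simple_tag_handlers(qw(b i));
my $raw  = "<b>bold";
pos($raw) = 2;                             # as if "<b" had just matched
my $out = $demo{'<b'}->(\$raw);
print "$out, continue at ", pos($raw), "\n";
```

One generator covers every such tag. Each returned closure remembers
its own $tag, which is where the repeated logic or the big case
statement of a procedural version goes away.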

The actual function you call

Note that the documentation of and in the function
is as long as the function itself. However this is
naturally polymorphic. Even for a very complex site,
you likely wouldn't need to touch this function.

In fact you can understand pretty much everything
right here. You can add a lot of functionality
without getting in the way of your picture of how it
all works!

=head1 B<scrub_input()>

  my $scrubbed_text = scrub_input($raw_text, [$handlers]);

This takes a string and an optional ref to a hash of handlers,
and returns html-escaped output, except for the sections handled
by the handlers which do whatever they want.

If handlers are passed their names should be lower case and start
with a character matching [^\w\s\d] or else they will not be
matched properly. While parsing the string, when a name can be
matched case insensitively, its handler is called. It will be
passed a reference to $raw_text right after the match on the name
of the handler. (pos($raw_text) will point to the end of the
name.) It should return the text to be inserted into the output,
and set pos($raw_text) to where to continue parsing from, or 0 if
no text was handled.

Two special handlers that may be used are "pre" and "post" that
will be called before and after (respectively) the raw text is
processed. For consistency they also get a reference to the raw
text.

If no handler is passed, it will use \%handlers instead.

=cut

sub scrub_input {
    my $raw = shift;
    local @open_tags;
    my $handler = shift || \%handlers;
    $handler->{pre} ||= sub {return '';};
    $handler->{post} ||= sub {
        return join '', map "</$_>", reverse @open_tags;
    };

    my $scrubbed = $handler->{pre}->(\$raw);
    # This would be faster with the trie code from node 30896. But
    # that is not the point of this example so I have not done that.
    # Also note the next line is meant to force an NFA engine to
    # match the longest alternative first
    my $re_str = join "|",
        map {quotemeta} reverse sort keys %$handler;
    my $is_handled = qr/$re_str/i;

    while ($raw =~ /\G([\w\d\s]*)/gi) {
        $scrubbed .= $1;
        my $pos = pos($raw);
        if ($raw =~ /\G($is_handled)/g) {
            $scrubbed .= $handler->{ lc($1) }->(\$raw);
        }
        unless (pos($raw)) {
            if (length($raw) == $pos) {
                # EXIT HERE #
                return $scrubbed . $handler->{post}->(\$raw);
            }
            else {
                my $char = substr($raw, $pos, 1);
                pos($raw) = $pos + 1;
                $scrubbed .= &encode_entities($char);
            }
        }
    }
    confess("I have no idea how I got here!");
}

For the curious I actually wrote the configuration section
first, then the function at the end and useless handlers.
(Actually ret_tag_handlers returned an empty
list.) I then grew the handlers. First I added the escape
mode. Then I started escaping tags with no attributes
allowed. Next came closing tags. And finally tags with
attributes allowed.

Hello, I am the HTML::Parser nazi. I go around commenting on other people's attempts to parse HTML, and I always yell at them for trying to do something that's very hard themselves, and tell them to use HTML::Parser.

HTML::Parser would actually be an awful fit for this
problem. If you don't believe it, try to duplicate the
functionality the code already has.

The problem is that the incoming document is not HTML.
It is a document in some markup language, some of whose
tags look like html, but which isn't really. I don't want
to spend time worrying about "broken html" that I am going
to just escape. I don't want to worry about valid html
that I want to deny. I want to report custom errors. (Hey,
why not instead of just denying pre-monks image tags, also
give an error with a link to the FAQ?) And I want to include
markup tags you won't find in HTML.

I did a literal escape above using [code].
I submit that HTML::Parser would not help with
that. OK, so that should be <code> for this
site, but this site would want to implement a couple of
escapes I didn't. For instance the following handler would
be defined for this site for [ (assuming that
$site_base was
http://www.perlmonks.org/index.pl and hoping that
I don't make any typos):

And, of course, given $node_id there is
probably a function get_node_name available.
And we have that lastnode_id the site keeps
track of. So we also need a handler for [://
to link by ID, and that would be generated by something
like this:
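
A minimal sketch of what such a generator might look like, following
the same handler contract as scrub_input. Everything specific here is
an assumption for illustration: the name ret_id_link_handler, passing
the name lookup in as a code ref (standing in for get_node_name), the
node_id query parameter, and the small escape_html helper used in place
of encode_entities so the sketch needs no extra modules. Tracking of
lastnode_id is left out.

```perl
use strict;
use warnings;

# Stand-in for HTML::Entities::encode_entities.
sub escape_html {
    my $s   = shift;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $s =~ s/([&<>"])/$ent{$1}/g;
    return $s;
}

# Hypothetical generator for a link-by-ID handler. The returned
# closure remembers $site_base and the lookup function.
sub ret_id_link_handler {
    my ($site_base, $get_name) = @_;
    return sub {
        my $t = shift;                     # reference to the raw text
        if ($$t =~ /\G(\d+)\]/gc) {        # digits, then the closing ]
            my $id   = $1;
            my $name = escape_html($get_name->($id));
            return qq(<a href="$site_base?node_id=$id">$name</a>);
        }
        pos($$t) = 0;                      # signal "nothing handled"
        return '';
    };
}

# Made-up lookup table for the demo.
my %names   = (42 => 'Q&A');
my $by_id = ret_id_link_handler(
    'http://www.perlmonks.org/index.pl',
    sub { my $id = shift; return exists $names{$id} ? $names{$id} : "node $id" },
);

my $text = '[://42] rest';
pos($text) = 4;                            # as if "[://" had just matched
my $link = $by_id->(\$text);
print "$link\n";
```

The handler itself knows nothing about the site. The site-specific
parts are handed in when the handler is generated.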

My apologies for using you as a foil, but you just let me
illustrate Tom's point perfectly. All of the stuff I am
saying is obvious to anyone who has played with functional
techniques, but since you haven't you are simply unable to
see the amazing potential inherent in this method of code
organization. And I happen to know that you are
not a bad programmer, but this was a blind spot for you.

Time to put down the pot, we aren't boiling now. This is a
frying pan and I feel like an omelette. :-)

No, you're absolutely right that HTML::Parser in and of itself wouldn't do a good job for this specific problem. My point in posting was really twofold:

Writing a parser like this is very very difficult to get right, and usually it's better to find an existing tool that's already been stress-tested. You got it right because you know what you're doing, but I doubt many others would be able to execute like that.

Ovid's initial problem, which apparently was the seed for your post, was tailor-made for HTML::Parser.

As far as functional programming goes, I'm not a stranger (I just recently replaced Perl code to walk two trees and find differences with compiled ML because it was faster and more conceptually simple), and I certainly support seeing more functional Perl. I'm not necessarily convinced that functional techniques helped in this particular program that much; my claim is that it worked so well because of the strength of the programmer. However, I do appreciate the elegance of the solution. But I continue to submit that your average Perl programmer would botch this problem subtly, and it would make more sense for them to use some pre-rolled solution.

Nice. If I had a bit more time at the moment I'd try writing
this in Haskell just as an exercise to see what it would
look like in a real functional language. :) I might try
it in a month or so's time, once I've finished exams.

I have to agree with you that all programmers should try
their hand at all the programming styles, even if it's just
so they can experience the joy of functional programming
first hand!

This book has changed my programming world from ground up.
Ok, agreed, it is "challenging", as they like to put it
in some reviews. Don't mind. Go forward. Buy it. Read it.
Study it. Do the exercises. It's simply the best single book
on programming that I know, and the only book that I know
that really captures a lot of the "Zen" of programming.

Update: This book is (incredibly) now
available online.
Thanks to the Arsdigita people! Great service to all programmers.

OK, this reimplementation is for nate. I really did
start out thinking I was going to make the handler modal,
but when I got done it just didn't make sense. For
instance the hack to handle tables (which was really the
reason for wanting it in the first place) would have
resulted in a circular reference which is a memory leak.

What I did instead was add several hooks for pre and
post filters, and make the final routine return a
subroutine that processes markup. It would be possible to
actually use this in a modal way, just set the pos() to
the end of the string and then have the post hook set the
pos() to whatever you wanted it to be. I did most of the
work required, I just didn't think it made sense for the
problem at hand.

For people other than nate, this change fixes a few minor
bugs, can be used to handle the attribute "checked" (that
ability is not shown here), can be used to allow additional
validations of attribute values, and (very important) allows
you to refuse to let new table tags be opened outside of
a table that you have opened.

The page shows off the fixpoint combinator, the fold combinator,
closures, higher-order functions, and implementations of
a few algorithms on lists. It's noteworthy how
easy it was to translate these algorithms from Scheme to Perl.
Even the location of parentheses is sometimes similar to
that in Scheme notation. The page finishes with examples of
improper and circular lists.

The parser fully supports XML namespaces, character and
parsed entities, xml:space, CDATA sections, nested entities,
attribute value normalization, etc. The parser
offers support for XML validation, to a full or
a partial degree. It does not use assignments at all.
It could be used to parse HTML too.

One tiny nit in an otherwise great piece of code:
return show_err("Unended <$tag> detected");
I suspect that show_err() needs to escape HTML entities. Wrapping its arguments in <font> tags isn't sufficient.

That is one of several small mistakes in the code. Another
is that there is only one allowed font attribute and it
makes no sense. Another is that the RE engine has a bit of
behaviour that I didn't understand when I wrote the code,
and so I need to somewhere insert
pos($raw) = pos($raw);. I leave verification
that this is not a no-op, plus discovery of how this can
lead to a bug, to a close reading of perlre. An important
one pointed out by nate is that in reality the post will
appear inside of a layout which is itself done with tables.
Various tags that start new parts of a table should only
be allowed inside of a table that you start. Plus he
pointed out that some HTML tags take attributes which do
not follow the usual pattern, for instance checkboxes can
be "checked".

For these reasons and more, I did a rewrite at Functional take 2
which should have somewhat fewer bugs. I long ago made
the decision that (partly because this site does not keep
revision histories) I wanted to leave the original as it
was, flaws and all.
