This is the seventh part of a nine-part article on Perl one-liners. Perl is not Perl without regular expressions, so in this part I will come up with and explain various Perl regular expressions. Please see part one for the introduction to the series.

Perl one-liners is my attempt to create "perl1line.txt", a file similar to the "awk1line.txt" and "sed1line.txt" files that have been so popular among Awk and Sed programmers and Unix sysadmins. I will release perl1line.txt in the next part of the series.

109. Match something that looks like an IP address.

/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/

This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP: one to three digits followed by a dot, three times over, and then one to three more digits. For example, it matches the valid IP 81.198.240.140, and it also matches an invalid IP such as 923.844.1.999.

Here is how it works. The ^ at the beginning of regex is an anchor that matches the beginning of string. Next \d{1,3} matches one, two or three consecutive digits. The \. matches a dot. The $ at the end is an anchor that matches the end of the string. It's important to use both ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.

This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:

/^(\d{1,3}\.){3}\d{1,3}$/
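
To see that the grouped form behaves like the spelled-out one, here is a minimal sketch. It assumes the spelled-out regex is simply four \d{1,3} parts separated by dots; the variable names are made up for illustration, and the sample strings come from the text above.

```perl
use strict;
use warnings;

# Spelled-out form and grouped form of the "looks like an IP" regex.
my $long    = qr/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
my $grouped = qr/^(\d{1,3}\.){3}\d{1,3}$/;

# Sample strings taken from the text above.
my @samples = ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar');

for my $s (@samples) {
    my $r1 = $s =~ $long    ? 'match' : 'no match';
    my $r2 = $s =~ $grouped ? 'match' : 'no match';
    print "$s: long=$r1, grouped=$r2\n";
}
```

Both forms should agree on every input: the first two samples match, and the third is rejected by the anchors.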

110. Test if a number is in range 0-255.

/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/

Here is how it works. A number can be one, two or three digits long. If it's a one-digit number, we allow it to be anything in [0-9]. If it's a two-digit number, we also allow any combination of [0-9][0-9]. However, if it's a three-digit number, it has to be either one hundred-something or two hundred-something. If it's one hundred-something, then 1[0-9][0-9] matches it. If it's two hundred-something, then it's either something up to 249, which is matched by 2[0-4][0-9], or it's 250-255, which is matched by 25[0-5].
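
One way to convince yourself that the alternatives cover exactly 0-255 is to run the regex over a range of integers. This is only a quick sanity check, not a proof; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $in_range = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Collect every integer from 0 to 999 whose decimal form matches.
my @matched = grep { $_ =~ $in_range } 0 .. 999;

# prints: matched 256 numbers, from 0 to 255
print "matched ", scalar(@matched), " numbers, from $matched[0] to $matched[-1]\n";
```

Note that the [0-9][0-9] alternative also accepts strings with a leading zero, such as 07, while 007 is rejected.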

111. Match an IP address.

This regexp combines the previous two. The qr/.../ operator compiles a regular expression, and my $ip_part = qr/.../ puts the compiled regex in the $ip_part variable. Then $ip_part is used to match each of the four parts of the IP address.
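
Here is a minimal sketch of that composition. It assumes the 0-255 regex from one-liner 110 is the building block, and looks_like_valid_ip is a hypothetical helper name used only for this example.

```perl
use strict;
use warnings;

# One part of an IP address: a number in the range 0-255 (from one-liner 110).
my $ip_part = qr/([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;

# Hypothetical helper: 1 if $ip is four such parts separated by dots, else 0.
sub looks_like_valid_ip {
    my ($ip) = @_;
    return $ip =~ /^($ip_part\.){3}$ip_part$/ ? 1 : 0;
}

print looks_like_valid_ip('81.198.240.140'), "\n";   # prints 1
print looks_like_valid_ip('923.844.1.999'),  "\n";   # prints 0
```

Keeping the part in its own qr// variable means the range logic is written once and reused for all four positions.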

112. Check if the string looks like an email address.

/.+@.+\..+/

This regex makes sure that the string looks like an email address. Notice that I say "looks like". It doesn't guarantee it is an email address. Here is how it works: first it matches something up to the @ symbol, then it matches as much as possible until it finds a dot, and then it matches some more. If this succeeds, then it's something that at least looks like an email address, with the @ symbol and a dot in it.

For example, cats@catonmat.net matches but cats@catonmat doesn't because the regex can't match the dot \. that is necessary.

A much more robust way to check whether a string is a valid email address is to use the Email::Valid module.
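
A minimal sketch of such a check follows. Email::Valid is a CPAN module rather than part of core Perl, so the sketch loads it at run time and degrades gracefully when it is missing; is_valid_email is a hypothetical helper name, and the module's default settings are assumed.

```perl
use strict;
use warnings;

# Email::Valid comes from CPAN; load it only if it is installed.
my $have_email_valid = eval { require Email::Valid; 1 };

# Hypothetical helper: 1 or 0 according to Email::Valid,
# or undef when the module is not available.
sub is_valid_email {
    my ($addr) = @_;
    return undef unless $have_email_valid;
    # address() returns the address on success and undef on failure.
    return defined( Email::Valid->address($addr) ) ? 1 : 0;
}

my $verdict = is_valid_email('cats@catonmat.net');
print defined($verdict)
    ? "verdict: $verdict\n"
    : "Email::Valid is not installed\n";
```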

113. Check if the string is a number.

Checking if a string is a number is really difficult. I based my regex and explanation on the one in the Perl Cookbook.

Perl offers \d that matches digits 0-9. So we can start with:

/^\d+$/

This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $. However this doesn't match numbers such as +3 and -3. Let's modify the regex to match them:

/^[+-]?\d+$/

Here the [+-]? means match an optional plus or a minus before the digits. This now matches +3 and -3 but it doesn't match -0.3. Let's add that:

/^[+-]?\d+\.?\d*$/

Now we have expanded the previous regex by adding \.?\d*, which matches an optional dot followed by zero or more digits. Now we're in business and this regex also matches numbers like -0.3 and 0.3.
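
The three stages can be compared side by side with a small sketch. The stage labels and the extra inputs .5 and 3. are my own additions for illustration.

```perl
use strict;
use warnings;

# The three stages of the regex, in the order the text builds them.
my @stages = (
    [ 'digits only'   => qr/^\d+$/ ],
    [ 'optional sign' => qr/^[+-]?\d+$/ ],
    [ 'optional dot'  => qr/^[+-]?\d+\.?\d*$/ ],
);

for my $input ('3', '+3', '-3', '-0.3', '0.3', '.5', '3.') {
    # Names of all stages whose regex accepts this input.
    my @hits = map { $_->[0] } grep { $input =~ $_->[1] } @stages;
    printf "%-5s %s\n", $input, @hits ? join(', ', @hits) : 'no match';
}
```

Running it shows two edge cases of the final regex: it accepts 3. (a trailing dot with no digits) and rejects .5 (no digits before the dot).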

A much better way to match a decimal number is to use the Regexp::Common module, which offers various useful regexes. For example, to match an integer you can use $RE{num}{int} from Regexp::Common.

How about positive hexadecimal numbers? Here is how:

/^0x[0-9a-f]+$/i

This matches the hex prefix 0x followed by the hex number itself. The /i flag at the end makes sure that the match is case insensitive. For example, 0x5af matches, 0X5Fa matches but 97 doesn't, because it's just a decimal number.

It's better to use $RE{num}{hex} because it supports negative numbers, decimal places and number grouping.

Now how about octal? Here is how:

/^0[0-7]+$/

Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because it's not a valid octal number.

It's better to use $RE{num}{oct} for the same reasons as above.

Finally binary:

/^[01]+$/

Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.

It's better to use $RE{num}{bin} for the same reasons as above.
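
The hex, octal and binary regexes can be tried together on the examples from the text. This is just an illustrative sketch; the %looks_like hash is a made-up name.

```perl
use strict;
use warnings;

my %looks_like = (
    hex    => qr/^0x[0-9a-f]+$/i,
    octal  => qr/^0[0-7]+$/,
    binary => qr/^[01]+$/,
);

for my $input ('0x5af', '0X5Fa', '97', '013', '09', '010101', '210101') {
    # Names of all the regexes that accept this input, in sorted order.
    my @kinds = grep { $input =~ $looks_like{$_} } sort keys %looks_like;
    print "$input: ", (@kinds ? join(', ', @kinds) : 'none'), "\n";
}
```

One thing the output makes visible is that the patterns overlap: 010101 is accepted by both the binary and the octal regex, so the regexes alone cannot tell you which base was intended.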

114. Check if a word appears twice in the string.

/(word).*\1/

This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to contents of group 1, therefore it's almost the same as writing /(word).*word/

For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
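
As a sketch, the same check can be wrapped in a small function that takes the word as a parameter. The name appears_twice is hypothetical, and \Q...\E is added so that any regex metacharacters in the word are taken literally.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the literal $word occurs at least twice in $str.
sub appears_twice {
    my ($word, $str) = @_;
    # \Q...\E quotes any regex metacharacters in $word.
    return $str =~ /(\Q$word\E).*\1/ ? 1 : 0;
}

print appears_twice('silly', 'silly things are silly'),  "\n";   # prints 1
print appears_twice('silly', 'silly things are boring'), "\n";   # prints 0
```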

115. Increase all numbers by one in the string.

$str =~ s/(\d+)/$1+1/ge

Here we use the substitution operator s///. It matches all integers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.

For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
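
Here is the same substitution as a small runnable sketch. It works on a copy of the string so the original stays intact; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = 'this 1234 is awesome 444';

# /e evaluates the replacement as Perl code; /g repeats it for every match.
(my $incremented = $str) =~ s/(\d+)/$1+1/ge;

print "$incremented\n";   # prints: this 1235 is awesome 445
```

Because $1+1 is ordinary numeric addition, leading zeros are not preserved: 007 becomes 8.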

116. Extract HTTP User-Agent string from the HTTP headers.

/^User-Agent: (.+)$/

HTTP headers are formatted as Key: Value pairs. It's very easy to parse such strings: you just instruct the regex engine to save the Value part in the $1 group variable.
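
A minimal sketch of the extraction, applied line by line. The header lines below are made up for illustration and are assumed to have had their line endings already stripped.

```perl
use strict;
use warnings;

# Made-up sample headers, one per element, with line endings removed.
my @header_lines = (
    'Host: example.com',
    'User-Agent: ExampleBrowser/1.0 (hypothetical)',
    'Accept: */*',
);

my $user_agent;
for my $line (@header_lines) {
    if ($line =~ /^User-Agent: (.+)$/) {
        $user_agent = $1;   # the Value part captured by (.+)
        last;
    }
}

print "$user_agent\n";   # prints: ExampleBrowser/1.0 (hypothetical)
```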

117. Match printable ASCII characters.

/[ -~]/

This is really tricky and smart. To understand it, take a look at man ascii. You'll see that space has the value 0x20 and the ~ character is 0x7e. All the characters from space through ~ are printable. This regular expression matches exactly that: [ -~] defines a range of characters from space to ~. This is my favorite regexp of all time.

You can invert the match by placing ^ as the first character in the character class:

/[^ -~]/

This matches the opposite of [ -~].
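
One practical use of the inverted class is stripping non-printable characters from a string. A minimal sketch, with a made-up input string:

```perl
use strict;
use warnings;

# A string containing a tab (\t) and a bell (\a) among printable characters.
my $str = "tab\there, bell\a, plain text";

# Delete every character outside the space..~ range, working on a copy.
(my $clean = $str) =~ s/[^ -~]//g;

print "$clean\n";   # prints: tabhere, bell, plain text
```

Each of these character classes matches a single character at a time, which is why the substitution needs the g flag to cover the whole string.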

118. Match text between two HTML tags.

m|<strong>([^<]*)</strong>|

This regex matches everything between <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.

Alternatively you can write:

m|<strong>(.*?)</strong>|

But this is a little different. For example, if the HTML is <strong><em>hello</em></strong>, then the first regex doesn't match at all: a < immediately follows <strong>, so ([^<]*) can only match the empty string, and the </strong> that must come next doesn't match <em>. The second regex matches and captures <em>hello</em>, because (.*?) matches as little as possible until it finds </strong>.
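
The difference between the two regexes can be observed with a short sketch. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $negated    = qr|<strong>([^<]*)</strong>|;
my $non_greedy = qr|<strong>(.*?)</strong>|;

for my $html ('<strong>hello</strong>', '<strong><em>hello</em></strong>') {
    # In list context a match returns its captures, or nothing on failure.
    my ($first)  = $html =~ $negated;
    my ($second) = $html =~ $non_greedy;
    printf "%s\n  negated class: %s\n  non-greedy:    %s\n",
        $html,
        defined($first)  ? "'$first'"  : 'no match',
        defined($second) ? "'$second'" : 'no match';
}
```

For the plain <strong>hello</strong> input both regexes capture hello; for the nested input only the non-greedy one matches.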

However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.

119. Replace all <b> tags with <strong>

$html =~ s|<(/)?b>|<$1strong>|g

Here I assume that the HTML is in the variable $html. The <(/)?b> part matches both opening and closing <b> tags and captures the optional closing-tag slash in group $1. The matched tag is then replaced with either <strong> or </strong>, depending on whether it was an opening or a closing tag.
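
A runnable sketch of the replacement follows. Note one deliberate deviation: it writes the group as (/?) instead of (/)?, so that $1 is an empty string rather than undefined for opening tags, which avoids an "uninitialized value" warning under use warnings. The sample HTML is made up.

```perl
use strict;
use warnings;

my $html = 'A <b>bold</b> word and <b>another</b>.';

# (/?) always takes part in the match, so $1 is '' for <b> and '/' for </b>.
$html =~ s|<(/?)b>|<$1strong>|g;

# prints: A <strong>bold</strong> word and <strong>another</strong>.
print "$html\n";
```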

120. Extract all matches from a regular expression.

my @matches = $text =~ /regex/g;

Here the regular expression gets evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable.

For example, the following regex extracts all numbers from a string:

my $text = "10 hello 25 moo 31 foo";
my @nums = $text =~ /\d+/g;

@nums now contains (10, 25, 31).
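
A related detail worth sketching: when the regex contains capture groups, the list returned in list context holds the captured pieces rather than the whole matches. The second regex below is my own example, not one from the list above.

```perl
use strict;
use warnings;

my $text = "10 hello 25 moo 31 foo";

# No capture groups: each list element is one whole match.
my @nums = $text =~ /\d+/g;

# With a capture group, the list holds the captured pieces instead.
my @words_after_nums = $text =~ /\d+ (\w+)/g;

print "@nums\n";               # prints: 10 25 31
print "@words_after_nums\n";   # prints: hello moo foo
```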

Perl one-liners explained e-book

I've now written the "Perl One-Liners Explained" e-book based on this article series. I went through all the one-liners, improved explanations, fixed mistakes and typos, added a bunch of new one-liners, added an introduction to Perl one-liners and a new chapter on Perl's special variables. Please take a look:

Nicely done, but #119 is in error as of this writing. It matches a single digit, or two digits, or a one or two followed by two digits under six. That means it won't match its own one-liner number! Or any other useful numbers like 192 and 168 and 127, which come up a lot in IP addresses.

I believe a correct regexp would be:

/^(?:1?\d{1,2}|2[0-4]\d|25[0-5])$/

That matches:

- any one or two digits, optionally starting with "1", so that's 0-9, 00-99, and also 100-199 and (redundantly but harmlessly) 10-19 again
- 200 to 249
- 250 to 255

/^\d+$/ is faulty. Firstly, /$/ doesn't just match end-of-string, it will also match at a newline followed by end-of-string. So the regexp matches "1\n" as well as "1". Many of your regexps suffer this flaw.

Secondly, /\d/ is not a synonym for /[0-9]/. It also matches many Unicode digit-like characters. So your regexp matches "\x{666}", which probably doesn't look sufficiently like a number for whatever you were planning to do with it. Many of your regexps have this flaw too.

So /^\d+$/ should be /\A[0-9]+\z/, and many of the other regexps should be amended similarly.
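
Both points can be checked with a short sketch. The labels and variable names are made up for illustration; the inputs are the ones mentioned in the comment above.

```perl
use strict;
use warnings;

my $loose    = qr/^\d+$/;
my $anchored = qr/\A[0-9]+\z/;

my @cases = (
    [ 'plain 1'              => "1" ],
    [ '1 plus newline'       => "1\n" ],
    [ 'Arabic-Indic digit 6' => "\x{666}" ],
);

for my $case (@cases) {
    my ($label, $input) = @$case;
    # Print 1 or 0 for each regex instead of the raw input string.
    printf "%-22s loose=%d anchored=%d\n", $label,
        ($input =~ $loose    ? 1 : 0),
        ($input =~ $anchored ? 1 : 0);
}
```

Only the plain 1 is accepted by both; the other two inputs pass the loose regex and are rejected by the anchored one.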

#112 is VERY bad. If you ever have to validate an email address, you might want to spend a few minutes and find a regex that does it right. It is a complex problem that does deserve some time. Do NOT roll your own solution just because you [think you] can.

In this case, enforcing the dot in the domain part is a bad idea, not only because it violates the standard (that would be RFC 822) but also because there are legitimate use cases. For example:

I know of one company that has two dozen (no kidding) different functions in their flagship product that are supposed to validate email addresses. Most of them are non-trivial and definitely took some time to implement. Not one of them is correct.

Dear Peter Krumin, I know it's boring but please take a look at the specs from time to time, especially if you want to publish something like this. This I-don't-care-if-I-violate-the-standard attitude that we see so often these days (especially but not exclusively in corporate environments) really bugs me because in the end we all suffer from it. Right?

Regarding 109: in general it is a nasty approach to match something between the start and end of the string, as you may run into funny problems with it. I'd recommend using \b, which matches a word boundary (please look into the Perl regexp specs for it).

Regarding 112: Although it has been criticized a lot already, there is one more significant problem: you match .+ before the @, which means "everything before @". Please note that "everything" may include spaces and other things you do not want to see.

117: Important to note you are matching a single character this way, not characters. :)

I did not check the others in detail, but well, it's a good idea to collect this kind of stuff. Good luck!

This pattern is tested on regexpal.com. It works, but if any special character appears at the front (e.g. @, %, nbsp, etc.), it is highlighted as well.
How do I unhighlight that special character?