Friday, October 9, 2015

Some days have passed since the last time we did gym and by now you should be properly rested and ready for another Powershell workout. Today we will start using modulo as a way to find out prime numbers, and then we will move to some kind of pretty obfuscated code to get the very same piece of information.

THE THEORY OF PRIME NUMBERS

A prime number is a number greater than 1 that has no positive divisors other than 1 and itself. For example, 13 is prime, as only 1 and 13 divide it, whereas 14 is not, since it has the divisors 2 and 7 in addition to 1. Nothing new here since those days at school.

A lot of brilliant people have been involved in projects around the world to find the largest known prime number, since they are infinite. Just out of curiosity, you may want to know that the largest known prime number to date is a number with more than 17 million digits and can only be written with this notation:

For sure you can’t use Powershell nor any standard CPU architecture to calculate that, but we can all the same have some fun with the code without going out hunting big numbers.

USING MODULO TO FIND PRIMES

Let’s see how that modulo operator in Powershell can be used to find primes.

Different solutions exist.

The first method, which comes out straight from prime definition, is based on trial division:

Basically, when $Number is less than 2, that’s not a prime, by definition. When $Number is 2, that’s a prime, also by definition. And then we proceed with some kind of brute force method, that is, we check for divisibility of $Number against any number between 2 and $Number-1 until we find one. If we find at least a divisor, then $Number is not prime. We say in that case that $Number is composite.

Here you should notice the line:

if (!($Number % $i))

which returns 0 when $Number can be divided by $i, because the remainder is 0, which the if statement smartly interprets as a [boolean]$False.

The previous algorithm is simple but inefficient because it uses trial division up to $Number-1, which is far from being an optimal method if finding primes.

One way to improve this algorithm is to test only up to sqrt($Number).

Now the question is: why do we check up to the square root of a prime number to determine if it is prime? As explained out on StackOverflow: "If a number n is not a prime, it can be factored into two factors a and b: n = a*b. If both a and b were greater than the square root of n, a*b would be greater than n. So at least one of those factors must be less or equal to the square root of n, and to check if n is prime, we only need to test for factors less than or equal to the square root."

To get better performance, we have to go back in time and take advantage of a pretty ancient method to find out if a number is a prime. This method is called by the name of its inventor: the sieve of Eratosthenes (to give you an idea, this man calculated the Earth's circumference around 240 BC just looking down a deep well in the middle of the Egyptian desert near Aswan). In his method you take a list of all the integers less than or equal to n (and greater than one), then you filter out the multiples of all primes less than n. The numbers that are left are all primes.

To make it simple, suppose you have the numbers from 1 to 13. Write them down:

1 2 3 4 5 6 7 8 9 10 11 12 13

The number 1 is not a prime by definition, so knock it out. Then you filter out all the multiples of 2, apart from 2 itself, which is a prime:

2 3 5 7 9 11 13

Then you keep filtering out every multiple of 3, 4, 5 and so on. The numbers that are left are, you name it, primes:

2 3 5 7 11 13

To translate this to Powershell we need to build a sorted hashtable using a SortedDictionary:

In addition to this official answer (which is so Powershell 1.0), there were some interesting solutions, with some people going brute force while other went with a sieve. Shay Levy and Chris Warwick had interesting approaches. Shay went on with trial division using modulo (as we explained) up to ([math]::Sqrt($num)))

Chris used the sieve of Eratosthenes and added a minor but important optimization: he shaved off all the elements less than the square of the current prime, because they have already been set by previous factors:

for ($j=$i*$i; $j -le $max; $j+=$i) {$sieve[$j]=0}

There was also an interesting entry by Marc van Orsouw, who was able to solve this event with a slow but short one-liner that I kept for future reference:

2;2..200 |? {$i=$_;(2..($i -1) |% {$i%$_}) -notcontains 0}

Well, this is not a pure oneliner, because of the semi-colon. Nonetheless the approach is interesting and therefore I mentioned it. Based on what we said before, we know now that this oneliner can be improved by topping our tests against the square root of the number to test:

2..200|?{$i=$_;(2..[math]::sqrt($i)|%{$i%$_})-notcontains0}

The performance improvement is, again, impressive, moving the slider from 47 seconds to a bit more than 1 second:

This is starting to become juicy. Oneliners are a strange world, where you push the capabilities of a language to its further extent, you bend it, and sometimes you make it do things it was not conceived for. In these cases the code starts looking messy, obfuscated and cryptic. No fear to have tough. Keep thinking to it as to a game.

JUST ANOTHER PERL HACKER,

In the early 1980s, following the boom of personal computing, some people started to write oneliners in BASIC, opening the way to surprisingly obfuscated programming. Back in 1984, the American computer scientists Landon Curt Noll and Larry Bassel started the International Obfuscated C Code Contest (IOCCC), as a competition to write the most opaque C code.

The goals of the contest, as it is explained on the main page of the competition, are:

To write the most Obscure/Obfuscated C program within the rules.

To show the importance of programming style, in an ironic way.

To stress C compilers with unusual code.

To illustrate some of the subtleties of the C language.

To provide a safe forum for poor C code. :-)

Taking its inspiration from the International Obfuscated C Contest, the Obfuscated Perl Contest was run by The Perl Journal from 1996 until 2000. From that very moment Perl started achieving a pretty good reputation for recreational activities such as code obfuscation.As far as I was able to understand, this obfuscated Perl coding competition spread out in a community (whose members are known as Perl Monks) where programmers signed their messages with JAPHs.A JAPH is a Perl program that outputs the sentence “Just another Perl hacker," which is used as a signature.The first JAPH is one by Randal Schwartz(aka Merlyn) who in 1988 signed his message with:

print "Just another Perl hacker,"

Building on top of that, he started inventing new obfuscated JAPHs for his signatures:

$_=",rekcah lreP rehtona tsuJ";s/.$/eval 'print $&',""/e while length

Starting from this kind of obfuscated signature, Perl Monks have started to write their JAPHs at full speed, many of which can still be found at the JAPH archive I linked above.

ABIGAIL, REGEX AND PRIMES, ALL IN A JAPH

One of the most famous JAPH is one from the mind twister Perl programmer known as Abigail, who wrote a well-known signature which, using a regular expression, checks for primality of a given number.

Actually I found two different JAPH-style signatures by Abigail containing similar regexes, or regexen (which is the Anglo-Saxon plural of regex, as Larry Wall, the author or Perl, wrote).

The second JAPH, which dates back to September, 16 1998, is an improvement of the other, is:

perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/'

Now if you're not a Perl Monk (Abigail is, and I'm certainly not), you're probably quite confused as to how on earth this works.

There are at least a couple of good articles out there on the subject, but I'll try my hand at it, and I’ll add a conversion to Powershell code as well.

This JAPH uses the regex engine to do the math of checking primality on the number supplied as an argument. The number is retrieved via the shift function and translated into a sequence of 1s.

This means that if we supply 13 as input, it will translate to:

PS C:\> '1' * 13
1111111111111

Beware here on the order of the multiplicands, since Powershell will choose the data type on the first element of the operation. In this case we need the first element to be a string, so that the output is unary, hence the single quotes around the number 1:

Now the pattern matching starts and it is exactly here that the regular expression engine hidden inside our .NET framework kicks in.

The expression is made up of two parts, separated by a pipe, which is an alternation operator. It tells the regex engine to match either everything to the left of the pipe, or everything to the right.

Since, as we saw above, 0 and 1 are not primes by definition, the first part of the regex will match 0 and 1 as non-primes. That’s obtained through the use of a question mark, which is a greedy quantifier that makes the preceding token optional.

So 1? matches the number 1 zero or one time. Let's test it:

PS C:\> 1 -match '^1?$'
True
PS C:\> 11 -match '^1?$'
False

If it’s not exactly one character, as in the second example above which returns $False, the regex tries the right side of the alternation, which is where the calculation for numbers greater the 1 is performed in a very smart way. And that’s what made Abigail enter history of Perl (beside the fact of being one of the best known Perl 5 Porters):

^(11+?)\1+$

There are two parts to this: the capture (between parentheses) and the backreference. The backreference can match one or more times (\1+). So, the engine looks for complete groups, and nothing more, of the same thing. The actual thing we match hardly matter. It comes down being able to break up the string evenly into groups.

For instance, let's consider the number 12. The string for that is a sequence of twelve 1s:

111111111111

Can this be broken up into equal groups with nothing left over? Sure; there are six groups of twos:

11 11 11 11 11 11

The number of groups represents one factor (6), while the length of any one of the groups represents another factor. Since it finds two factors, the number must be composite, so not prime.

If we take the number 13:

1111111111111

And split it up in groups of two, we get:

11 11 11 11 11 11 1

There is a remainig 1, so the number is not composite, for the moment. The check keeps trying with all the numbers until it can get en even result. Split in groups of threes, we get:

111 111 111 111 1

Still a one left alone, so it’s not composite. Splitting in four:

1111 1111 1111 1

Nope. So the engine keeps splitting in increasing length groups until no option is left. This process is analogous to trying to divide a number by successively larger divisors, leaving no remainder. For n = 13 we tried 2, and that failed, and then 3, and then 4, and so on. No divisor succeeded, so that’s not composite number.

But, wait, weren’t we checking for primes, not composites? Yes. That’s where the initial inversion performed with !~ comes in. The !~ is a negated match, so the condition is true if the string does not match the regular expression.

Some people will complain that this algorithm is computationally expensive, or memory devastating. Some will say it is inefficient (why bothering checking for numbers greater than the square root of the length?). Some will say it looks so much like as the brute-forcing methods we saw up above in the beginning of this post.

Personally, I don’t mind. Clever is the word here, and the word fully applies to Abigail, whose algorithm is so powerful that it can find primes only by matching strings whose length is NOT prime.

BEYOND REGULAR EXPRESSIONS AND FINITE STATE MACHINES

I feel like here I must say a few words about Finite State Machines (FSM) and how they relate to Abigail's regex. I'll try to make it as simple as I can, and if after reading the whole section you find yourself left open-mouthed and saying "Huh?" to yourself... well, that's pretty normal. Just allow for some time for the concepts to sink in before saying that this is way over your head and that this blog post might as well be in Ancient Greek for the amount of sense it makes to you.

Regular expressions map directly to Finite State Machines (FSM). FSM, also known as Finite State Automata (FSA), are models of the behaviors of a system or a complex object, with a limited number of defined conditions or states, where state transitions change with circumstance.

In the particular regex found in Abigail’s JAPH, the left part of the expression fits perfectly in a Deterministic Finite State Machines model and can be represented with a State Diagram:

On the contrary, the brute-force happening on the right side can’t fit into a FSM model since it uses a backreference:

\1

The rule is that you cannot write pure regular expressions with backreferences, since mathematically-regular languages simply cannot include them: a backreference makes the language non-regular. The inclusion of backreferences moves us out of the regular languages and into what it’s called the "language of squares".

So, how is it that the expressions used to find composites number works? Well, the .NET regex engine (which is compatible with the Perl 5 machine used by Abigail), extends the basic regular expression language and it can be defined as Turing machine compatible. In this context, backreferences can be seen as a feature that allows the following diagram to be working successfully while it should not:

FROM PERL TO POWERSHELL

Being the .NET regex engine compatible with the Perl 5 regex engine, this expression can be ported into Powershell code as it is.

The negated match !~ translates into the –notmatch operator.

Here’s a first version of the translated code:

13|?{'1'*$_-notmatch"^1?$|^(11+?)\1+$"}

Since 13 is not composite, it will show up in the output of the oneliner:

PS C:\> 13|?{'1'*$_-notmatch"^1?$|^(11+?)\1+$"}
13

Using the range operator, we can even output the list of primes in a shot:

PS C:\> 1..13|?{'1'*$_-notmatch"^1?$|^(11+?)\1+$"}
2
3
5
7
11
13

For the purpose of code golfing, we can shave off a few more chars, using a syntax I like a lot, based on the Select-String cmdlets with the –notmatch parameter (-n):

PS C:\> 1..13|?{'1'*$_|sls "^1?$|^(11+?)\1+$"-n}
2
3
5
7
11
13

How about getting the same result using the [regex] class when we know that no negative matching method is offered?

Well, we know that an IsMatch method is available, which takes two parameters: a input in the form of a string, and a regular expression pattern in the form of a string as well.

So, to find composite numbers we just have to put up the following oneliner: