Note that this part is not lazy, because I later want to reuse these two collections several times. This part is not the most elegant I came up with. It is much like using a (ending:) slicing sequence, but ending: is not splitting the groups where I want to...

Another option would be to count the number of equal powers, and use that in a (limiting:) slicing sequence, but limiting: is not taking a block argument...

Approximately the same as the non Xtreams version (+4%). That's pretty good.

But this version has a tricky bug: stitching is raising Incomplete if there is no substream, and this happens for some numbers. For example,

9 optimizedPrimeFactorFactorialViaXtreams

produces the groups ((3) (5 7)) with the powers (4 1). When the bitIterator generator is evaluating with bit = 2, there is no group having a power with this bit set, Incomplete is raised by stitching, and caught by the generator which closes the stream without the final term (3 raisedTo: 4). Thus it returns a wrong result... Ouch! That's one reason for I dislike Exceptions.

My single test on a huge number did not fail, and I only discovered the bug with the generator free version:

Overall, I find Xtreams powerfull, far more than Smalltalk Stream. But there are still things lacking. For example, the grouping was not obvious. The stitching of empty streams was surprising. I might need a readingStitching that works with an empty stream.

Xtreams is interesting both for its functional like aspects and for its potential use in parallelism, a lot like this. From this later point of view, I find it easy to pipe serial operations with Xtreams in case of a single assembly line. It also seems possible to fork processing for each element when they are independent (this is data parallelism - and could be applied for performing karatsubaProduct of each term in above example, and thus get a very efficient parallel version of factorial).
On the other hand, having two parallel assembly lines does not seem that easy. There is a duplicating: operation that should allow such parallel streams, but it necessarily uses a writeStream. I'd like the assembly chain to be as easy as:

Funily, I used unix pipe | to denote serial and ampersand & to denote parallel, but I find the contrary more natural: || makes me think of parallelism, and & of sequencing. Just a point of view...

From above point of view, the differences between read and write streams seems a bit arbitrary. Filters are inherently both. The Generator pattern could be used to turn any WriteStream (push) into a ReadStream (pull), but it's current implementation has potential dead lock conditions. But wait, it's possible I missed some features, or miss-used some others, Xtreams is evolving rapidly. There is a new post about using Xtreams with parallel VM, I have to inquire a bit more!

MessageTally is not reliable with COG VM, but clearly, only one third of execution time is spent in the primeFactorFactorialViaXtreams. It seems like something is wrong with this image, and implementing the generator via interprocess synchronisation did favour those CPU hijackers!

SmallInteger is not large enough for Karatsuba and falls back to regular multiplication *, we omit the code here.The copyDigitsFrom:to: and lowestNDigits: are just using a primitive for splitting a LargePositiveInteger, they are omitted too.

We'd also better optimize squared which is a simple case:

LargePositiveInteger>>squared
"Eventually use a divide and conquer algorithm to perform the multiplication"

Better, but there is room toward the prime swing variant which performs the job in less than 7000ms. Let's adapt our strategy and use a third trick: try and square as many LargeInteger as possible. One idea is to group the terms having same powers, multiply them together, then raise to the prescribed power. But if term p is raised to the power of 7, we will have to evaluate (p squared squared) * p squared * p. So we will store p into three collections,

the collection of terms raisedTo: 1;

the collection of terms raisedTo: 2;

the collection of terms raisedTo: 4.

The last point, is that we won't bother with powers of 2, because the fastest variants also don't. Here we come with this algorithm:

Integer>>optimizedPrimeFactorFactorial
"This is the optimized version of primeFactorFactorial"

Tuesday, July 5, 2011

In my last post I reported the lack of Generator in Xtreams framework.
Well, I just received this clarification from Michael and Martin, the authors of Xtreams.
Definitely, Xtreams has a lot of potential!

Monday, July 4, 2011

Last example algorithm, primeFactorsFactorial did involve enumerating of several collections:

The collection of primes up to n

The collection of powers of these primes

An alternative is to work with a stream that generates each item on the fly, rather than collecting the whole sequence of values, and then pipe to other kind of streams that transform the values. This technique has two advantages:

avoiding allocation of large auxiliary collections that will later be thrown out, can help some algorithms to scale up.

piping could theoretically enable parallel execution if we had a multi-threaded VM (though in fact we keep some collections, the optimal performance being often reached by tuning some buffers size, since the cost of inter thread synchronisation is not null)

This does not really apply to our prime factors decomposition since the collection of primes are not huge, but we have this example available, let's use it and see how it would be possible in Squeak.

First, Squeak provides an primesUpTo:do: taking a block as argument. That's all we need.

We can transform this block into a stream using a Generator. Let see in class comment how to use it:

OK, so a new Generator is associated to an outer block, [:g| Integer primesUpTo: 100 do:[:prime| g yield: prime]], which tells which method to execute to produce the sequence of values.

The inner block [:prime| g yield: prime] just tells the generator that a new prime is available.

Each time we ask for generator next, the generator machinery arranges to advance the execution of outer block until it reaches execution of the inner block, and then switch back and return the prime provided to this block.

This is known as the coroutine technique, and implemented by using a clever a simple hack manipulating the call stack. Look at the implementation by yourself.

Note that a Generator now understand value: and can be used directly in place of the inner block, try (Generator on: [:g| Integer primesUpTo: 100 do: g]) contents.

Hmm, it will not work, because a Generator understands only one iteration operator, do:, but not collect: Squeak streams are a pity.

More over, the collect: operation if implemented by means of do: would not be lazy but rather produce the whole collection of powers. We would have only eliminated the first collection of primes, but not the collection of powers of primes.

One good and recent framework - Xtream - It is an interesting alternative and provides the lazy operations we're looking for; follow the squeaksource link to get download instructions. Let's see how we can lazily pipe the operations through streams:

Unfortunately Xtream has no Generator, so we have to wrap it up in a BlockClosureReadStream.

Another problem is that injecting:into: is not reducing the stream to a single element, instead it answers a stream of all the intermediate results. Since we only need the last one, we have to repeat the operation by ourself, capture and answer the last element when the stream starves (we catch this Incomplete exception).

Once I proposed an alternative selector ^[stream get] repeatUntil: Incomplete, with a slight modification, this would have been useful; or maybe we could just extend XTReadStream to respond to last.

This is to be compared to the 19136ms of naive form. Well, that was expected, we eliminated the intermediate collections, but they did not cost that much in this case. Instead we pay a price for indirection and maybe also the Generator trick forces COG VM to un-optimize some portions. But that doesn't matter, we saw some smart streaming techniques.

A more difficult exercise would be to transform a divide and conquer recursive product evaluation variant into a streaming equivalence... Yet I do not know how, but if I find it, this will be another post.

In both cases, only the odd products are performed, in a divide and conquer fashion which produces well balanced large integers (this is important for accelerated products like KaratsubaToom or FFT). The final power of two being replaced by a bitShift: which saves many costly LargeInteger multiplications.

Our suboptimal variant of interest consists in evaluating the factorial from its decomposition in prime factors. We know that each prime <= n will be compose n factorial, but we also know the power of each of these primes thanks to Legendre. This algorithm can be naively transcribed in Smalltalk:

Oops! Unfortunately, it's not only suboptimal, it won't work for primes above 36 because Squeak has no idea which character to use to represent digit 37. Anyway, even with unicode we would only have a limited set of characters. Huge, but limited.

It does not matter anyway, we can just reimplement printStringBase: and instead of streaming out digits converted to characters, just sum the digits.

A naive algorithm would be to just remove the last digit (the remainder by a division by base):

Integer>>sumDigitsInBase: base
"Answer the sum

| sum |

sum := 0.

q := self abs.

[q = 0]

whileFalse:

[sum := sum + (q - ((q := q // base) * base))].

^sum

Since we probably won't compute the factorial of a LargeInteger, this shall be enough.

For LargeInteger, we would use a divide and conquer method in order to avoid repeated LargeInteger divisions, and replace them by as many SmallInteger divisions as possible. This would be the exact image of printStringBase:, so we won't develop it here.

Now we are done with this very naive implementation. Let's see how it performs on COG VM:

Of course we could stop the loop at n sqrtFloor, increment by 2 from second loop etc...

We could also enumerate primes by feeding a block with p inside the ifTrue: [ ].

But let's have fun by enumerating primes from the composite mask with bit tricks.

For example, we want to collect primes in an Array.

To know the number of primes up to n, we just need to count the number of bits set to 0.

Or subtract n with number of bit set to 1.

numberOfPrimes := n - composite bitCount.

Then, we just enumerate by finding and removing next bit set to zero, starting at least significant bit.

(Array new: numberOfPrimes) collect: [:void |

| isolatedBit |

"find next zero in composite"

isolatedBit := composite bitXor: composite + 1.

"remove next zero in composite"

composite := composite bitOr: isolatedBit.

isolatedBit highBit]

Indeed, 2rxxxxx01111 + 1 will generate a carry up to trailing 0, and result in 2rxxxxx10000, the leading xxxxx being unchanged.

The bitXor: then will remove the leading xxxxx, and result in 2r11111.

We then just need to get the rank of the highest bit, which is highBit's job...

An alternative is to first reverse the bits of composite mask.

Here we will just print the primes in the Transcript:

composite := 1 << n - 1 - composite.

[composite > 0] whileTrue:

[Transcript cr; show: composite lowBit.

composite := composite bitAnd: composite - 1]

Same here, 2rxxxxx10000 - 1 will generate a carry up to trailing 1 and result in 2rxxxxx01111, the leading xxxxx being unchanged. Then bitAnd: will just clear all but the leading xxxxx.

Of course, this method is not really efficient because each bitOr: and bitXor: operation will allocate a new LargePositiveInteger, not counting all the bitShift:, additions and subtractions, but do we care? Squeak Integer class>>primesUpTo: is already fast, we don't need another one...

Nastily, I removed the comments, obfuscated the names, and factored out repeated code with additional temps. It's not a good idea because some operations leaked out the primality condition block, but just for the pleasure to quizz colleagues:

quizz: aBlock
"Could you tell which serie will feed aBlock when this message is sent to an integer n?