Each iteration takes 0.75 microseconds; each 64K takes 49 millisecs plus 4.6 millisecs to show progress on a serial LCD. The whole program takes just under an hour.
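
As a rough sanity check on those figures, here is a small Python sketch. The block count of 65,536 (i.e. a full 2^32 sweep in 64K chunks) is my assumption; the post doesn't say how many blocks the program runs.

```python
# Timing figures quoted in the post
ITERATION_US = 0.75          # microseconds per iteration
BLOCK = 64 * 1024            # iterations per 64K block
DISPLAY_MS = 4.6             # LCD progress update per block, in ms

# ASSUMPTION (not stated in the post): the program runs 65,536 such blocks
ASSUMED_BLOCKS = 65536

block_ms = ITERATION_US * BLOCK / 1000
total_minutes = (block_ms + DISPLAY_MS) * ASSUMED_BLOCKS / 1000 / 60

print(f"per block: {block_ms:.3f} ms, total: {total_minutes:.1f} min")
```

Under that assumption the per-block time comes out at about 49 ms and the total lands a little under 60 minutes, consistent with the numbers above.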

Notes:
I did completely overlap the CORDIC time with the subroutine. Didn't worry about another cog over-writing my result.

I did not use the built-in serial driver; didn't even use the serial capability of the Smart Pin.

I did modify the P1 subroutine to use REP, and sure enough, it got faster. Two instructions in the tight loop versus three. But it doesn't matter because the instruction is exactly equivalent and takes only 2 cycles.

Re-inventing the wheel is not a waste of time if, when you are done, you understand why it is round.

The spin2gui samples/ directory contains a program called multest.spin2 that does timing tests for 32x32 multiply using a software multiply, CORDIC, and mul instruction. The software multiply has an early exit so it does well for small values. It takes between 28 and 981 cycles. CORDIC is always 70 cycles, and the 32x32 multiply constructed from 16x16 MUL instructions always takes 44 cycles (all of this is running from HUB, so in COG memory presumably the software and MUL implementations could go a bit faster).

The CORDIC of course has the advantage that it can be interleaved with other operations. But if for some reason you can't do that then using MUL to do 32x32 multiplies seems to be the best bet.
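
For anyone wondering what "a 32x32 multiply constructed from 16x16 MUL instructions" looks like arithmetically, here is a minimal Python model. It is only a sketch of the schoolbook decomposition, not the code in multest.spin2; `mul16` and `mul32x32` are names I made up, and I'm assuming a full 64-bit result is wanted.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul16(a, b):
    """Model of a 16x16 -> 32-bit unsigned multiply (low 16 bits of each input)."""
    return (a & MASK16) * (b & MASK16)

def mul32x32(a, b):
    """32x32 -> 64-bit unsigned product from four 16x16 partial products.

    Returns (hi, lo) as two 32-bit words.
    """
    a &= MASK32
    b &= MASK32
    al, ah = a & MASK16, a >> 16
    bl, bh = b & MASK16, b >> 16

    ll = mul16(al, bl)   # contributes at bit 0
    lh = mul16(al, bh)   # contributes at bit 16
    hl = mul16(ah, bl)   # contributes at bit 16
    hh = mul16(ah, bh)   # contributes at bit 32

    product = ll + ((lh + hl) << 16) + (hh << 32)
    return product >> 32, product & MASK32

print([hex(w) for w in mul32x32(0xFFFFFFFF, 0xFFFFFFFF)])
```

If only the low 32 bits are needed, the `hh` product drops out and three multiplies are enough.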

The nice thing about IIR digital filters is their regular structure: each second-order section needs 4 multiplies (ideally 32-bit), and you can issue a pipelined multiply every 8 instructions, leaving just enough spare instructions for the signed correction and housekeeping. Thus each section takes about 1 cordic pipeline delay, so you can reuse variables per section.

I've written a Python script to generate the p2asm for this, each section (apart from first and last) has the form:

So the pipeline is used without stalls every other available slot. There is only enough time to do
sign correction for one argument to the multiply, so I make all the coefficients positive and flip add<->sub
as appropriate in the accumulation.

Where d1, d2 are the delay state (per stage), a1,a2,b1,b2 are the coefficients.
The scaling between stages is going to be a shift, once I've figured out how to compute the right
shift values for maximum headroom, and there ought to be one more multiply at the end to get
the gain precise (i.e. all the a0 coefficients are finessed into a series of shifts and this multiply).
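
To make the "positive coefficients, flip add<->sub" idea concrete, here is a hedged Python sketch of one second-order section. I'm assuming a direct form II layout with unity leading gain (which gives exactly 4 multiplies); that structure and the helper names `make_section`, `accumulate` and `section_step` are my own illustration, not the generated p2asm.

```python
def make_section(a1, a2, b1, b2):
    """Split each signed coefficient into (magnitude, is_negative)."""
    return [(abs(c), c < 0) for c in (a1, a2, b1, b2)]

def accumulate(acc, mag, neg, value, subtract):
    """Add or subtract mag*value; a negative coefficient flips add<->sub."""
    term = mag * value          # the multiply only sees a non-negative coefficient
    if subtract != neg:         # exactly one of the two flags set: subtract
        return acc - term
    return acc + term

def section_step(x, state, coeffs):
    """One sample through one section. state is (d1, d2)."""
    d1, d2 = state
    (a1, a1n), (a2, a2n), (b1, b1n), (b2, b2n) = coeffs

    # Feedback part: w = x - a1*d1 - a2*d2
    w = accumulate(x, a1, a1n, d1, subtract=True)
    w = accumulate(w, a2, a2n, d2, subtract=True)

    # Feedforward part: y = w + b1*d1 + b2*d2
    y = accumulate(w, b1, b1n, d1, subtract=False)
    y = accumulate(y, b2, b2n, d2, subtract=False)

    return y, (w, d1)

# Impulse response of a section with mixed-sign coefficients
coeffs = make_section(a1=-0.5, a2=0.25, b1=2.0, b2=-1.0)
state = (0, 0)
outputs = []
for x in (1, 0, 0):
    y, state = section_step(x, state, coeffs)
    outputs.append(y)
print(outputs)
```

On real hardware the add/sub choice would be fixed when the code is generated, so no run-time test of the sign flag is needed.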

Talking of multiplies, here's a cordic pipelined complex multiply, using fix8.24 format real and imaginary parts.
A complex multiply needs 4 signed multiplies that can all execute concurrently, and an add and subtract to tie the parts together.

Let A = xr*yr, and let B = xi*yi, since these show up in multiple places later.

Now let C = (xr+xi)*(yr+yi) = xr*yr + xr*yi + xi*yr + xi*yi. The middle two terms are the imaginary part of the result, and the first and last terms are A and B, which we already have and can subtract out.

So, zr + zi*j = (A - B ) + j*(C - A - B ).
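
The three-multiply form above can be checked with a few lines of Python. This is just a sketch on plain integers (no fix8.24 scaling), and `complex_mul3` is a name I picked for illustration.

```python
def complex_mul3(xr, xi, yr, yi):
    """Complex multiply (xr + j*xi) * (yr + j*yi) using three real multiplies."""
    A = xr * yr
    B = xi * yi
    C = (xr + xi) * (yr + yi)   # note: the sums need one extra bit of width
    zr = A - B
    zi = C - A - B
    return zr, zi

# (3 + 4j) * (5 - 6j) = 39 + 2j
print(complex_mul3(3, 4, 5, -6))
```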

The disadvantage is that your multipliers have to be one bit longer than your input numbers; this isn't a problem on the P2 if you're multiplying signed numbers, because QMUL is unsigned, so you already have to manage that extra bit anyway. See Wikipedia's article for more details.

Larger multiplies composed of smaller multiplies can be done similarly: just replace j with 2^32 or 2^16. I briefly mentioned in the other multiplication thread that they could use it for long multiplies.
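
Here is a rough Python sketch of that substitution for an unsigned 32x32 multiply split into 16-bit halves. With j replaced by 2^16, j^2 becomes +2^32 rather than -1, so the A and B terms are added at different weights instead of subtracted. The function name `karatsuba32` is mine.

```python
def karatsuba32(a, b):
    """Unsigned 32x32 -> 64-bit multiply from three smaller multiplies."""
    al, ah = a & 0xFFFF, (a >> 16) & 0xFFFF
    bl, bh = b & 0xFFFF, (b >> 16) & 0xFFFF

    A = ah * bh                   # weight 2^32
    B = al * bl                   # weight 2^0
    C = (ah + al) * (bh + bl)     # 17-bit x 17-bit: the extra-bit cost
    middle = C - A - B            # equals ah*bl + al*bh, weight 2^16

    return (A << 32) + (middle << 16) + B

print(hex(karatsuba32(0x12345678, 0x9ABCDEF0)))
```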

The problem I see with the Karatsuba on the cordic hardware is that the most streamlined way to do signed
correction on a multiply requires the inputs to the multiply to have their original sign bits intact,
which the additions might break by overflowing. You'd be forced to restrict your signed values
to 31 bits.

Given a 32x32->64 bit unsigned multiply of the form

hi, lo = a * b

you can make it work as a signed multiply like this:

hi, lo = a * b
if a < 0
    hi -= b
if b < 0
    hi -= a

with most of the housekeeping overlapping the pipelined cordic (except a single instruction
to correct the value of hi.)
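
The correction can be checked in Python by modelling the registers as 32-bit two's-complement bit patterns. This is only a sketch of the arithmetic, not P2 code; `umul32`, `smul32` and `to_signed64` are illustrative names.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000

def umul32(a, b):
    """Unsigned 32x32 -> 64-bit multiply, returned as (hi, lo) words."""
    p = (a & MASK32) * (b & MASK32)
    return p >> 32, p & MASK32

def smul32(a, b):
    """Signed multiply built on umul32; a and b are 32-bit bit patterns."""
    hi, lo = umul32(a, b)
    if a & SIGN32:          # a < 0 when read as signed
        hi -= b
    if b & SIGN32:          # b < 0 when read as signed
        hi -= a
    return hi & MASK32, lo  # wrap hi back to 32 bits, as a register would

def to_signed64(hi, lo):
    """Interpret (hi, lo) as a signed 64-bit value, for checking only."""
    v = (hi << 32) | lo
    return v - (1 << 64) if v & (1 << 63) else v

# -3 * 7 with the operands given as 32-bit patterns
print(to_signed64(*smul32(-3 & MASK32, 7 & MASK32)))
```

It works because reading a negative value as unsigned adds 2^32 to it, which adds the other operand times 2^32 to the product; subtracting that operand from hi undoes exactly that, modulo 2^64.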

The alternative requires taking the absolute values and recording the sign of result, requiring a 64 bit
negate at the end and 3 or 4 instructions overhead before feeding the pipeline.

With the complex multiply the gain from 4 to 3 multiplies is only 8 cycles for the pipeline, but the
simplicity of not having to deal with overflows probably wins out.

You need to know where the file is. Mark's one above is a URL to a picture he's placed on another website.

Another way is to upload a new picture using the "Attach a file" link right below the edit box and use that. Once uploaded, a small icon of the picture will appear. Hover over it and there will be an "Insert Image" option you can click on. Clicking that will add the picture's URL to your edit box.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."