Updated May 18, 2018: Some more bug fixes, both to multiple assignment and to PNut compatibility. fastspin is now able to build ROM_137PBJ.spin2 correctly (or at least, the same as PNut).

Updated May 17, 2018: Fixed a couple of problems that were impacting PNut compatibility, and improved the syntax for multiple assignment (so the parentheses are no longer required in most cases).

Updated May 16, 2018: Updated with a new beta of fastspin that has support for multiple assignment and functions that return multiple values. I'd also like to point to a simple GUI for fastspin + P2: https://github.com/totalspectrum/spin2gui/releases. It allows you to edit a .spin or .spin2 file, compile it, and download to the P2 hardware (thanks to Dave Hein for his loadp2 program). It's configurable and could (at least in theory) also be used for P1 development, but I haven't tested that, since there are already so many P1 GUIs available. AFAIK the only competitor for spin2gui on P2 is PNut, and at the moment PNut only supports assembly code.
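For illustration, here is my reading of the new syntax in a short sketch (untested; `divmod` and the variable names are made up for the example):

```spin
PUB divmod(a, b) : q, r        ' a function returning two values
  q := a / b
  r := a // b

PUB demo | x, y
  x, y := divmod(100, 7)       ' multiple assignment; parens no longer needed
  x, y := y, x                 ' swap two variables via multiple assignment
```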

Updated May 10, 2018: My, how time flies... Anyway, now that hardware seems imminent I've gone back and updated fastspin for the v32b FPGA images. fastspin -2 is able to compile all of the samples from Dave Hein's p2asm, including the boot ROM, so the instruction set coverage is pretty good now (but obviously let me know if anything is missing!). The Spin compiler itself is less tested in P2 mode, but I've used it for some emulators and demos on the FPGA. As always, usage is command line only and is just:

fastspin -2 myfile.spin2

This will produce myfile.pasm2 (which is the converted PASM2) and myfile.binary. You can load myfile.binary with PNut, I think, but I use Dave Hein's excellent loadp2 program, included with his p2gcc package.

Updated March 7, 2017: Updated the compiler for the v16a instruction set. The instruction set coding is not tested very thoroughly, but simple programs do compile and run.

Updated May 18: Fixed several bugs reported in the forums. The new binary is attached as fastspin_beta4.zip. Note that fastspin.exe can produce code for either P1 or P2; to get P2 you need to specify the -2 flag.

Updated May 9: Fixed binary output of jumps in COG mode, and updated the push/pop code in the compiler to use postincrement/predecrement mode. The updated binary is attached. Source code, as always, is at github.com/totalspectrum/spin2cpp.

Updated May 7: Fixed a bug in the >< operator (REV works differently on P2) so that fft_bench works now.

Updated May 6: I've attached the beta version of the compiler. The DAT section parser works better (it produces the same result as PNut for the inputs I've tested) but is still incomplete; for example, it doesn't understand the fancy ptra++ syntax for reads and writes. I've fixed a few bugs in the PASM output too, and added a fastspin-specific README. Usage is pretty simple:

fastspin -2 fibo.spin

produces fibo.p2asm, which can then be loaded by PNut.

*** Original Message ***
Here's an alpha version of the fastspin compiler for P2. fastspin compiles Spin code to PASM; it otherwise acts very much like openspin, but has a few extensions (such as inline assembly between asm...endasm). This version of fastspin has a -2 option to produce P2 code (the output file will be named with a .p2asm extension).

This is labeled an "alpha" version because assembly code inside DAT sections is not always compiled correctly -- the P2 instruction parser is incomplete and buggy. Inline assembly does (mostly) work, because it's passed through to PNut.

All caveats aside, fastspin is able to compile simple programs (like the Fibonacci demo), and it may be useful for putting together quick demos and tests of the hardware. I compiled and ran fibo like so:

fastspin -2 fibo.spin
PNut_v7 fibo.p2asm

In PNut I selected compile and run, then opened a terminal window to see the output.

Please let me know of any issues you find. I'm still working on the P2 support for DAT sections and hope to have that functional soon.

@Eric, I'll have to try fastspin when I get a chance. I haven't done anything on the P2 for a while, and fastspin looks very interesting.

@David, You didn't add a smiley to your post, but I'm sure you were just kidding. cspin could be used to convert C to Spin, and the result could then be compiled with fastspin. However, that would be very limited. We really do need GCC for the P2.

Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.

Register allocation is what always seems to stop me from creating register-based virtual machines. I should really spend some time to learn how to do it! :-)

So that's looking like the same opcodes; compilers should be able to optimise the above well?

I've just done some tests using GCC on an Intel D2000, with rather erratic results. Close, but no cigar.
(Looks more like a GCC issue than a D2000 one?)

Tests on GCC give these possible outcomes (release build), which move around when other code is edited (?!):
(a debug build gives the expected inefficient two copies, but it is quite stable)

Function use seems more stable, but has a bonus push/pop and a bonus xor? No idea why clearing the upper 32 bits before imul is needed, and that XOR seems 'mobile' in GCC; sometimes it pops up after imul, which is not where I would want it.

fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.
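For example, here is a hypothetical snippet with the kind of pattern that needs recognizing:

```spin
q := a / b        ' quotient
r := a // b       ' remainder: same operands, so CSE could share one divide
```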

GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues. There is a standard C function (ldiv) that can do both div and mod, though, which helps.

Not sure which GCC intel uses, but it seems to get the big steps right, then drop the ball in the details... much like you say..

I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod

P2 should be able to do this in the same number of ASM lines too, I think?
I can't find numbers yet on those Intel opcode speeds, but they are not likely to be fast at 32MHz sysclks.
P2, with CORDIC delays, may be similar? Or faster?

Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin. You could use inline assembly in fastspin though, something like:
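As a sketch of what that inline assembly might look like (untested, and assuming the multiple-return-value support from the newer betas; the QMUL / SETQ+QDIV sequence for an unsigned 32x32->64 multiply followed by a 64/32 divide is my reading of the P2 docs):

```spin
PUB umuldivmod(a, b, d) : q, r | lo, hi
  asm
    qmul  a, b        ' start unsigned 32x32 -> 64 multiply (CORDIC)
    getqx lo          ' low 32 bits of the product
    getqy hi          ' high 32 bits of the product
    setq  hi          ' prepend high word: next divide is 64/32
    qdiv  lo, d       ' start unsigned 64/32 divide (CORDIC)
    getqx q           ' quotient
    getqy r           ' remainder
  endasm
```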

Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":") but I'm lazy.

Eric

Given all of the work you've done on spin2cpp, fastspin, and lots of other things, I think it would be hard to support your claim of being "lazy"! :-)

Makes sense to me: because ASM does not follow Spin rules, you cannot use Spin rules to exit an ASM area.

I'm not sure. Does it depend on how many other COGs are using the CORDIC unit? With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).

As a rough check, I can get something like 6*32 + 4*2 = 200,
i.e. guessing that 6 CORDIC lines need 32 clocks each and 4 non-CORDIC lines take 2 clocks each; harder to get an odd number though?

Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.

Sounds worth investigating.
Can you post the code in another thread, with an empty call as a comparison?
Chip, or someone else, may know the answer.
If the reads are identical, it should not matter what phase effects are there, provided they are stable.

Frankly I'm not too worried about it, but here's the code if you'd like to take a look.

Maybe odd is not as odd as I thought. I assumed everything was paced at opcode rates, but the P2 can wait with sysclk granularity, and that seems to be exactly what happens here.

I found this (bold added); it seems 39 clocks is a normal delay through the CORDIC:

"When a cog issues a CORDIC instruction, it must wait for its hub slot, which is 0..15 clocks away, in order to hand off the command to the CORDIC solver. Thirty-nine clocks later, results will be available via the GETQX and GETQY instructions, which will wait for the results, in case they haven’t arrived yet.

Because each cog's hub slot comes around every 16 clocks and the pipeline is 38 clocks long, it is possible to overlap CORDIC commands, where three commands are initially given to the CORDIC solver, and then results are picked up and another command is given, indefinitely, until, at the end, three results are picked up."