Tachyon Forth for P2 -FAT32+WIZnet- Now Smartpins - wOOt!

I've been a bit of a late starter with P2 code, but I've cobbled together a kernel that I have been using to learn more about the P2 instruction set and the way PNut compiles code. Although I am not up and running yet, it may be mostly a matter of porting much of the high-level bytecode across and adjusting it to suit PNut. I've also found that some of the little condition-code tricks we used on the P1 don't work the same on the P2.

Here is some high level bytecode compiled in PNut that I have used for a simple test:

So between the two _GETCNTs and stacking the result it takes $31 (49) instructions, which at 50MHz and 2 clocks/instruction is, IIRC, 1.96us. That doesn't look too bad considering I am only testing functionality; I won't optimize it until it says "ok".
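A quick sanity check of that figure (just the arithmetic modeled in Python; the $31 count and clock figures are taken from above):

```python
# Rough check of the timing estimate: $31 (49) instructions between
# the two _GETCNTs, at 2 clocks/instruction and a 50MHz clock.
instructions = 0x31              # $31 = 49
clocks = instructions * 2        # 98 clocks total
mhz = 50                         # 50 clocks per microsecond
elapsed_us = clocks / mhz
print(round(elapsed_us, 2))      # -> 1.96
```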

BTW, the byte-aligned addresses necessitate the extra step of shifting the bytecode value left by 2 to get the correct address to jump to, and they also mess up the PNut-compiled source, as I have to use /4 after every bytecode reference. But I will work with what I've got until it is up and running.
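To illustrate the round trip (a Python model; the address value is made up):

```python
# A bytecode stores address/4 (routines are long-aligned), so the
# interpreter shifts it left by 2 to recover the real jump address.
routine_addr = 0x1A4             # hypothetical long-aligned routine address
bytecode = routine_addr // 4     # what the compiled source stores (the /4)
jump_addr = bytecode << 2        # what the interpreter reconstructs
print(hex(jump_addr))            # -> 0x1a4
```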

Once you do that, 'RFBYTE D (WC,WZ)' can be used to read contiguous bytes, starting from startbyteaddress. RFBYTE means 'read fast byte' and it always takes 2 clocks. RDFAST initiates the read-fast mode. This doesn't work with hub exec, because hub exec uses the RDFAST mode, itself. That first D/# term in RDFAST tells how many 64-byte blocks to read before wrapping back to startbyteaddress (0= infinite). To make wrapping work, startbyteaddress must be long-aligned.

WRFAST works the same way, and uses WFBYTE, WFWORD, WFLONG.
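The wrap behaviour described above can be modeled like this (a Python sketch of the description, purely illustrative, not real P2 code):

```python
def rdfast_stream(data, start, blocks):
    """Model RDFAST/RFBYTE: yield bytes from 'start', wrapping back to
    'start' after blocks*64 bytes (blocks=0 means never wrap)."""
    addr = start
    count = 0
    while True:
        yield data[addr]
        addr += 1
        count += 1
        if blocks and count == blocks * 64:   # wrap after N 64-byte blocks
            addr, count = start, 0

hub = bytes(range(256))                       # pretend hub RAM
stream = rdfast_stream(hub, start=0, blocks=1)  # wrap every 64 bytes
first65 = [next(stream) for _ in range(65)]
print(first65[63], first65[64])               # -> 63 0  (wrapped back)
```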

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

I have been suffering the pain of "adocumentation": I see all these wonderful instructions in the summary but am not at all sure what they do. Some of these have changed from P2-hot, and the descriptions of many others are buried in a myriad of tangled posts. But, as I said, I will work with what I've got to see what I can do, although I am looking forward to the new image with long-addressed cog memory etc.

As for RDFAST, I will have to think about how I can use this feature, although I want to achieve functionality first so that I can have an interactive development and test environment with an SD filesystem. Once I write an inline assembler I can then play with these enhancements and get a feel for what will work. Also, thanks to hubexec, there will of course be no problem in having PASM code definitions, or even PASM mixed in with bytecode.

Overall I'm pumped even though I don't expect silicon for a good year, so here's looking at making this a good year!

BTW, I have my kernel mostly running now after which I will add the high level bytecode.

WOW! This is an interesting set of instructions helped by the egg-beater.

We should be able to get a number of interpreters up and running quickly!
And with the LUT for extra code, the interpreters should perform fast too!

IIRC there is a small FIFO, so the first read has to wait for a hub slot (of course), but thereafter linear data is slot-free. Random access is not helped much, but designs that used a short-skip approach should benefit.

This is all true.

But does execution after a RDFAST continue immediately (in which case a too-early RFBYTE would need to block) or does RDFAST wait until the FIFO begins to fill from the hub before continuing?

Now that I've had a bit of a play, I can see that PNut is limiting me somewhat in massaging the code into the best areas, whereas on the P1 I relied on BST, with its features and listing output, to help me.

Not to be too deterred, I am now working on a version that gets rid of the vector table and compiles 16-bit addresses in place of the bytecodes. That means a Forth instruction can jump to code anywhere in the first 64k, be that cog, lut, or hub. High-level definitions such as colon defs will have a CALL to the colon interpreter to stack the IP and load it with the new address. All this is rather similar to a more conventional 16-bit (address) Forth, as we now have a much larger code space to work from, and we also have hubexec. Code outside the first 64k can still be called by jumping via an instruction in the first 64k, or I may just insist that code is long-aligned so that I can address 256k of code directly with a 16-bit word. So all definitions are CODE definitions by default. This should make for a pretty snappy but still compact Forth.
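To make the dispatch concrete, here is a little Python model (the addresses and routine names are invented for illustration): the compiled definition is just a sequence of 16-bit addresses, and the inner interpreter fetches each one and jumps to it directly, with no vector table in between.

```python
import struct

stack = []

def lit5(): stack.append(5)
def lit7(): stack.append(7)
def plus(): stack.append(stack.pop() + stack.pop())

# Pretend these routines live at 16-bit code addresses:
code_space = {0x0100: lit5, 0x0104: lit7, 0x0108: plus}

# The compiled definition: three 16-bit little-endian addresses.
threaded = struct.pack('<3H', 0x0100, 0x0104, 0x0108)

ip = 0
while ip < len(threaded):
    (addr,) = struct.unpack_from('<H', threaded, ip)   # fetch 16-bit word
    ip += 2
    code_space[addr]()                                 # jump to the routine

print(stack)   # -> [12]
```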

Using words instead of bytecodes makes more sense for the P2, as we can directly address the first 64k of memory, or 256k as code once Chip changes the cog to long addressing and I align high-level defs to longs. As for timing, I am finding that the words "FOR 1234 DROP NEXT" take under 1us/loop @160MHz, which is faster than the P1; for comparison, the P1 executes this in 3.4us/loop at 80MHz. There is room for improving these figures, as I haven't really made use of any special P2 features yet. There will be plenty of other speed gains simply because we can have more PASM instructions, plus other things.
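In clock terms (simple arithmetic, using the figures just quoted):

```python
# Convert the measured per-loop times above into clock counts.
p2_clocks = 1.0e-6 * 160e6        # <1us/loop at 160MHz: under ~160 clocks
p1_clocks = 3.4e-6 * 80e6         # 3.4us/loop at 80MHz: 272 clocks
print(round(p2_clocks), round(p1_clocks))   # -> 160 272
```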

I can't see how I can use RDFAST though as I can't really make use of sequential access.

At this rate it won't be too long before I am ready to release Tachyon Explorer for the P2 and then we can have some fun. I could also include a P2 assembler so assembly CODE definitions can be created and tested interactively.

RDFAST releases once it has data in the FIFO. That way, RFBYTE/RFWORD/RFLONG never wait for anything.

At this rate it won't be too long before I am ready to release Tachyon Explorer for the P2 and then we can have some fun.

Nice!

I could also include a P2 assembler so assembly CODE definitions can be created and tested interactively.

This would be really cool !!!

The Forth environment lends itself to testing out code easily, as parameters can just be put on the stack and the results printed out interactively. That also includes timing the operation, as well as being able to see I/O effects with SPLAT.

Even with the changes in the new image we are expecting any time now, I still expect to be up and running by next week, or sooner.

Now that I have played with a "wordcode" kernel, I am looking at a subroutine-threaded interpreter. The current method is to jump to a routine and then jump back explicitly to the runtime interpreter. Calling the routine and having it return to whatever called it is now possible on the P2, although the hardware return stack is only 8 levels deep. That leaves me the option at compile time of compiling 16-bit wordcode addresses to be read by the runtime interpreter, or compiling CALL instructions instead; since every high-level routine is entered as assembly code anyway, there is then no need to interpret 16-bit wordcode. The subroutine-threaded method takes up twice as much memory but also runs faster.
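The trade-off can be sketched in Python (purely a model; the names are mine): the wordcode version walks a list of addresses through a little runtime interpreter, while the subroutine-threaded version is just plain calls, at twice the size (32-bit CALLs versus 16-bit addresses).

```python
# Model of the two threading styles for a definition like ": f 5 7 + ;"
stack = []
def lit5(): stack.append(5)
def lit7(): stack.append(7)
def plus(): stack.append(stack.pop() + stack.pop())

# Wordcode threading: the runtime interpreter fetches each entry
# (a 16-bit address on the P2, a function reference here) and jumps.
wordcode = [lit5, lit7, plus]
def interpret(code):
    for word in code:
        word()

# Subroutine threading: compiled as real calls, so the return stack
# does the threading and no interpreter loop is needed; each entry is
# a full 32-bit CALL instruction, i.e. twice the memory.
def f():
    lit5(); lit7(); plus()

interpret(wordcode)
a = stack.pop()
f()
b = stack.pop()
print(a, b)   # -> 12 12
```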

Just thinking, really, in case I needed to go deeper, and lut-based operators would certainly do it; I take it you mean the lut would be used for stack space. The thing is that the instruction pointer (IP), which in reality is PTRA, needs to be stacked whenever I enter a new "colon" definition, that is, another routine made up of wordcodes. So having general stack operators would be nice.

A curious thing with subroutine threading: I've tried using direct assembly calls for routines in hubexec, and there is no noticeable performance gain over interpreting the 16-bit addresses with the runtime interpreter. So why use assembly calls when they take twice as much memory? I even replaced the FOR NEXT with DJNZ, but it's much the same.

Of course we could code such a simple loop as pure assembly rather than as calls to Forth words and the stack etc., and this will certainly be the case for quite a few functions, but it seems there is not much reason to worry about this method for kernel words. Forgoing calls to the Forth kernel and coding as we normally would (plus no REP), an assembly routine doing exactly the same I/O toggle results in a pulse period of 580ns.

Peter,
I'm not able to decipher what stacking levels are in use there. Was it intended to help answer Chip's question of whether there is a strong case for using the LUT for stacking or not?

Chip,
Another stacking variant that might be more palatable to all would be for the CALLA/B and PUSHA/B instructions to be able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of hub addresses, though.

Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

Are there any other sleeper purposes for address sensitivity?

P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.