The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Part 2

Comments

Oh man, a long based processor would be very nice, simple and clean. All addressing just like COG addressing on P1...

Honestly, the COG address space is just going to be different. Adding the operator to specify that difference makes sense. I for sure don't feel good about the lengthy shift characters being everywhere. Laborious and error prone.

If all addressing is longs, there are lots of shifts and masks everywhere. Having that byte granularity in the HUB sense is best, to me at least.

And I don't get it at all. Why such a fuss over things that make a lot of sense in the context of how Propellers have been programmed so far?

The one gripe was no standard tools, and we got gcc for that, and it works too. Those who really felt that was best use gcc, and SPIN / PASM are just fine otherwise.

This Prop should play out just the same. We will have a bunch of gcc users looking for those tools, as well as people using SPIN / PASM.

On ARM, everything is 32-bit aligned;
there are only two instructions, LDRB and STRB, that can handle bytes, and they only work together with a register.
So most of the time it takes two instructions to get anything done.

The few times you need bytes on P2, could you not use a special register that you write the long to and then read back from, and that does the shift+mask?
i.e., one extra step to get a byte from a long, but not a big deal.

Rdlong specialbytetranslaterreg, hub address ' you need to read a byte in to the cog anyway.
mov label, specialbytetranslaterreg wz wc ' wz/wc sets the 1 of 4 mask location and always shift down to bit0-7
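The shift+mask step that such a register would perform can be sketched in Python. This is only an illustration of the arithmetic, not P2 hardware or PASM: the function names are hypothetical, and little-endian byte order (byte 0 in bits 7:0) is assumed.

```python
def extract_byte(long_value: int, byte_index: int) -> int:
    """Return byte 0..3 of a 32-bit long (assumed little-endian: byte 0 = bits 7:0)."""
    if not 0 <= byte_index <= 3:
        raise ValueError("byte_index must be 0..3")
    # Shift the wanted byte down to bits 7:0, then mask off everything above it.
    return (long_value >> (8 * byte_index)) & 0xFF


def read_byte_via_long(hub: bytes, byte_addr: int) -> int:
    """Read a whole aligned long first, then pick one byte out of it."""
    long_addr = byte_addr & ~3  # clear the two low bits to get the long's base
    long_value = int.from_bytes(hub[long_addr:long_addr + 4], "little")
    return extract_byte(long_value, byte_addr & 3)  # low two bits select the byte
```

The two low address bits play the role that the wz/wc selection plays in the pseudo-PASM above: they pick one of four byte lanes.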

As that "new user" so many years ago, I was able to jump on a P1 and get things done in a DAY. Nothing about it was hard, and lots about it was a lot of fun.

I really want that same overall feel for P2 with SPIN and PASM. Ideally, the on chip system will complete the picture with the whole thing one design, made to operate together, etc...

I absolutely agree! And I want that again too!

However, you and I are at an advantage here: we both already have P1 PASM under our belt and we both have been deeply involved in the P2 design. Even without an FPGA image to work with, I guarantee both of us can write advanced P2-style PASM. We can never be "new users" again, even for the P2.

This will not be the case for someone who has never looked at a Propeller before. The P2 is already much more advanced (and complex) than the P1 is. The learning curve is going to be steeper. I'm just concerned that it's also going to be less fun.

...This has come up a pile of times before. And I'll say it again, if SPIN and PASM didn't make the great sense they did, I would have passed on this chip in a second, never thinking twice. It's really important that we leave SPIN and PASM to its creator, who is Chip, and let him do what he does with languages and tools.

Yes, that is different, and that is precisely why a lot of us like using those tools and languages.

Bear in mind, one of the design specs is "fun to use"

Absolutely Agree!

I love the simplicity of PASM.
Spin syntax (the short operators like != etc. catch me often) I don't enjoy so much. I'd rather have a more BASIC-like syntax. But I do like the enforced indentation (as long as the IDE marks it like PropTool can).
BTW There could always be an alternate syntax giving the same bytecode output.

I am really an Assembler Programmer. However, I only do PASM in the P1 when required. If there is no speed issue, Spin is actually easier.

When I "accidentally found" the P1 I was over-awed (if that's a word) with its capabilities, multicore and no interrupts. I immediately ordered a ProtoBoard or 2. Then I had to wait. Meanwhile I started programming.

When it arrived, I had my blinking LED program running in minutes! You cannot do that with any other chip that I know of (and I have programmed a lot of them).

One thing is certain, to me anyway - PASM + SPIN2 will go together well. They will be used by lots of newbies to just get started.

And, +1 for wanting some short form macro ability, at least for the ALT/AUG+JMP/CALL/RET instructions.

Seairth, jmg, etc.
I STRONGLY disagree with you. In fact, I argue that doing it as I suggested makes it MUCH easier for the new person. Having to put addr/4 or addr<<2 all over your code, and having to know which to use when, is extra complication that exists only because the actual stuff is byte addressed, while most of the opcodes contain only 9 bits for cog addressing, so you need to /4 the values; some of them, however, expect the larger 20-bit address.

You misunderstood what I was saying. I wasn't arguing that we should keep "/4" instead. I'm saying "##" is obscuring the fact that you have to do "/4" at all!

Here's the way I see it:

1. Hub memory supports instructions at any byte offset. Because of that,
2. Instructions in cog (and now LUT) memory are also treated like they're byte addressable to keep things consistent (even though cog/LUT instructions must always be long-aligned). Because of that,
3. Cog/LUT instruction addresses have two extra bits that must be dealt with. Because of that,
4. We have to do things like "/4" and "<<2". Because of that,
5. We add "##" as syntax sugar.

Each step adds complexity to address the complexity before it. Instead, if we treat the instruction addresses as long offsets (including in the hub), that entire list above goes away. To me, that is simpler. That is easier to learn and to understand. That is more fun.

It is things like packed records where byte granularity is important. If you have multiple COGS working in the same memory, you need that as Atomic Granularity too.
It can be a pain to mix, but (as Chip says) I do not see much choice?

The whole conundrum is in supporting less-than-long data (words and bytes). They need extra bits to resolve addresses among longs.

It would be great to make a machine that is just long-based - what a relief that would be! Supporting words and bytes, though, requires those extra sub-bits. Then there's the issue of how to handle the addressing scheme which must involve all three sizes.

Chip, I think I must be missing something about the new design. I understand that the address lines to the hub memory must have the lower two bits so that you can address individual bytes. What I don't understand is why this also affects instruction fetching. If instruction addressing (not data addressing) is in longs instead of bytes, then I would think that:

* pc[8:0] would exactly match the cog or lut address lines (depending on pc[9]).
* {pc[16:0], 2'b00} would exactly match the address lines to the hub memory.
* pc[19:17] would be reserved for future expansion (assuming you didn't implement the other suggestion I made above)
* pc would increment by 1 regardless of execution mode.

I don't see how this affects or is affected by supporting less-than-long data addressing.
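The mapping in the bullets above can be sketched as a small decoder. This is a sketch of the proposal, not of the actual P2 logic. The bullets do not say how hub is told apart from cog/LUT, so the rule used here (long addresses below $400 are cog/LUT, everything else is hub) is an assumption, as is the function name.

```python
COG_LUT_LONGS = 0x400  # assumption: long addresses below this are cog/LUT


def decode_long_pc(pc: int):
    """Map a long-based PC to (region, address lines) per the bullets above."""
    if pc < COG_LUT_LONGS:
        region = "lut" if (pc >> 9) & 1 else "cog"  # pc[9] picks cog or LUT
        return region, pc & 0x1FF                   # pc[8:0] drives the RAM directly
    # {pc[16:0], 2'b00}: append two zero bits to form the hub byte address
    return "hub", (pc & 0x1FFFF) << 2


def next_pc(pc: int) -> int:
    """The PC steps by 1 in every execution mode under this proposal."""
    return pc + 1
```

For example, `decode_long_pc(0x205)` gives `("lut", 5)` and `decode_long_pc(0x400)` gives `("hub", 0x1000)`.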

Please don't add more address operators. This is just obscuring complication with syntax sugar. This makes PASM more difficult to learn for new people. I'm sure some of you will disagree, but you need to remember that you have an entirely different perspective of the P2 than a new person will.

And, personally, I think it makes the Propeller less fun to program for. I'd much rather get rid of the complication and keep the fun!

I agree. I was just revisiting Prop2-Hot and looking at its address operators. We are much simplified in this Prop2. Much of that simplification comes from not having alignment rules.

I am quite concerned about the Special Registers being located at COG $000+.

Cog RAM $000+ is often used for tables. Now that tables cannot be zero-based, an extra offset value has to be added to reach the table. While this is often not a problem, it is when the table is in continual use, because the extra step slows down the code.

Some examples...

1. Font table: I use a font table located in cog $000+ within the video generator cog. It is extremely timing dependent!

2. Vector table: Currently there is no other way, but in my faster spin interpreter I have a vector table located in hub. It will be much faster for this to be located in COG or LUT. If it's in COG $000 it will be much faster to decode each spin opcode. IIRC the average spin opcode uses about 50 instructions. Cutting just one instruction in EVERY opcode will yield another 2% gain. With LUT-exec and stacks, we are going to see a dramatic improvement in spin execution time. Every bit of speed will help. I am also sure we will see other interpreters making an appearance on P2, as well as GCC.

If Bill is around I would love to hear his opinion ???

Meanwhile, Chip may I suggest you just leave it as you now have it (Special Registers at $000+). This way we can check it out.
We all need an FPGA code release

The big advantage to putting those special registers at $000..$007 is that cog and LUT become one uninterrupted code space. It makes 1k-instruction programs much easier to write, as there's no interruption where those special registers used to be. So, no cutting your program in half all the time.

You can always use the LUT as a quick lookup table with zero-based addressing. The RDLUT is a 3-clock instruction, though, not a 2-clock.

It would be neat to have object-level and PUB/PRI-level control over whether Spin code is compiled or interpreted.

Now that SPIN won't be in the ROM, the nice thing is that it can be improved even after the P2 is released! I suggest adding a "_version" const (or something similar) to SPIN2 in anticipation of having a living language spec.

I love PASM, it's by far the best ASM language I have ever used, and I have used a half dozen or more. It's super simple and consistent. I want the P2 version to retain that as much as possible while it adds all the new abilities.

I have totally lost track of the issues here but I totally agree with that. I hope PASM does not get messed up.

What I'm worrying about is how easy it will be to get P1 PASM working on the P2.

Why can't we just use longs on the outside (ie visible to the programmer)?

The only time we use bytes and words is with RD/WR-BYTE/WORD. So the only time we really need to worry about byte addressing is when referencing hub.

So what has happened to make us use byte addresses everywhere on the P2 ???
Everything was fine on the P1 so can't we do the same on P2?
I am confused!

The way the Prop1 addresses memory means that longs must be long-aligned and words must be word-aligned, while bytes can be anywhere. One thing that means is that you cannot have structures made up of mixed word sizes.

On the Prop2, there are no such limitations. There is only one issue where any type of hub alignment matters, and that is on fast r/w blocks that wrap - they must be long-aligned to wrap properly. In no other case does it matter, so it makes understanding hub memory dead simple. The ONLY place I see it being a pain is in reconciling cog and LUT longs, which each have a single address, with longs in hub, which take four addresses. That's why <<2 and >>2 come into play. Those could be cleaned up by the approach taken in the development tools, though.
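The reconciliation described above is just a shift by two in either direction. A minimal sketch, with hypothetical helper names, of what a development tool might do behind the scenes:

```python
def cog_long_to_byte_addr(long_index: int) -> int:
    """A cog/LUT long has one address; its byte-style address is 4x larger (<<2)."""
    return long_index << 2


def byte_addr_to_cog_long(byte_addr: int) -> int:
    """Going the other way drops the two low bits (>>2)."""
    return byte_addr >> 2
```

Cog long 8 therefore shows up as byte-style address 32, which is the `8<<2` form mentioned elsewhere in this thread. The round trip is lossless only for long-aligned byte addresses, since `>>2` discards the two low bits.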

You know that there is a one-clock penalty for reading/writing hub longs and words that cross long boundaries, but that minor penalty can be overcome by using long alignment, if you want. It is not necessary, though, and I don't see a reason to force it, as it would just introduce a caveat to where something can be.
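Whether a given access pays that extra clock can be checked from the address and size alone. A small sketch, with a hypothetical function name, that assumes the penalty applies exactly when the access spans two different hub longs:

```python
def crosses_long_boundary(byte_addr: int, size: int) -> bool:
    """True if a `size`-byte access (1, 2 or 4) at byte_addr spans two hub longs."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    # Offset within the long plus the access size must fit inside 4 bytes.
    return (byte_addr & 3) + size > 4
```

A long at a long-aligned address never crosses, a long at any other address always does, and a word crosses only when it starts at offset 3 within a long.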

I think what Roy said about insisting on byte-address-level reckoning for cog and LUT is the key to happiness (or peace, at least) here because it maintains consistency of understanding between cog/LUT memory and hub memory, at least size-wise. What SEPARATES cog/LUT memory from hub memory is another issue which touches on sensibilities.

About not wasting two bits of the PC by supporting non-long-alignment: Remember that we still need to have ANOTHER two bits beyond the PC's bits to reach down to words and bytes. Those two bits must be encoded into the instructions for reckoning absolute and relative addresses. We are at 20 bits for those purposes and there are no more bits for bigger addresses in the opcode set. So, these two sub bits of the 18-bit PC, if you want to see them that way, total about 20 flops per cog, with 16 of them being in the 8-level PUSH/POP/CALL/RET hardware stack. They are not resource hogs and if we got rid of them, we would be forced into long-alignment for all instructions. That would be the only effect of getting rid of them. We wouldn't get a 4x-size hub memory map because we are constrained to 20 bits for byte-level addresses. However, if we totally got rid of words and bytes (which I've really thought about), we could have a 4x-size hub memory map. Supporting words and bytes is a pain, but I realize that for many reasons they are vital. If we didn't have bytes, each of us would hit a wall as soon as we needed a memory-efficient mechanism to handle them. We'd be doing read-modify-writes on hub longs and pulling our hair out, knowing we were mired in the reinvention of an old wheel.
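The numbers in that post can be checked with a few lines of arithmetic. The constants come from the post; the function names are hypothetical, and the flop count here includes only the PC and the hardware stack, so it lands slightly under the "about 20" quoted.

```python
ADDRESS_BITS = 20  # address field width quoted in the post
SUB_BITS = 2       # the two sub-long bits carried alongside the PC
STACK_LEVELS = 8   # depth of the PUSH/POP/CALL/RET hardware stack


def addressable_bytes(address_bits: int, bytes_per_unit: int) -> int:
    """Size of the memory map when each address names `bytes_per_unit` bytes."""
    return (1 << address_bits) * bytes_per_unit


def sub_bit_flops(sub_bits: int, stack_levels: int):
    """Return (flops in the stack, flops in stack plus PC) for the sub-long bits."""
    stack = sub_bits * stack_levels
    return stack, stack + sub_bits
```

With byte addressing, 20 bits reach 1 MiB; if every address named a long, the same 20 bits would reach 4 MiB, which is the 4x map mentioned above.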

That's interesting!

I wonder if we could reckon ALL memory by long-address and consider the two orphaned LSBs as fractions: 0.00, 0.25, 0.50, 0.75. Actually, those could be expressed as .0, .1, .2, .3.

It would be a little weird to understand that some hub-exec code starts at xxxx.3, for example. But, that's life. I think that would really look strange to people. Perhaps just having the tools unify hub-addressing notions with cog/LUT realities would be best.
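To see what that notation would look like, here is a small formatter. It is purely illustrative: the function name and the five-digit hex width are assumptions, not anything a real tool does.

```python
def long_dot_notation(byte_addr: int) -> str:
    """Render a byte address as <long address>.<byte within long>."""
    long_addr = byte_addr >> 2  # the long-address part
    sub = byte_addr & 3         # the two orphaned LSBs: 0, 1, 2 or 3
    return f"{long_addr:05X}.{sub}"
```

Byte address $1003 would be written as `00400.3`, meaning the fourth byte of long $400.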

I'm confused. Wasn't the purpose of having the cog memory long aligned so that the addresses could fit in 9 bits? How does the P2 work now?

Almost all 32-bit processors enforce some kind of long alignment rules because memory fetches are long aligned.

There are two bits in the PC to resolve non-aligned code addresses in the hub. So, during hub exec, all the bits are used to specify the byte-start of the long instruction, while in cog exec, PC[10:2] feeds the cog RAM and the two LSBs are ignored.
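A rough model of that fetch rule, written as a sketch rather than as the real Verilog; the function name and the `hub_exec` flag are hypothetical stand-ins for however the hardware tracks the execution mode.

```python
def fetch_address(pc: int, hub_exec: bool) -> int:
    """Return the address presented to memory for an instruction fetch."""
    if hub_exec:
        # Hub exec: every PC bit counts, so the long may start at any byte.
        return pc & 0xFFFFF
    # Cog exec: PC[10:2] indexes cog RAM; the two LSBs are ignored.
    return (pc >> 2) & 0x1FF
```

In cog exec, PC values $20 through $23 all fetch cog long 8, while in hub exec each of those values names a different starting byte.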

I realise we need to keep the byte and word access to/from hub. But that is the only requirement where we need to see the lowest 2 bits. And they are only non-zero (presuming we must long align instructions in hub - and it is my belief this should be demanded) when we want to access bytes (00/01/10/11) and words (00/10).
But when we are referring to hub longs those bits should be 00.
In other words, words should be word aligned and longs long aligned, just as we have in the P1. That made sense and was easy to understand.
When we reference cog or lut, they should always be accessed as longs. If it's necessary anywhere (and I didn't see that in P1V code) then they should be hidden from the user and be 00.

IMHO I think the whole byte addressing idea came about because of hub-exec. But that should not have happened as the instructions should always be long aligned. I don't see any reason for them not to be. It's not like we have varying sized instructions as on some processors.

Therefore, the PC should only hold bits Addr[19:2] with bits[1:0]=00 assumed. We then just need a flag to indicate whether the address is in hub, or in cog/lut where cog and lut should IMHO be represented as contiguous addresses A[12:2] with A[1:0]=00 assumed. Of course D & S can only normally contain A[10:2] which we use as D[8:0] and S[8:0].

No need to orgh $1000 or whatever since the program counter will have a flag determining whether the code is in hub or cog/lut.

You will note that I have used SETQ to set the number of times the RD/WR-LONG/WORD/BYTE will execute. Thus the count only needs to be the number of longs/words/bytes that need to be copied. The Verilog will add 1/2/4 where needed (when executing the rdlong/etc).

I have also presumed since we now have contiguous COG/LUT that the SETQ could be changed (later) to allow a full copy of COG/LUT (ie can use 11 bits).

Is this possible, since it is a lot easier than at present? And does it make sense, or am I missing something???

ORGH no longer needs to start at an offset.
SETQ sets a count (longs for rdlong).
ORG 8<<2 for cog no longer is in bytes, so ORG 8 can be used.

I was of the mind yesterday that I should expand RDLONG-repeat to automatically flow from cog to LUT. This would involve one more D bit in the RDLONG instruction and one more D bit in the SETQ instruction. Both could be done, but then I started thinking how it would booger up the instruction set for this single-purpose accommodation and I decided against it. It's still pulling at me, though. It would be nice to have a single means to load both cog and LUT. It could be as simple as this: