The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Comments

Because the FIFO is sequential and not randomly accessed, it must be reloaded on every jump to hub memory, even if it contains the needed code.

I wasn't talking about randomly accessing the FIFO. But it occurs to me that my thinking was still wrong. Because of the pipeline, the instruction that would be returned to would have already been removed from the FIFO anyhow. Ah well...

Hehe, if only. I think the danger in calling them smart pins is everyone expects them to do pretty much everything, which would be a mighty big ask.

I think Chip mentioned recently that the P2 logic might grow by 10k flops or so for the smart pins, which is a budget of around 150 flops per pin. You probably need 100 flops just for working and config registers (a la P1 counters). I don't doubt mighty useful things can be achieved, but it's quite a thought exercise.

edit: might be worth Chip starting a new thread as per Ken's request, in case that helps the forum server

So even a small REP loop will be repeatedly fetching from HubRAM, right? Which effectively means HubExec always consumes about 50% of that Cog's available Hub bandwidth.

And that's the advantage of the egg beater. If this performance plays out nicely, it's a very good thing, because it means we get a lot of pretty speedy COGS without all the complex interactions on the "hot" chip.

It was mentioned earlier that a jump will cause the streaming FIFO to refill, even if the jump address matches the top of the FIFO. I would guess that a REP would also cause the FIFO to refill as well. So, yes there would be stalls. The streaming FIFO is just a FIFO and not a cache. Hubexec will efficiently execute straight line code, but I don't think it will be very efficient with loops and jumps. This concerns me a bit because C code tends to have lots of jumps and function calls.
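
To make the straight-line versus loop concern concrete, here is a toy model. It is not a description of the actual P2 pipeline. It assumes one cycle per instruction and a fixed reload penalty on every non-sequential fetch. The value `REFILL_PENALTY = 8` and the function name `hubexec_cycles` are made up for illustration.

```python
def hubexec_cycles(trace, refill_penalty):
    """Count cycles for an address trace under a toy FIFO model.

    Each instruction costs 1 cycle. Any non-sequential fetch (address is
    not previous + 1) forces a FIFO reload costing refill_penalty extra
    cycles, even if the target was fetched recently, because a FIFO is
    not a cache.
    """
    cycles = 0
    prev = None
    for addr in trace:
        if prev is not None and addr != prev + 1:
            cycles += refill_penalty  # jump: FIFO must be reloaded
        cycles += 1
        prev = addr
    return cycles


REFILL_PENALTY = 8  # hypothetical value, chosen only for illustration

straight = list(range(100))   # 100 instructions, no branches
loop = list(range(4)) * 25    # 4-instruction loop executed 25 times

straight_cycles = hubexec_cycles(straight, REFILL_PENALTY)
loop_cycles = hubexec_cycles(loop, REFILL_PENALTY)

print("straight-line cycles:", straight_cycles)
print("tight-loop cycles:   ", loop_cycles)
```

Both traces execute 100 instructions, but in this model the loop pays the reload penalty on each of its 24 backward jumps, which is the kind of overhead being worried about for branch-heavy C code.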

Chip,
What can the LUT (256 longs) be used for besides CLUT (and obviously lookup tables since you mention sin/cos tables)?

There's a Goertzel algorithm circuit in the streamer which can output the lower two bytes of the LUT longs to DACs, while the top two bytes are treated as sine and cosine for accumulation. On each clock, the 32-bit phase is incremented, the LUT is read using the upper bits of the phase as the address, then the looked-up top two bytes are each multiplied by an ADC feedback bit (0/1 --> -1/+1) and accumulated separately. After many cycles of this, those accumulations represent an (X,Y) point that expresses angle and amplitude. The CORDIC instruction ARTCAN converts (X,Y) into (rho,theta) so you get power and phase angle. Those bottom two bytes from the LUT are output to DACs as a stimulus, if needed, to excite some system through which the ADC returns a bitstream. When used in this closed-loop mode, you have an instrument which should be able to resolve all kinds of interesting things in the real world that relate to resonance, time-of-flight, phase differences through sensor arrays, and who knows what else. Something new to play with.
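
For anyone who wants to play with the idea before there is hardware, the accumulation loop described above can be modeled in software. This is only a sketch under assumptions: the table is 256 entries with separate signed 8-bit sine and cosine lists standing in for the top two LUT bytes, the DAC stimulus bytes are left out, `math.hypot`/`math.atan2` stand in for the CORDIC conversion, and the "ADC" is just a sign comparator on a test tone. All function and constant names are invented for the example.

```python
import math

LUT_BITS = 8                  # assumed 256-entry table for this sketch
LUT_SIZE = 1 << LUT_BITS
PHASE_MASK = 0xFFFFFFFF       # 32-bit phase accumulator

# Signed 8-bit sine/cosine tables, standing in for the top two LUT bytes.
SIN_LUT = [round(127 * math.sin(2 * math.pi * i / LUT_SIZE)) for i in range(LUT_SIZE)]
COS_LUT = [round(127 * math.cos(2 * math.pi * i / LUT_SIZE)) for i in range(LUT_SIZE)]


def goertzel_accumulate(adc_bits, phase_step):
    """Accumulate (X, Y) from a 1-bit stream, one table lookup per clock."""
    phase = 0
    x = y = 0
    for bit in adc_bits:
        phase = (phase + phase_step) & PHASE_MASK   # add to the 32-bit phase
        index = phase >> (32 - LUT_BITS)            # upper bits address the table
        sign = 1 if bit else -1                     # ADC bit 0/1 --> -1/+1
        x += sign * COS_LUT[index]
        y += sign * SIN_LUT[index]
    return x, y


def to_polar(x, y):
    """Software stand-in for the CORDIC (X,Y) -> (rho,theta) step."""
    return math.hypot(x, y), math.atan2(y, x)


def one_bit_tone(period, count):
    """Crude stand-in for an ADC bitstream: the sign of a sine wave."""
    return [1 if math.sin(2 * math.pi * n / period) >= 0 else 0 for n in range(count)]


SAMPLES = 6400
probe_step = (1 << 32) // 64   # probe frequency: one cycle per 64 clocks

on_rho, on_theta = to_polar(*goertzel_accumulate(one_bit_tone(64, SAMPLES), probe_step))
off_rho, off_theta = to_polar(*goertzel_accumulate(one_bit_tone(40, SAMPLES), probe_step))

print("tone at probe frequency: rho =", round(on_rho))
print("tone off probe frequency: rho =", round(off_rho))
```

A tone at the probed frequency should build up a much larger rho than one away from it, which is what makes the circuit useful as a narrow-band detector.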

Bill, some of us would like to see an FPGA image as soon as possible. And Chip is working toward a deadline of getting the Verilog to Treehouse by November 1st. I don't think random suggestions for changes to P2 are helpful at this point.

It was mentioned earlier that a jump will cause the streaming FIFO to refill, even if the jump address matches the top of the FIFO. I would guess that a REP would also cause the FIFO to refill as well. So, yes there would be stalls. The streaming FIFO is just a FIFO and not a cache. Hubexec will efficiently execute straight line code, but I don't think it will be very efficient with loops and jumps. This concerns me a bit because C code tends to have lots of jumps and function calls.

REP is not allowed in hub exec, as it would be a pain to make work and pretty much defeat the purpose of its efficiency. So, REP is only usable within cog exec.

There will be lots of hiccups branching around in hub code, for sure. An instruction cache could help that, in cases of tight loops, but I don't want to go there. I just figure that hub exec, in itself, is miraculous enough for this chip.

It would save an instruction, right? I don't think it's worth doing at this point. But, keep thinking.

There will be lots of hiccups branching around in hub code, for sure. An instruction cache could help that, in cases of tight loops, but I don't want to go there. I just figure that hub exec, in itself, is miraculous enough for this chip.

I would expect software tools to help a lot here, as there is always a mix of COG and HUB code, so the smallest, most critical code (and certainly interrupts) would go into COG, and then a more elastic amount of not-inner-loop code can go into HUBEXEC.
Managing and controlling that mix and its thresholds is a software tools problem.

It's been a long while since I've been back here hoping for current status of the Propeller 2 to be sold by Parallax. Where can I check the latest information (without searching through a ton of posts)?

tdg8934,
- about 7 weeks out from submission of P2 Verilog for synthesis
- Parallax 1-2-3 A9 FPGA board soon to be available for $475. Smaller A7 (10 cogs) $375 available now
- P2 FPGA image to be released very soon for 6 or so target boards, including the above

We'll only need about 300k gates, so planning for an 8 x 8 mm die with a 0.75 mm thick pad ring will give us all the room we need.

Sorry for the false alarm. Thanks for all your inputs. The consensus seems to be that big RAM is important.

P.S. You can figure the area savings of eliminating 8 cogs by cutting in half the 512x32 and 256x32 RAM areas (that comes to 2.35 and 0.75 mm2) and taking away the area of 150k gates (150k/827k x 10.6 mm2 = 1.92 mm2). Summing that all up: 2.35 + 0.75 + 1.92 = 5.0 mm2. That only saves about 8% of the die area, which is not much. Better to keep 16 cogs.
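
The arithmetic in the P.S. can be rechecked in a few lines. The numbers are the ones quoted in the post. The 8 x 8 mm die size comes from the earlier planning post and is assumed here to be the basis for the "about 8%" figure. The variable names are invented.

```python
# Figures quoted in the post (areas in mm2).
RAM_512X32_SAVED = 2.35     # half of the 512x32 RAM area
RAM_256X32_SAVED = 0.75     # half of the 256x32 RAM area
GATES_REMOVED = 150_000     # gates freed by dropping 8 cogs
GATES_TOTAL = 827_000       # gate count the 10.6 mm2 figure refers to
LOGIC_AREA = 10.6           # area of those 827k gates

DIE_SIDE_MM = 8.0           # assumption: the 8 x 8 mm die from the earlier post

logic_saved = GATES_REMOVED / GATES_TOTAL * LOGIC_AREA
total_saved = RAM_512X32_SAVED + RAM_256X32_SAVED + logic_saved
fraction_of_die = total_saved / (DIE_SIDE_MM ** 2)

print(f"logic area saved: {logic_saved:.2f} mm2")
print(f"total area saved: {total_saved:.2f} mm2")
print(f"share of die:     {fraction_of_die:.1%}")
```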

I had a two-hour Google Hangout meeting with Treehouse today and we went over the whole pad ring layout for Prop2. They did a really good job on it. Everything was in good order. I had them make the outer dimensions 8.5mm x 8.5mm, which is more than we should ever need, but will guarantee that we won't have to remove any features at the last minute during synthesis. I should have taken some screenshots. I'll ask them for some and will post them here.

I had to do a lot of Googling to find this post I quoted here. I was wondering just how much more room we have now with a bigger die.

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 2.3 mm2. That would enable outputting from one LUT section while updating another, without any glitching.

Wow, doubling HubRAM again! If that really fits I think many people would vote for it over other features. It begs the question though, given 72 - 25 = 47, how much cheaper would a 512kB HubRAM 7mm x 7mm die be?

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 3.2 mm2.

Do you mean they would be dual-ported to adjacent COGs, or just be able to extend standard COG memory?

I just meant that the streamer would be able to access the LUT from its own bus.

What you are talking about would certainly solve the cog-to-cog communication problem!

That would mean that there would be only 8 LUT blocks (one per pair of cogs). That may still be reasonable. Could you then go to 4-port RAM and provide both glitch-free streaming and cog-to-cog access?

Edit: also add one more event for writing to LUT address $1FE. That way, the paired cogs can use $1FE and $1FF for signaling.

If we could squeeze another 256KB of hub ram, this would be worthwhile.

For COGs, 16 of 32x32 DP RAM between adjacent cogs would give some serious and fast comms between cogs. Coupled with an interrupt, it would be even better. 16 of 32x32 DP RAM = 16 x 0.292 / 16 = 4.7 / 16 = 0.29 mm2.

For COGs, adding another 4KB or 8KB of SP RAM (2.4 or 4.8 mm2) would be nice. It would give a huge LUT for streaming, and a serious boost to cog/lut-exec space.
Presuming an additional 4KB, that would give 2 of 4KB blocks of LUT. It might be possible to be filling one 4KB block while streaming from the other 4KB block (i.e. pseudo dual-port).