Topic: Reflective memory

A number of years ago (okay, ancient history in the electronics world) VMIC developed something called Reflective Memory for VME embedded controller systems. Basically it was a serially connected dual-ported RAM. It was the "hot ticket" back in the 80s and 90s.

Has anyone ever heard of something like this being implemented on a microcontroller chip?

The application has run out of internal RAM and Flash. It can be "divided" in two, with a relatively small amount (1KB?) of reflective memory on two different chips. Reflective memory will work because data updates do not need to be instantaneous, but zero overhead is a requirement.

Need to remember that you can write to more than one memory chip at a time.

Without this type of memory, each CPU could have some local dual-port RAM with one port connected to the system bus. Any write to this memory then has to go via the system bus to keep all local copies up to date.

With dual-port RAM chips you can do this if you prevent a read while a write is taking place to the same address location.

Each CPU could have some local dual-port RAM with one port connected to the system bus. Any write to this memory then has to go via the system bus to keep all local copies up to date.

Very true.

The application is extremely cost sensitive. Basically, it is a single board (and currently single controller chip) computer. There is no bus. All memory is on the embedded controller chip. The fewer the number of pins used the better (the rest of the pins are used for data acquisition and control).

You're not giving much information! What is the MCU and clock speed you're talking about? How many pins do you have free? You talk about wanting close to DPR speed, so are you trying to compare serial with parallel access?

VME reflective memory has nothing to do with your problem unless you wish to implement a very pin-hungry VME interface on your MCU! The company you are referring to makes boards for VME systems, not memory chips.

You're not giving much information! What is the MCU and clock speed you're talking about? How many pins do you have free? You talk about wanting close to DPR speed, so are you trying to compare serial with parallel access?

I just want to know if it has been done before.

Ultimately, it would be a semi-custom chip and high speed serial, about the same speed as one lane of PCI-E, would be adequate. We are talking "chip-to-chip" transfers with no external transceivers/drivers.


Have a look at the XMOS processors. They have hard realtime scalable multicore hardware and software, where

"hard" means the IDE examines the optimised binary to define the maximum program times; there's none of this rubbish run-it-and-hope you see the worst case

"multicore" means up to 32 cores/chip (i.e. 4000MIPS/chip)

"scalable" means both on-chip and across chips

As an added benefit the I/O is "FPGA like", i.e. it contains multiple programmable clocks, SERDES for 250Mb/s per port, plus each port has timers defining when input did occur or output will occur with 4ns resolution.

Most importantly, the programming environment is theoretically sound, being based on Hoare's CSP (communicating sequential processes) and C (with the ill-defined bits omitted).

There are lies, damned lies, statistics - and ADC/DAC specs. Gliding aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span". Having fun doing more, with less.

Have a look at the XMOS processors. They have hard realtime scalable multicore hardware and software, ...

VERY INTERESTING! The design is such that each task has its own core yet can communicate with other cores. I would like to know more about how their XCONNECT switch handles off-chip messaging.

Also, I find the following statements hard to believe based on my knowledge of chip design:

Each tile contains local SRAM memory, which is shared between all cores on that tile for code and data

Each scheduled core has an allocated slot to access the memory in a single cycle

The xCORE memory will always respond within the allocated cycle

Execution out of RAM is $$$ for large applications.

One BIG benefit for "reflective memory" is that you double the amount of RAM and Flash available to the total application. Yes, it will require some forethought on how to divide the tasks to best utilize these resources.

Clearly a "reflective memory" serial communication channel would be set up for full duplex. Differential drivers may not be required if the link is going to another device on the same board. Yes, this is starting to look like a SPI network on steroids, so clearly it could also be used for "intelligent" I/O chips (just map their registers into the controller chip's memory space) and multiple channels could be used for devices with different latencies.

Have a look at the XMOS processors. They have hard realtime scalable multicore hardware and software, ...

VERY INTERESTING! The design is such that each task has its own core yet can communicate with other cores.

They are, aren't they! Most MCUs are very similar to each other, and most languages are inherently serial, not parallel. It is rare to find some that are significantly different and, most importantly, unify the hardware and the software.

I have only "kicked the tyres" with a simple design, but I found their documentation remarkably simple, clear, and without strange "gotchas". I haven't found any bugs either.

It is worth realising that the concepts are old and have stood the test of time: CSP is from the 70s, hardware for CSP is from the 80s (Transputer), software for CSP from the 80s (Occam), and XMOS xCore is a decade or so old. Many CSP/Occam concepts are re-materialising in modern languages such as Go and Rust (but I've used neither).

Prof. David May has been involved in all of that, and has avoided past problems.

Quote

I would like to know more about how their XCONNECT switch handles off-chip messaging.

I am not an expert and have not investigated this, however I suspect I can offer a few pointers:

within a tile all cores/tasks share the same memory on a timesliced basis - think of SMT

within a tile, inter-task comms can be implemented either via a comms channel or via shared memory

across tiles, comms have to be implemented by copying memory using a comms channel; clearly this adds latency

xC ensures all that is transparent. There are restrictions to ensure correctness, e.g. no cross-task aliasing of memory

if comms can occur between tiles on the same chip, extension to comms between tiles on different chips is trivial. ISTR it requires 5 wires and a serialisation protocol, but you would be wise to verify that

both I/O and inter-task comms use the same language primitives and xCONNECT. That works very pleasantly - just "think of" the I/O port as a different task.

For further information, dig around on the XMOS website and forum. I've found these documents particularly useful, but others are more directed at your questions:

XMOS-DIY-USB.pdf
XMOS-Introduction-to-XS1-ports_3.pdf
XMOS-Programming-Guide-_documentation_F-2.pdf
XMOS-XCC-Command-Line-Manual_X6904A.pdf
XMOS-XC-Reference-Manual_8.7-[Y-M].pdf
XMOS-XS1-Architecture_1.0.pdf
XMOS-XS1-Assembly-Language-Manual_8.7-[Y-M].pdf
XMOS-xTIMEcomposer-User-Guide-14_14.x.pdf

Quote

Also, I find the following statements hard to believe based on my knowledge of chip design:

Each tile contains local SRAM memory, which is shared between all cores on that tile for code and data

Each scheduled core has an allocated slot to access the memory in a single cycle

The xCORE memory will always respond within the allocated cycle

Execution out of RAM is $$$ for large applications.

This is aimed at hard realtime embedded programming, not general purpose systems. See digikey for available processors.

I would question the necessity of having very large memory shared between many cores/tasks. Everything I've heard leads me to believe that "high performance computing" is heavily based on message-passing between separate non-shared memory computers. I believe that general purpose systems will have to go that route, but it will take a generation of kicking and screaming by people wedded to existing languages and implementations.

Although the HPC-vs-xCORE/xC details are very different, many of the high-level paradigms are similar: if you can "think" in one, you can "think" in the other. (The same can be said of Java-vs-C#, for example).

Quote

One BIG benefit for "reflective memory" is that you double the amount of RAM and Flash available to the total application. Yes, it will require some forethought on how to divide the tasks to best utilize these resources.

The number of cores/tasks is a hard limit - with exceptions! Given some reasonable restrictions on how a task is coded, the compiler can silently combine several tasks onto the same core. Essentially this comes down to sequentially merging all the "setup()" parts of the tasks, and merging the "while (1) {select...}" parts into a single select statement.

Quote

Clearly a "reflective memory" serial communication channel would be set up for full duplex. Differential drivers may not be required if the link is going to another device on the same board. Yes, this is starting to look like a SPI network on steroids, so clearly it could also be used for "intelligent" I/O chips (just map their registers into the controller chip's memory space) and multiple channels could be used for devices with different latencies.




If you are talking about bigger systems like a PC or up, maybe you can give RDMA a try? You can grab used single-port or dual-port Infiniband cards for fairly cheap, and for a small (2-node to 3-node) IB RDMA network you can use point-to-point connections instead of hunting down an expensive IB switch.

I am not an expert and have not investigated this, however I suspect I can offer a few pointers:

within a tile all cores/tasks share the same memory on a timesliced basis - think of SMT

within a tile, inter-task comms can be implemented either via a comms channel or via shared memory

across tiles, comms have to be implemented by copying memory using a comms channel; clearly this adds latency

Your first bullet bothers me! If it is a fixed timeslice, then tasks that are small will waste a lot of time doing nothing. Also, how do you deal with large tasks, or tasks that are triggered by an external event (i.e. an interrupt)?

You're not giving much information! What is the MCU and clock speed you're talking about? How many pins do you have free? You talk about wanting close to DPR speed, so are you trying to compare serial with parallel access?

I just want to know if it has been done before.

Ultimately, it would be a semi-custom chip and high speed serial, about the same speed as one lane of PCI-E, would be adequate. We are talking "chip-to-chip" transfers with no external transceivers/drivers.

Tie two ethernet MACs together and you have a fast and DMA-capable data interchange system. In modern microcontrollers you usually have the ethernet RAM on a different bus, so the DMA transfers between the ethernet controller and the ethernet buffer memory don't interfere with the microcontroller fetching instructions and data (except when accessing the ethernet buffer RAM). With a relatively simple CPLD/FPGA in between which acts as a hub you could even create a system where data is shared between several devices.



There are small lies, big lies and then there is what is on the screen of your oscilloscope.

Tie two ethernet MACs together and you have a fast and DMA-capable data interchange system. In modern microcontrollers you usually have the ethernet RAM on a different bus, so the DMA transfers between the ethernet controller and the ethernet buffer memory don't interfere with the microcontroller fetching instructions and data (except when accessing the ethernet buffer RAM).

Maybe ...

I think this still has too much overhead. I would have to see some low-level simulation of the ethernet controller to convince me that this would work.

The goal is ZERO overhead for each processor and no external-to-the-chip drivers.

I am not an expert and have not investigated this, however I suspect I can offer a few pointers:

within a tile all cores/tasks share the same memory on a timesliced basis - think of SMT

within a tile, inter-task comms can be implemented either via a comms channel or via shared memory

across tiles, comms have to be implemented by copying memory using a comms channel; clearly this adds latency

Your first bullet bothers me! If it is a fixed timeslice, then tasks that are small will waste a lot of time doing nothing. Also, how do you deal with large tasks, or tasks that are triggered by an external event (i.e. an interrupt)?

Regarding the last bullet, how much memory is copied?

Interrupts? What are they? You can't guarantee timings to the clock cycle if you have either interrupts or caches. I don't know what you mean by a small or large task. A task is a unit of computation started by a message or I/O, and which has to be completed before the next message or I/O occurs.

Typically you have one task/core dedicated to an I/O peripheral; when there's nothing to do, the core sleeps. The task resumption latency is low; in my application I see it as being instantaneous (10ns), but I believe XMOS states <100ns. Think of it as having the RTOS in hardware.

Each core has a 100MHz clock and executes one instruction every 10ns. The chip runs at 500MHz. Thus there can be five 100MHz tasks running "simultaneously" in a tile. If you have 8 tasks then the IDE will pessimistically assume they are all running at full speed and will indicate the obvious increase in execution time. (In practice, cores are often waiting for I/O or messages, and consume zero execution time.)

The amount of memory copied is defined by the message you send; 1 byte upwards.

