KeyStone I Training: Multicore Navigator Overview

Welcome to the Keystone Multicore Navigator training. This training consists of three parts. This is the first part, which takes a look at the Multicore Navigator components.
This overview takes a look at two primary areas-- what is the Navigator, and what can the Navigator do? The first section, what is the Navigator, will provide a definition of the Navigator itself, take a look at the architecture, focus primarily on its two main components-- the Queue Manager Subsystem and the packet DMA-- and we'll also take a look at descriptors and queuing. As far as what the Navigator can do, we're going to take a look at three functions-- data movement, inter-processor communication, and job management.
Let's start by taking a look at what the Multicore Navigator is. First, let's take a look at the definition of the Multicore Navigator. As the name implies, the Multicore Navigator is a collection of components that is designed to facilitate the movement of data as well as the control of the multiple cores within the Keystone devices. There are several primary components within the Navigator domain. These include a Queue Manager, a series of specialized DMAs known as packet DMAs, the data structures which describe the packets and are called descriptors, as well as a set of APIs and registers which can be used to manipulate the descriptors in hardware.
Navigator is the primary way in which data moves through the Keystone devices. It's designed to be a fire and forget system. When we say fire and forget, what we mean is, up front, you do a lot of work to configure and initialize the system. And at runtime, you push and pop descriptors, and the system handles the rest without any CPU intervention.
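To make the fire-and-forget idea concrete, here is a minimal sketch of what the runtime push and pop look like, assuming the hardware mechanism described in the Multicore Navigator user's guide, where writing or reading a descriptor pointer through a queue's push/pop register does the work. The base address, register stride, and helper names are placeholders chosen for illustration; a real application would normally use TI's QMSS low-level driver APIs instead of raw register accesses.

```c
#include <stdint.h>

/* Placeholder base address and per-queue register stride -- assumptions for
 * illustration only; the real addresses are in the device data manual. */
#define QMSS_QM_BASE     0x02A00000u
#define QUEUE_REG_D(q)   (QMSS_QM_BASE + ((uint32_t)(q) * 0x10u) + 0xCu)

/* Push a descriptor: write its address to the queue's push/pop register. */
static inline void queue_push(uint32_t qnum, void *desc)
{
    *(volatile uint32_t *)QUEUE_REG_D(qnum) = (uint32_t)desc;
}

/* Pop a descriptor: read the same register; 0 means the queue is empty. */
static inline void *queue_pop(uint32_t qnum)
{
    return (void *)*(volatile uint32_t *)QUEUE_REG_D(qnum);
}
```

These two helpers are all the runtime interaction the host needs for the examples later in this overview; everything else is configuration done up front.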
Now, let's take a look at the Navigator architecture, starting with the block in the lower left-hand corner, the packet DMA. As you can see, there are multiple instances of the packet DMA shown here. The reason is that the packet DMA is a distributed system that exists within multiple IPs within the Keystone devices. Within the packet DMA, we can roughly divide the functionality into three sections. First, there's the transmit side, then there's the receive side, then there's the streaming interface control.
On the transmit side, the transmit scheduling controller is activated when an outside IP pushes a descriptor into the correct queue that corresponds to a transmit channel. When this happens, the Queue Manager activates the queue pend signal that corresponds to the IP that pushed the descriptor into the queue. Once the packet DMA channel has been activated, the transmit scheduling controller activates the transmit core to begin buffering data across the bus from memory into the transmit channel FIFO. The payload data stays in those FIFOs until the IP control block handshakes properly with the FIFO control to allow the data to be transferred from the FIFO control to the IP block, and we'll talk about that process in a little more detail as we move through this training.
Now, let's take a look at the receive side. The first components that are involved on the receive side are the Rx channel controller and FIFOs, which receive packets from the FIFO interface. The Rx core is responsible for transferring packets from the FIFO back into memory, and it is the responsibility of the Rx coherency unit to make sure that all memory writes are completed before the Rx core sends the final destination descriptor.
The third element of the packet DMA is the streaming interface. The streaming interface has two 128-bit buses. Each has a transmit and a receive side.
Data entering the IP goes through the transmit side-- that is, it comes from the transmit core. The transmit core pulls the data from memory and sends it through the transmit channel controller and FIFOs to the packet DMA controller in the IP. Data blocks coming from the IP go from the packet DMA controller instance in the IP through the receive side streaming interface to the Rx channel FIFOs, and the Rx core takes those data blocks and writes them to memory.
Next, let's take a look at the Queue Manager, which is in the lower right-hand corner of this diagram. One of the primary components of the Queue Manager Subsystem is the Queue Manager itself, which actually maintains 8,192 separate queues, and it maintains these using Link RAM. The internal Link RAM can support up to 16K descriptor entries. In addition, the Queue Manager also has, out in memory, an external Link RAM which can support up to half a million descriptors.
As you can see, the Queue Manager Subsystem also includes its own instance of the packet DMA, and this is a special instance of the packet DMA known as the infrastructure packet DMA. It's special because it has a streaming interface that is connected to itself and looped back. What that means is everything that's read into the transmit side is sent back out on the same clock on the receive side. The Queue Manager Subsystem also includes two PDSPs (packed data structure processors), which are used for descriptor accumulation. Basically, it's firmware that runs in the background, collects descriptors that are in queues, pushes them to a list in memory, and depending on how it's programmed, when certain conditions are met, the Interrupt Distributor generates queue interrupts, which can be accessed by the host software.
And finally, looking at the upper part of the diagram, the host sees the Queue Manager and packet DMA registers and accumulator command interfaces as regular read-write accesses across the VBUSP. The descriptor memory, buffer memory, and accumulation memory are normal memory accesses. The link RAM is readable by the host, but should not be accessed in most cases, as it's a private block that's designed for use by the Queue Manager. Queue interrupts to the host are provided by the chip level interrupt controller and the DSP interrupt controller.
All right, let's take a look at the next level of detail here for the Queue Manager Subsystem. Some of this will be a review of what we just talked about in the last slide. So the Queue Manager has a total of 8K hardware queues, and some of those are dedicated to queue pend signals. Queue pend signals are used to drive transmit channels in the packet DMAs. Some of the queues that have queue pend signals also drive chip level interrupt controllers.
There are 20 memory regions that are used for storing descriptors. Those are in local memory as well as in MSMC and external memory. And as mentioned, there are two linking RAMs-- one internal, which can hold up to 16K descriptors, and a second, external linking RAM that can be in either L2 or DDR memory and can hold up to 512K descriptors-- another half a million descriptors.
Now, each memory region can support a maximum of 32K descriptors, and with 20 regions at 32K each, that's a total of 640K. However, the limiting factor here is actually the number of bits that are available in the linking RAM to store the link indexes. So the major components, just to review again, are the Queue Manager, the packet DMA, and two packed data structure processors (PDSPs). The PDSPs contain firmware that is used for descriptor accumulation as well as the monitoring of queues; in addition, there is firmware available for load balancing and traffic shaping. Finally, there is the Interrupt Distributor module, which is specifically tailored to the Queue Manager Subsystem and is used by the PDSP firmware to generate interrupts. As an interesting side note, it can also be used independently for generating interrupts, which is a nice feature.
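To picture how the memory regions and linking RAM fit together, here is a hypothetical sketch of the information one descriptor memory region configuration carries. The structure and field names are illustrative, not the actual register layout, but the constraints they encode (20 regions, up to 32K same-size descriptors per region, linking RAM indexes) are the ones just described.

```c
#include <stdint.h>

#define QMSS_NUM_MEM_REGIONS 20   /* from the description above */

/* Hypothetical per-region configuration, for illustration only. */
typedef struct {
    uint32_t base_addr;     /* where this region's descriptors live              */
    uint32_t desc_size;     /* all descriptors in a region share one size        */
    uint32_t num_desc;      /* up to 32K descriptors per region                  */
    uint32_t start_index;   /* first linking RAM index used by this region       */
} mem_region_cfg_t;
```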
This table shows the queue mapping, which basically maps the 8K hardware queues to their specific functions. Taking a look at the hardware column, you'll see there are two primary types of queues. First, there are the PDSP firmware queues: queues 0 through 511 are dedicated to low priority accumulation, while queues 704 to 735 are dedicated to high priority accumulation.
The queue pend queues are queues that have a specific hardware line that connects them to a transmit channel in a specific IP. For example, queue range 512 to 639 contains 128 queues that connect to the antenna interface module-- the transmit queues for the antenna interface module. In addition, you can see that the SRIO has queue range 672 to 687, 16 channels. And just to understand, queue 672 activates transmit channel 0 of the SRIO, queue 673 activates transmit channel 1, and they continue that way in order.
And besides those two groups, we also have general purpose queues. As you can see, they're in the ranges 696 to 703 and 896 to 8191. There's a great number of these general purpose queues, which can be used for just about anything, but they're mostly used for free descriptor queues-- pools that you create and draw from when you want to load data into a transmit channel or receive data from a receive channel. It's also worth noting that, while the queue pend queues should not be used for any sort of general use, any of the other queues whose functions are not being used-- for example, the low priority accumulation queues-- can be used for any general purpose, including for free descriptors or anything of that sort.
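As a quick reference, here is the same mapping collected into a small set of constants. The queue numbers come straight from the table just described; the names are illustrative.

```c
/* Queue ranges from the mapping table above; names are illustrative. */
enum {
    QMSS_LOW_PRI_ACCUM_FIRST  = 0,     /* 0..511:    low priority accumulation   */
    QMSS_AIF_TX_FIRST         = 512,   /* 512..639:  antenna interface Tx queues */
    QMSS_SRIO_TX_FIRST        = 672,   /* 672..687:  SRIO Tx channels 0..15      */
    QMSS_GP_FIRST             = 696,   /* 696..703:  general purpose             */
    QMSS_HIGH_PRI_ACCUM_FIRST = 704,   /* 704..735:  high priority accumulation  */
    QMSS_GP_SECOND_FIRST      = 896    /* 896..8191: general purpose             */
};
```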
OK, moving on, let's take a look at the packet DMA topology. As I've mentioned before, the packet DMA is a distributed system. There are actually six instances of the packet DMA that can be present within the Keystone devices. There are two in the FFTCs on Keystone wireless devices, as well as one in the antenna interface, also on the Keystone devices for wireless applications.
In addition, all Keystone devices have a packet DMA instance in the Serial RapidIO (SRIO) and in the network coprocessor. And of course, the sixth packet DMA is the one that is located within Multicore Navigator as part of the Queue Manager Subsystem. That is the infrastructure packet DMA.
Something that's interesting to note here is that, unlike most DMAs, packet DMA is not concerned with payload format. For example, EDMA has concerns about dimensionality, but this is not true in the packet DMA. The packet DMA is unconcerned about such things. It just sends bytes.
Another interesting thing about the packet DMA is that the receive and transmit sides are fully independent. On the transmit side, the Tx core is triggered by the queue pend signals coming from the Queue Manager, so when a user or program pushes into a queue that is tied to a Tx channel, the channel for that core starts processing data if it's there. The behavior of the Tx core is determined by the fields that are programmed in the descriptors, so it's basically up to the user program to set those descriptor fields, and then the core will act upon those fields as it sees them. Within the Tx core is a four level round robin scheduler. Each channel can be set to one of four different levels of priority, and on each clock the scheduler performs a round robin across all the active channels and selects them in turn.
It's important to note that, if you have both high priority and low priority channels programmed, and the high priority channels continue to have data, they will starve the low priority channels-- and this is only considering the DMA itself. The IP that the DMA is connected to has the ability to override this priority and only enable the channels that it wants to see. In fact, some of them actually do this.
FFTC is a good example. FFTC will tell the DMA which channel it wants to receive on, and it'll disable certain channels so that other channels can come through. So from the DMA side, starvation is possible, but the IPs can override this and prevent starvation.
The antenna interface has a separate scheduler, so it has full control of the data that gets sent into the antenna interface. On the receive side, channel triggering is not by the Queue Manager, but by transactions coming in on the receive streaming interface. So that's one of the real big differences between the receive side and the transmit side, is how those channels are triggered.
On the receive side, the core behaves the way it does based on how you program this thing called an Rx flow, and we'll talk about that a little bit more in the packet DMA portion of the training. The basic idea here is that the data coming in on the streaming interface has no format. There is no descriptor, so the core doesn't know how to build the descriptor. The purpose of the Rx flow is to tell the core how to construct a descriptor, because when it's done receiving the packet, it has to push the descriptor into a queue, and the Rx flow is what gives it that information.
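As a preview, this sketch lists the kind of information an Rx flow carries. The structure and field names here are hypothetical stand-ins; the real flow is programmed through the packet DMA's Rx flow configuration registers (or through TI's CPPI low-level driver), which the packet DMA module of this training covers.

```c
#include <stdint.h>

/* Hypothetical Rx flow view, for illustration only. */
typedef struct {
    uint16_t dest_qnum;        /* queue the finished descriptor is pushed to   */
    uint16_t free_desc_qnum;   /* free descriptor queue the Rx core pops from  */
    uint8_t  desc_type;        /* e.g. 0 = host descriptor, 1 = monolithic     */
    uint8_t  sop_offset;       /* bytes to skip before writing payload data    */
    uint8_t  ps_location;      /* where protocol-specific words are placed     */
} rx_flow_cfg_t;
```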
Another part of the packet DMA we mentioned in the architecture are the two 128-bit streaming interfaces, one for Tx and one for Rx. This is how the packet DMA communicates with the IP block. And again, for the Queue Manager's infrastructure DMA, these are wired together in a loopback mode. And as mentioned previously, neither core cares about the payload format. The Tx core and the Rx core just send and receive the bits.
So we've mentioned this thing called a descriptor. A descriptor is what allows the packet DMA to send and receive packets. At a high level, a descriptor has two pieces. The first part is the header, which describes all the particular information that the packet DMA and the host will need to parse the packet, and the second is the payload itself.
There are two types of descriptors used by Navigator. The first is called the host descriptor. It provides a lot of flexibility, but it's a little bit more challenging to use. The other is the monolithic type. Monolithic types are less flexible, but they're easier to use.
Now, let's talk a little bit more about the host type descriptor. The part of the descriptor that is stored in the memory region is only the header. The header contains two pointers-- one pointer to the payload, which can be virtually anywhere in memory, and a pointer to the next host packet, and this way, we can chain host packet descriptors together to create a massive packet. There's basically no limit in linking these things together beyond the number of descriptors that you define within your application.
The next descriptor link in the last host buffer descriptor should be null, which identifies the end of the chain. The packet length of the host packet is determined by the sum of all the payloads; as part of the header fields, there's a packet length field that always holds that sum.
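To make the layout easier to picture, here is a simplified C view of a host descriptor header. It follows the general layout described in the Multicore Navigator user's guide, but the packing shown here is illustrative rather than register-exact.

```c
#include <stdint.h>

/* Simplified host descriptor header -- illustrative, not register-exact. */
typedef struct host_desc {
    uint32_t desc_info;        /* descriptor type and total packet length     */
    uint32_t tag_info;         /* source/destination tags                     */
    uint32_t packet_info;      /* return queue, flags, protocol-specific size */
    uint32_t buffer_len;       /* valid bytes in this link's buffer           */
    uint32_t buffer_ptr;       /* pointer to this link's payload buffer       */
    uint32_t next_desc_ptr;    /* next host buffer descriptor; 0 = end        */
    uint32_t orig_buffer_len;  /* original buffer size (for recycling)        */
    uint32_t orig_buffer_ptr;  /* original buffer pointer (for recycling)     */
} host_desc_t;
```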
The monolithic descriptor is a little bit different. In the descriptor memory that you set aside, each descriptor contains both the header and the payload. This constrains the size of the payload. It also means that all the payload buffers are the same size, because when we define the memory regions, all the descriptors within each region are defined to be the same size. So it's a little bit more restrictive, but it's easier to use because you don't have the buffer links to worry about.
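By contrast, a monolithic descriptor keeps the header and payload in the same block of descriptor memory, roughly as in this sketch. The header fields and the 128-byte descriptor size are example assumptions, not the exact hardware layout.

```c
#include <stdint.h>

#define MONO_DESC_SIZE    128                        /* example region descriptor size */
#define MONO_HEADER_SIZE  16                         /* example header + data offset   */
#define MONO_PAYLOAD_MAX  (MONO_DESC_SIZE - MONO_HEADER_SIZE)

/* Simplified monolithic descriptor -- illustrative, not register-exact. */
typedef struct mono_desc {
    uint32_t desc_info;                  /* type, packet length, data offset */
    uint32_t tag_info;                   /* source/destination tags          */
    uint32_t packet_info;                /* return queue, flags              */
    uint32_t reserved;                   /* pad out to the payload offset    */
    uint8_t  payload[MONO_PAYLOAD_MAX];  /* payload lives inside descriptor  */
} mono_desc_t;
```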
This diagram shows a few examples of how we queue descriptors together. Looking at the left column, we have a host packet descriptor, which is linked to a few more host buffer descriptors. A host packet descriptor and a host buffer descriptor are virtually the same thing. The exception is that the host packet type has a few more fields and always appears at the start of the packet, in the first position of the chain, so we have this notion of start of packet and end of packet, as shown here by SOP, MOP, and EOP.
So SOP is Start of Packet. MOP is Middle of Packet. EOP is End of Packet. So in the SOP position for host descriptors, you'll always find the host packet type, and all the link buffers that follow that will always be host buffer descriptors. And if you look at the definitions of these types in the Multicore Navigator user's guide, you'll see that they're virtually identical, except that the host packet type has a few more fields in it, and those fields are only needed once.
In this first chain, we see that the descriptors are linked together by their next descriptor pointers. Each also has a buffer pointer, which points to its payload. It's important to note here that what is actually queued into the Queue Manager is only the start of packet descriptor. We don't push each link in the chain into the Queue Manager-- just the first link. The remaining payloads are maintained by the links that are in the descriptors themselves.
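A minimal sketch of that chaining, reusing the hypothetical host_desc_t and queue_push() from the earlier sketches: link the buffers through their next descriptor pointers, record the total packet length in the SOP descriptor, and push only the SOP.

```c
#include <stdint.h>

/* host_desc_t and queue_push() as defined in the earlier sketches. */
extern void queue_push(uint32_t qnum, void *desc);

void send_chained_packet(host_desc_t *sop, host_desc_t *mop, host_desc_t *eop,
                         uint32_t tx_qnum)
{
    sop->next_desc_ptr = (uint32_t)mop;   /* SOP -> MOP                      */
    mop->next_desc_ptr = (uint32_t)eop;   /* MOP -> EOP                      */
    eop->next_desc_ptr = 0;               /* null link terminates the chain  */

    /* Packet length is the sum of the linked buffer lengths; in the real
     * descriptor this field shares a word with the descriptor type bits. */
    sop->desc_info |= sop->buffer_len + mop->buffer_len + eop->buffer_len;

    /* Only the start of packet descriptor is ever pushed into the queue. */
    queue_push(tx_qnum, sop);
}
```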
The second descriptor we have shown is a monolithic descriptor, and again, it has no payload pointer because the payload is contained within the descriptor. The third descriptor that is queued is another host packet type. It has a null next descriptor pointer, which means it has nothing linked to it, yet it does have a valid payload pointer. So shown across the top in the dashed line arrows are the links to what is actually queued in the Queue Manager, so when you go to look at the link RAM-- and we'll discuss that in the Queue Manager part, which is another lesson-- you'll see that only those nodes are linked and are seen in the linking RAM.
It's also good to note that this is just an example-- normally, you do not mix descriptor types in the same queue as shown here. Normally, each queue would contain only host type descriptors or only monolithic type descriptors. It's OK to mix them on the transmit side, but on the receive side, it can be a little dangerous. The problem is that when the Rx DMA pops a descriptor to get a buffer, if it's expecting one type and it gets another, errors can happen. Especially on the receive side, it's best to keep your queues dedicated to one type, either host or monolithic.
All right, that concludes the overview portion. Now, let's take a look at the specific Navigator functionality. Navigator functionality can be organized into three major areas. First and foremost is data movement. We can move data in and out of peripherals, we can do core to core transfers using the infrastructure or the Queue Manager packet DMA, and we can chain transfers together from one packet DMA to another packet DMA.
The second major area is inter-processor communication, where we can do things like synchronizing tasks and cores using the Queue Manager. We can also do zero copy transfers using notification, again using the Queue Manager. The third area is job management. Again, using the Queue Manager, we can do resource sharing and load balancing, and we'll get into that in a little bit more detail as we continue.
So let's talk about normal data movement cases first. The first case is when we move data in or out of a peripheral. The diagram on the top right is an example of this. What we're doing is a simple transmit into a peripheral.
So the way it works is we have a Tx free queue that has been initialized with descriptors, and each of those descriptors has an associated buffer. The host program will pop that free queue, and using the pointers provided by the pop, it'll fill the buffer with the data it wants to send and push the descriptor into the transmit queue. This lights up the queue pend signal to the Tx DMA, which causes it to wake up, pop the descriptor from the Tx queue, load the data in, and push it through the streaming interface to the peripheral. From that point, what happens to the data is up to the peripheral, whether it consumes it-- for example, if it's the FFTC, it'll do a forward or reverse transform on it and then send the data out the receive side.
If it's the Queue Manager packet DMA, then the data gets shipped immediately out, and that's exactly what's shown in the diagram on the bottom right. It's an infrastructure, or core to core, transfer, and in this case, what was described previously for the transmit happens identically. The user has set aside a transmit free queue that has been initialized with descriptors. He pops a descriptor from the Tx free queue, loads the buffer data, and pushes the descriptor onto the transmit queue. This activates the transmit DMA, which reads the data in and ships it across the streaming interface, which automatically activates the receive side because, remember, in this case, the streaming interface is connected in loopback mode.
The Rx DMA wakes up, and it pops a descriptor from the Rx free queue, which provides a buffer for it to write into. It takes the data that was sent over the streaming interface and writes it to that buffer. When all the data has been buffered, it then pushes the descriptor to the destination, or receive, queue. So that, in a nutshell, is the pushing and popping mechanism of doing a simple transfer through the infrastructure packet DMA.
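Here is a minimal end-to-end sketch of that push/pop sequence for a core-to-core transfer, built on the hypothetical queue_push()/queue_pop() helpers and host_desc_t type from the earlier sketches. The queue numbers are placeholders, and process_payload() simply stands in for whatever the receiving core does with the data.

```c
#include <stdint.h>
#include <string.h>

/* host_desc_t, queue_push(), and queue_pop() as in the earlier sketches. */
extern void  queue_push(uint32_t qnum, void *desc);
extern void *queue_pop(uint32_t qnum);
extern void  process_payload(void *buf, uint32_t len);   /* application hook */

#define TX_FREE_Q  2048u   /* placeholder queue numbers for illustration      */
#define TX_Q        800u   /* infrastructure DMA transmit (queue pend) queue  */
#define RX_Q       2050u   /* destination (receive) queue                     */
#define RX_FREE_Q  2051u   /* Rx free descriptor queue                        */

int producer_send(const void *data, uint32_t len)    /* runs on the sending core */
{
    host_desc_t *desc = queue_pop(TX_FREE_Q);        /* 1. pop a free descriptor */
    if (desc == NULL)
        return -1;                                   /*    free queue is empty   */
    memcpy((void *)desc->buffer_ptr, data, len);     /* 2. fill its buffer       */
    desc->buffer_len = len;
    queue_push(TX_Q, desc);                          /* 3. push; the queue pend  */
    return 0;                                        /*    signal wakes the DMA  */
}

void consumer_receive(void)                          /* runs on the receiving core */
{
    host_desc_t *desc;
    while ((desc = queue_pop(RX_Q)) == NULL)         /* 4. wait for the Rx DMA to  */
        ;                                            /*    push a completed desc   */
    process_payload((void *)desc->buffer_ptr, desc->buffer_len);
    queue_push(RX_FREE_Q, desc);                     /* 5. recycle the descriptor  */
}
```

In a real system the consumer would typically use an accumulator interrupt rather than polling the destination queue, but the push/pop pattern is the same.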
Another type of data movement is what we call chaining, which is where we use the output from one packet DMA to trigger the second packet DMA, and as you can see by the diagram, this requires a little extra consideration and initialization. So starting with the producer and working our way down to the consumer, this is what happens. First, the producer has a Tx free queue set aside. He'll pop a descriptor from there, which will give him a buffer to fill. Once he's filled that buffer, he'll push that descriptor onto the Tx queue, which will trigger the transmit DMA in the first peripheral to start transmitting the data.
When that transfer is done, the streaming interface within that first peripheral will push that data out to the receive side DMA. The DMA will go fetch an Rx free descriptor from the free descriptor queue. That gives him a buffer to write to. He takes the data that came in across the streaming interface and writes it to the buffer.
When he's done, he pushes that descriptor to his Rx destination queue, and this is the point where the chaining comes in. The Rx destination queue is the same queue as the transmit queue for the second peripheral, so the act of pushing it by peripheral one causes it to be popped by peripheral two. So peripheral two will pop the descriptor, load the buffer in, perform his process on the data, and then send it out to the receive side DMA. That DMA will pop its own free descriptor, write the data out to that corresponding buffer, and will push the descriptor to his Rx destination queue, at which point the consumer can pop that queue, consume the data, and recycle that descriptor by pushing it back into the free queue. So there's a lot of data manipulation and data transfer going on, but all that's happening in the background between the producer and the consumer with no host intervention whatsoever.
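The chaining itself comes down to one configuration choice, sketched here with the hypothetical rx_flow_cfg_t from the earlier sketch: program peripheral one's Rx flow so that its destination queue is the queue pend (transmit) queue of peripheral two. The queue numbers are placeholders for illustration.

```c
#define PERIPH2_TX_QUEUE  800u    /* placeholder: peripheral 2's Tx queue pend queue       */
#define PERIPH1_RX_FDQ   2052u    /* placeholder: peripheral 1's Rx free descriptor queue  */

/* Pushing a completed Rx descriptor here triggers peripheral 2's transmit side. */
static const rx_flow_cfg_t periph1_flow = {
    .dest_qnum      = PERIPH2_TX_QUEUE,
    .free_desc_qnum = PERIPH1_RX_FDQ,
    .desc_type      = 0               /* host descriptors */
};
```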
For inter-processor communication, we're basically using a Queue Manager without using a packet DMA. The first mode of this is synchronization, and the first example is shown here on the right. Here, we're using queues as a sync resource. In this example, the master of the sync will pop from a free queue and push to a sync queue.
The slave will do the reverse. He will poll, or wait for an interrupt on, that sync queue, and when the descriptors arrive from the master, he'll pop them and push them back onto the free queue. It's worth noting that there can be multiple slaves hanging off the sync. For example, the master can be core zero, and all the remaining cores can be slaves, so they all wait for core zero to complete something before they move on.
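A minimal sketch of that queue-based synchronization, using the hypothetical helpers and placeholder queue numbers from the earlier sketches: the master pushes one token descriptor per slave onto a sync queue, and each slave waits until it can pop one, then recycles it.

```c
#include <stdint.h>

extern void  queue_push(uint32_t qnum, void *desc);   /* earlier sketches */
extern void *queue_pop(uint32_t qnum);

#define SYNC_FREE_Q  2060u   /* placeholder queue numbers */
#define SYNC_Q       2061u

void master_signal(int num_slaves)              /* e.g. core 0 */
{
    for (int i = 0; i < num_slaves; i++) {
        void *token = queue_pop(SYNC_FREE_Q);   /* any descriptor serves as a token */
        queue_push(SYNC_Q, token);
    }
}

void slave_wait(void)                           /* remaining cores */
{
    void *token;
    while ((token = queue_pop(SYNC_Q)) == NULL) /* or wait for a queue interrupt */
        ;
    queue_push(SYNC_FREE_Q, token);             /* recycle the token */
}
```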
This can also be done a few other different ways, one of them being that we can write directly to the Queue Manager's Interrupt Distributor. By writing directly to its registers, we can activate an interrupt. Another way would be to write to the queues that are set aside to activate the chip level interrupt controller. So these are some other ways of doing synchronization.
The other flavor of IPC is notification, which is basically doing a synchronization, but with a zero copy message that we're notifying a consumer to go get. In this case, the producer writes a buffer of data out to memory, triggers a sync, and the consumer knows where to get the data, so there's no copying of data through a packet DMA. The diagram on the right shows the descriptor being used to buffer the data, which is one method of doing it.
You can just use the descriptor header to push the descriptor buffer address across. You don't necessarily have to use the full descriptor mechanism as long as both consumer and producer know where to go get the data when a sync queue is pushed. So those are a few ways of doing notification and synchronization using the Queue Manager.
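For completeness, here is a small sketch of zero-copy notification along the lines just described: the producer writes the message in place and pushes a descriptor whose buffer pointer names that memory, and the consumer reads it where it sits. The types, helpers, and queue numbers are the hypothetical ones used in the earlier sketches.

```c
#include <stdint.h>

/* host_desc_t, queue_push(), and queue_pop() as in the earlier sketches. */
extern void  queue_push(uint32_t qnum, void *desc);
extern void *queue_pop(uint32_t qnum);

#define MSG_Q       2070u   /* placeholder notification queue    */
#define MSG_FREE_Q  2071u   /* placeholder free descriptor queue */

void notify(void *shared_buf, uint32_t len)   /* producer side */
{
    host_desc_t *d = queue_pop(MSG_FREE_Q);
    d->buffer_ptr  = (uint32_t)shared_buf;    /* point at the data; nothing is copied */
    d->buffer_len  = len;
    queue_push(MSG_Q, d);                     /* the push itself is the notification  */
}
```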
Another area of Navigator functionality is job management. In this mode, we use the Queue Manager with a scheduler to perform some kind of task, and the two main variations of this are resource sharing and load balancing. In resource sharing, we basically have multiple producers who want to use the same single resource, so we use a number of queues and a scheduler. As the producers push the data or tasks that they need to run into those queues, the scheduler decides who gets to use the resource next and pushes into another queue, which drives the resource.
In load balancing, we do the reverse. We have a single job stream, and we have multiple resources. In this case, the scheduler's task is to decide which of the resources is least busy, and then send that job or task or item to that queue.
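Here is a software-style sketch of that load balancing decision, assuming a hypothetical queue_entry_count() helper that reads a queue's entry count (the Queue Manager does expose per-queue status, but this helper and the queue numbers are placeholders): pop a job from the single input queue and push it to whichever resource queue is least busy.

```c
#include <stdint.h>

extern void      queue_push(uint32_t qnum, void *desc);   /* earlier sketches    */
extern void     *queue_pop(uint32_t qnum);
extern uint32_t  queue_entry_count(uint32_t qnum);        /* hypothetical helper */

#define JOB_Q        2080u                                /* placeholder queues  */
#define NUM_WORKERS  4

static const uint32_t worker_q[NUM_WORKERS] = { 2081u, 2082u, 2083u, 2084u };

void balance_once(void)
{
    void *job = queue_pop(JOB_Q);
    if (job == NULL)
        return;                                /* no work pending              */

    uint32_t best = 0;
    for (uint32_t i = 1; i < NUM_WORKERS; i++)
        if (queue_entry_count(worker_q[i]) < queue_entry_count(worker_q[best]))
            best = i;                          /* pick the least busy resource */

    queue_push(worker_q[best], job);
}
```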
So those are some examples of many to one and one to many job management. A distributed multi-scheduler is another variation, but we'll cut that short here. OK, so this concludes part one of the Multicore Navigator training, the Multicore Navigator overview. For more information, please refer to the documentation shown, and if you continue on to the next two modules, the Queue Manager Subsystem and the packet DMA, we'll go into those two components in a bit more detail.

Details

Date:
November 9, 2010

Multicore Navigator Overview provides an introduction to the architecture and functional components of the Multicore Navigator, which includes the Queue Manager Subsystem (QMSS) and Packet DMA (PKTDMA).