LIW Network Processor Core

I spent about 25 years of my working career using, designing, or testing general purpose processors, from mainframes to microprocessors. I recently had an opportunity to work on a networking chip, and this is what I discovered about the processors inside such chips. The purpose of this article is to document my rationale for the approach I selected. In support of my thinking, I’ve found products from some of the biggest networking companies that employ the same technique: LIW (long instruction word) processors.

The product that I worked on is a host channel adapter, a device that connects a host system to a network fabric. It is my perception that a channel adapter needs more computational flexibility than a switch or router, since it may need to navigate complex data structures in host memory. The channel adapter deals mostly with the layer 4 (transport) protocols.

The first version of this device was already designed and used a custom processor for its intelligence, clocked at 125 MHz. To this processor’s credit, it had some specialized logic that accelerated the encoding and decoding of packet headers. The processor came as a soft macro and consumed about 1 mm^2 of die area when synthesized.

For the second version of this device we were asked to investigate using the company’s well known embedded processor, which came as a hard macro, clocked at 800 MHz, and consumed about 15 mm^2. Additionally, this processor had all the advanced features of general purpose processors, such as highly set-associative caches, a PCI bus interface, and low-power design. At first glance it looked like a definite win in spite of the larger area, since this processor should have at least a five-fold performance advantage and a mature software development tool chain.

One of the first optimizations we wanted to make for this design was to unburden the processor from doing bulk data movement. We would dedicate that task to DMA units. This is when the first problem arose: there is no external visibility into the embedded processor’s data cache, and this processor is not multiprocessor ready. To prevent the DMA from moving stale data, code would have to explicitly evict modified data from the data cache to an external memory. The alternative would be to program the data cache for write-through mode, but this would increase bus activity.

One of the challenges in designing an efficient channel adapter is dealing with long host memory latency. To hide this latency, each processor needs multiple messages to work on. When processing reaches a point where a host memory access is required, the processor should program an external agent to fetch the data and switch to another message whose data is already available. The number of distinct messages, or threads, needed to keep the processor busy is roughly related to the ratio of host memory access time to on-chip memory access time, and there is usually more than an order of magnitude difference between the two. We were able to segment the message/thread processing into about a dozen steps, with each step usually ending with a host memory access and a switch to the next message/thread. Each segment was fairly short, which is where the next problem became apparent. The embedded processor required about 10 clocks to access external data, a significant latency when each processing segment was estimated to average 50 clocks. Another problem is that PCI is significantly slower than the chip’s synchronous SRAMs. It was becoming obvious that the embedded processor’s performance would be limited by its access to on-chip memory, and that it might actually deliver less performance than expected while consuming fifteen times more die area.
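
The latency-hiding argument above can be put into rough numbers. The following is only a back-of-the-envelope sketch, not the model used in the original project: it assumes every segment has the same length and that a waiting message becomes runnable as soon as its host data arrives. The function names and the 500-clock host latency are hypothetical; only the 10-clock access time and the 50-clock segment length come from the text.

```python
import math

def threads_to_hide_latency(host_latency_clocks, segment_clocks):
    """Messages needed so one is always runnable while the rest wait.

    While one message waits host_latency_clocks for its data, each of the
    other messages runs one segment, so we need
    (n - 1) * segment_clocks >= host_latency_clocks.
    """
    return 1 + math.ceil(host_latency_clocks / segment_clocks)

def access_overhead(access_clocks, segment_clocks, accesses_per_segment=1):
    """Access stall time expressed as a fraction of one segment's length."""
    return accesses_per_segment * access_clocks / segment_clocks

# Hypothetical host latency of 500 clocks, 50-clock segments (from the text).
print(threads_to_hide_latency(500, 50))
# One 10-clock external access per 50-clock segment (both from the text).
print(access_overhead(10, 50))
```

Under these assumptions a single external access already costs a fifth of a segment, which is why the access latency matters more here than the raw clock rate.
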
We started looking at the features the embedded processor offered that might be irrelevant to the task at hand. Encoding or decoding packet headers needs only simple arithmetic, logical, shift, and mask operations. I thought that gate-intensive arithmetic units like multiplier arrays were probably not necessary. We could get by with shift-and-add algorithms for the few times we need to multiply. Most likely such a computation would be used to index into an array, and if we were smart we would organize our array elements to be some power of two long. One could argue for a memory management unit, but it would add to access latency, so for this closed environment I thought we could leave it out. Dynamic branch prediction can achieve prediction success rates of 99%, but for the embedded environment a rule of thumb is that static branch prediction can get you to 90%. Since our goal was a short pipeline, we opted for static prediction and avoided having to support a branch history table. I might also be able to use the prediction bit to help speculatively prefetch instructions from code store. Since the on-chip SRAMs are roughly the same speed as the embedded processor’s caches, I decided that we didn’t need to maintain an instruction cache.
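
To make the multiplier trade-off concrete, here is a minimal sketch of the textbook shift-and-add multiply, together with the power-of-two indexing trick that avoids multiplication altogether. It is written in Python for readability and is not the processor's actual microcode; the 32-bit width and the function names are assumptions.

```python
def shift_add_multiply(a, b, width=32):
    """Multiply two unsigned values using only shift, add, and mask.

    The result wraps to `width` bits, as a fixed-width ALU would.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    while b:
        if b & 1:                      # low bit of multiplier set: add
            product = (product + a) & mask
        a = (a << 1) & mask            # multiplicand doubles each step
        b >>= 1                        # move to the next multiplier bit
    return product

def element_offset(index, log2_element_size):
    """Byte offset of an array element whose size is a power of two."""
    return index << log2_element_size  # one shift, no multiply

print(shift_add_multiply(13, 11))      # same result as 13 * 11
print(element_offset(5, 6))            # element 5 of 64-byte entries
```

The loop runs once per significant bit of the multiplier, so small multipliers are cheap, and with power-of-two element sizes the index calculation collapses to a single shift.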

Next we examined what types of activities a network processor is likely to be involved in. It does three things: it moves data, encodes/decodes headers, and makes decisions. These operations are not complicated, but the processor should do them quickly, which led us to consider combining all three operations into a single instruction. We ended up with a process, memory, and branch operation in each instruction. Some of the fields ended up having multiple uses; for instance, the branch address can be used as a literal for process operations. When encoding or decoding headers, the processor needs to access only small amounts of data that are used once. I decided to use only a line cache that, after being loaded, could be read or written several times before being transmitted to memory. We made the write register (line cache) write combining so that we could assemble the header over several process operations. It is also possible to incorporate hard-wired logic to accelerate header encoding/decoding. We wanted to replace the data cache with on-chip synchronous SRAM connected to the processor by a crossbar. The crossbar would give priority access to the processor but allow external agents such as a DMA to access it as well.
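
The write-combining behavior of the write register can be sketched in software. This is only an illustrative model under assumed parameters: the class name, the line size, and the byte-enable mask returned by `flush` are hypothetical and are not taken from the actual design.

```python
class WriteCombiningLine:
    """Toy model of a write register that merges several partial writes
    into one line before it is sent to memory."""

    def __init__(self, size=32):           # line size in bytes (assumed)
        self.size = size
        self.data = bytearray(size)
        self.valid = [False] * size         # which bytes have been written

    def write(self, offset, payload):
        """Merge `payload` into the line at `offset`; no memory traffic."""
        if offset < 0 or offset + len(payload) > self.size:
            raise ValueError("write falls outside the line")
        self.data[offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            self.valid[i] = True

    def flush(self):
        """Return (line, byte_enable_mask) as a single memory write,
        then clear the register for the next header."""
        mask = sum(1 << i for i, written in enumerate(self.valid) if written)
        line = bytes(self.data)
        self.data = bytearray(self.size)
        self.valid = [False] * self.size
        return line, mask

# Assemble a header over several process operations, then write it once.
header = WriteCombiningLine(size=8)
header.write(0, b"\x01\x02")                # e.g. an opcode field
header.write(4, b"\xaa\xbb")                # e.g. a length field
line, mask = header.flush()
print(line.hex(), bin(mask))
```

The point of the model is that the two partial writes cost no memory transactions on their own; only the flush does.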

The next goal was to design the processor and measure what kind of clock rate we might achieve. I chose a loose encoding for the instruction fields, so that they directly control the enabling of a gate or the steering of a MUX and avoid decoding steps as much as possible. The processor is segmented into four or five stages, depending on chip memory speeds. Using a (Slow, Slow, Slow) corner, the most pessimistic wire loading, and even a supply voltage below the design value, we were able to substantially exceed 500 MHz on a 0.13 micron CMOS process. I think that we can do much better under normal conditions.

I made some effort to improve code density by allowing fields to cooperate under certain conditions. It is possible to fetch data from memory, process it through the ALU, and branch on a condition code all in one instruction, although the process and branch operations will stall until the memory access is completed. I am sensitive to the extra control store required for a 64-bit-wide instruction, and if necessary it is possible to switch to a 32-bit mode, but I don’t expect code sizes for these kinds of tasks to be greater than 32 Kbytes. Overall, the size of this processor is roughly 10K gates, which would enable a design to incorporate several of them on a die.
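
As a rough illustration of a three-operation instruction word and of the control store arithmetic, consider the sketch below. The field widths and names are purely hypothetical; the article only states that a 64-bit word carries a process, a memory, and a branch operation. The real encoding is loose and field-specific rather than three opaque integers.

```python
# Hypothetical layout: the widths are assumptions chosen to total 64 bits.
FIELDS = (("process", 24), ("memory", 24), ("branch", 16))

def pack(**values):
    """Pack the named fields into one instruction word, low field first."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word

def unpack(word):
    """Split an instruction word back into its fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields

def instructions_in_store(store_bytes, word_bits=64):
    """How many instructions of a given width fit in the control store."""
    return store_bytes // (word_bits // 8)

word = pack(process=0x123456, memory=0xABCDEF, branch=0x0FFF)
print(hex(word), unpack(word))
print(instructions_in_store(32 * 1024))      # 64-bit mode
print(instructions_in_store(32 * 1024, 32))  # 32-bit mode
```

At 64 bits per instruction, a 32 Kbyte control store holds 4096 instructions, so even a 12-bit branch target would reach all of it.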

In summary, after looking at using an existing embedded microprocessor for a networking core, I decided that there were enough negatives to exclude it from our design and to develop a custom processor that should easily outperform any network processor core derived from a general purpose processor. Its small size can lower overall cost and enable multiprocessor configurations. It is true that general purpose microprocessors have mature software tools, but I’d sacrifice that for a superior design. I invite the interested to visit http://home.att.net/~tekhknowledge/ where I have some block diagrams and a downloadable simulator and assembler. Feel free to contact me with questions. Note that this processor is not exactly the same as the one developed during my contract; I think that this new one is even better.