High-speed networks put a heavy load on network processors, so optimizing applications for these devices is an important area of research. Many network processors provide multiple processing chips, and it is up to the application developer to utilize the available parallelism. To fully exploit this power, one must be able to parallelize full end-to-end applications that may be composed of several less complex application kernels. This thesis presents a multi-threaded end-to-end application benchmark suite and a generic network processor simulator modeled after the Intel IXP1200. Using our benchmark suite, we evaluate how effectively network processors support end-to-end applications and how effectively various parallelization techniques take advantage of the network processor architecture. We show that kernel performance is an inaccurate indicator of end-to-end application performance and that relying on such data can lead to sub-optimal parallelization.

Chapter 1 Introduction
As available processing power has increased, devices that traditionally used Application Specific Integrated Circuit (ASIC) chips are beginning to use programmable processors in order to take advantage of their flexibility. This increase in flexibility has traditionally been gained at the sacrifice of speed. Network processors aim to bridge the gap between speed and flexibility by taking advantage of the benefits of both ASICs and general purpose processors. There is no single unifying characteristic that allows all network processors to accomplish this goal. However, there are several major strategies employed to bridge the gap: parallel processing, special-purpose hardware, memory structure, communication mechanisms, and peripherals [24]. Network processors have made it possible to deploy complex applications into the network at nodes that previously acted only as routers and switches.

High-speed networks put a heavy load on network processors, so optimizing applications for these devices is an important area of research. It is up to the application developer to utilize the parallelism available in network processors. Parallelization of kernels is often a trivial task compared to parallelization of end-to-end applications. In the context of this thesis, kernels are programs that carry out a single task. This task is of limited use in and of itself; however, multiple kernels can often be combined to provide a more useful solution. In the area of networking, kernels are programs such as MD5, URL-based switching, and AES, which are discussed in Chapter 4. These kernels are also applicable outside the area of networking; however, since the context of this thesis is networking, these kernels focus on packet processing.

An end-to-end application refers to a useful combination of kernels. The end-to-end application discussed in Chapter 5 makes use of the kernels in Chapter 4 by first calculating the MD5 signature of each packet, then determining its destination using URL-based switching, and finally encrypting it using AES. In the proposed scenario, the integrity of the packet could be verified and its payload decrypted at the destination node.

Our first contribution is the creation of a simulator that emulates a generic network processor modeled on the Intel IXP1200. Our simulator fills a gap in existing academic research by supporting multiple processing units. In this way, interaction between the six microengines of the Intel IXP1200 can be simulated. We chose to emulate the IXP1200 because it is a member of the commonly used Intel IXP line of Network Processing Units (NPUs).

Our second contribution is the construction of multi-threaded, end-to-end application benchmarks based on the NetBench [18] and MiBench [10] single-threaded kernels. Since network processors are capable of supporting complex applications, it is important to have benchmarks that fully utilize them. Existing benchmark suites make it difficult to research the properties of parallelized end-to-end applications since they are made up of single-threaded kernels. Our benchmarks have been designed to provide insight into the characteristics of end-to-end applications.

Our third contribution is an analysis of our multi-threaded, end-to-end application benchmarks on our network processor simulator. This analysis reveals characteristics of the kernels making up the end-to-end applications and of the end-to-end applications themselves, as well as insight into the strengths and weaknesses of network processors.

This thesis is organized as follows. In the next chapter we provide background and related work. In Chapter 3 we present our simulator. The kernels that make up our end-to-end application benchmark are presented in Chapter 4. Chapter 5 describes our testing methodology and our evaluation of how effectively network processors support end-to-end applications and how effectively various parallelization techniques take advantage of the network processor architecture. Finally, our conclusion is presented in Chapter 6.


Chapter 2 Related Work
As the size and capacity of the Internet continues to grow, devices within the network and at the network edge are increasing in complexity in order to provide more services. Traditionally, these devices have made use of ASICs which provide high performance and low ﬂexibility. NPUs bridge the gap between speed and ﬂexibility by taking advantage of the beneﬁts of both ASICs and general purpose processors. There is no single unifying characteristic that allows all network processors to accomplish this goal. However, there are several major strategies employed to bridge the gap: parallel processing, special-purpose hardware, memory structure, communication mechanisms, and peripherals [24]. Network processors have made it possible for the deployment of complex applications into the network at nodes that previously acted only as routers and switches.

2.1 Network Processors

NPU is a general term used to describe any processor designed to process packets for network communication. Another characteristic of NPUs is that their programmability allows applications deployed to them to access higher layers of the network stack than traditional routers and switches. The OSI reference model defines seven layers of network communication from the physical layer (layer 1) to the application layer (layer 7) [15]. NPUs are capable of supporting layer 7 applications, which have traditionally been reserved for desktop and server computers.

There are over 30 different self-identified NPUs available today [24]. These NPUs can be classified into two categories based on their processing element configuration: pipelined and symmetric. A processing element (PE) is a processor able to decode an instruction stream [24]. Pipelined configurations dedicate each PE to a particular packet processing task, while in symmetric configurations each PE is capable of performing any task [24]. Both of these configurations are capable of taking advantage of the inherent parallelism in packet processing. Pipelined architectures include Cisco PXF [25], EZChip NP1 [8], and Xelerator Network Processors [31]. Symmetric architectures include Intel IXP [6] and IBM PowerNP [1].

High-speed networks place high demands on the performance of NPUs. In order to prevent network communication delays, NPUs must process packets quickly and efficiently. Parallel processing through the use of multiple PEs is only one strategy used in NPUs to improve performance. Another strategy is to use special-purpose hardware to offload tasks from the PEs. Special-purpose hardware includes co-processors and special functional units. Co-processors are more complex than functional units. They may be attached to several PEs, memories, and buses, and they may store state. A co-processor can be advantageous to the programmer when implementing an application, but it can also dictate that the programmer use a specific algorithm in order to take advantage of that co-processor. Special functional units are used to implement common networking operations that are hard to implement efficiently in software yet easy to implement in hardware [24].

Since memory access can potentially waste processing cycles, NPUs often use multi-threading to utilize processing power efficiently. Dedicated hardware supports multi-threading, such as separate register banks for different threads and hardware units that schedule and swap threads with no overhead. Special units also handle memory management and the copying of packets from network interfaces into shared memory [24].

2.1.1 Intel IXP1200

The IXP1200 was designed to support applications requiring fast memory access, low-latency access to network interfaces, and strong processing of bit, byte, word, and longword operations. For processors, the IXP1200 provides a single StrongARM processor and six independent 32-bit RISC PEs called microengines. In effect, this is a single powerful processor coupled with six very simple, weaker engines for highly parallel computation. In addition, each microengine provides four hardware-supported threads with zero-overhead context switching. The StrongARM was designed to manage complex tasks and to offload specific tasks to individual microengines [6].

The StrongARM and microengines share 8 MBytes of SRAM for relatively fast accesses and 256 MBytes of SDRAM for larger memory space requirements (but slower accesses). There is also a scratch memory unit available to all processors consisting of 1 MByte of SRAM. The StrongARM has a 16 KByte instruction cache and an 8 KByte data cache, providing it with fast accesses on a small amount of data. Each microengine has a 1 KByte data cache and a large number of transfer registers. The IXP1200 platform does not provide any built-in memory management; therefore, application developers are responsible for maintaining the memory address space [6].

2.2 Network Processor Simulators

Simulators are often used to execute programs written to run on hardware platforms that are inconvenient or inaccessible to developers [28]. Simulators are also able to provide performance statistics such as cycle count, memory usage, bus bandwidth, and cache misses. These statistics enable developers to identify bottlenecks and tune applications to specific hardware configurations. Simulators are an important aspect of research in network processors due to the high cost and the wide variety of architectures found in current NPUs. High cost often makes cutting-edge NPUs inaccessible in academic research, although outdated NPUs are becoming more accessible. The wide variety of NPU architectures makes developing applications to run across multiple platforms difficult. Since simulators can potentially be configured to simulate multiple platforms, analysis of architectural differences can be performed.

2.2.1 SimpleScalar

SimpleScalar provides tools for developing cycle-accurate hardware simulation software that models real-world architecture [3]. We chose to use SimpleScalar because of its prevalence in architectural research. SimpleScalar takes as input binaries compiled for the SimpleScalar architecture and simulates their execution [3]. The SimpleScalar architecture is similar to MIPS, which is commonly found in NPU platforms such as the Intel IXP. A modified version of GNU GCC allows binaries to be compiled from FORTRAN or C into SimpleScalar binaries [3].

2.2.2 PacketBench

PacketBench is a simulator developed at the University of Massachusetts to provide exploration and understanding of NPU workloads [22]. PacketBench makes use of SimpleScalar ARM for cycle-accurate simulation [22]. PacketBench also emulates some of the functionality of an NPU by providing a simple API for sending and receiving packets and for memory management [22]. In this way, the underlying details of specific NPU architectures are hidden from the application developer. Although PacketBench is useful in characterizing workload, it does not provide simulation support for multiprocessor environments. Since NPUs make extensive use of parallelization, we chose not to use this tool.

2.3 Benchmarks

Benchmarks are applications designed to assess the performance characteristics of computer hardware architectures [27]. One approach is to use a single benchmark suite to compare the performance of several different architectures. Another approach is to compare the performance of different applications on a specific architecture. Benchmarks designed to mimic a particular type of workload are called synthetic benchmarks, while application benchmarks are real-world applications [27]. For the purposes of this thesis, our interest is in application benchmarks and, more specifically, representative benchmarks for the domain of NPUs.

2.3.1 MiBench

MiBench is a benchmark suite providing representative applications for embedded microprocessors [10]. Due to the diversity of the embedded microprocessor domain, MiBench is composed of 35 applications divided into six categories: Automotive and Industry Control, Network, Security, Consumer Devices, Oﬃce Automation, and Telecommunications. The Network and Security categories include Rijndael encryption, Dijkstra, Patricia, Cyclic Redundancy Check (CRC), Secure Hash Algorithm (SHA), Blowﬁsh, and Pretty Good Privacy (PGP) algorithms. The Telecommunications category consists of mostly signal processing algorithms, while the other categories are not relevant to this discussion. All MiBench applications are available in standard C source code allowing them to be ported to any platform with compiler support.

NetBench is a benchmarking suite consisting of a representative set of network applications likely to be found in the network processor domain. These applications are split into three categories: micro, IP, and application. The micro level includes the CRC-32 checksum calculation and the table lookup routing scheme. IP-level programs include IPv4 routing, Deficit Round Robin (DRR) scheduling, Network Address Translation (NAT), and the IPCHAINS firewall application. Finally, the application level includes URL-based switching, Diffie-Hellman (DH) encryption for VPN connections, and Message-Digest 5 (MD5) packet signing [18]. Although CommBench and NetBench offer good representations of typical network applications, they are both limited to single-threaded environments. Our work builds on the NetBench suite by parallelizing several NetBench applications.

2.4 Application Frameworks

Application framework is a widely used term referring to a set of libraries and a standard structure for implementing applications for a particular platform [26]. Application frameworks often promote code reuse and good design principles. Several frameworks for NPUs are available in academia, each offering various benefits to application developers. NPU vendors also provide frameworks specific to their architectures, such as the Intel IXA Software Development Kit [5]. One key advantage of academic frameworks is the possibility that they will be able to support multiple architectures, thus enabling developers to design and implement applications independent of a specific architecture. Unfortunately, of the NPU-specific frameworks surveyed in this thesis, only NEPAL currently realizes cross-platform support. The others are currently striving to meet this goal.

2.4.1 Click

Click is an application development environment designed to describe networking applications [13]. Applications implemented using Click are assembled by combining packet processing elements. Each element implements a simple autonomous function. The application is described by building a directed graph with processing elements at the nodes and packet ﬂow described using edges. Click supports multi-threading but has not been extended to multiprocessor architectures. The modularity of Click applications gives insight into their inherent concurrency and allows alterations in parallelization to be made without changing functionality.

2.4.2 NP-Click

NP-Click is based upon Click and designed to enable application development on NPUs without requiring in-depth understanding of the details of the target architecture [19]. NP-Click oﬀers a layer of abstraction between the developer and the hardware through the use of a programming model. The code produced using the NP-Click programming model has been shown to run within 10% of the performance of hand-coded solutions while signiﬁcantly reducing development time [19]. The current implementation of NP-Click targets only the Intel IXP1200 network processor although a goal of this project is to support other architectures.


2.4.3 NEPAL

The Network Processor Application Language (NEPAL) is a design environment for developing and executing module-based applications for network processors [17]. In a similar fashion to Click, application development takes place by defining a set of modules and a module tree that defines the flow of execution and communication between modules. The authors verified the platform independence of NEPAL using a customized version of the SimpleScalar ARM simulator for multiprocessor architectures. They provide performance results for two simulated NPUs modeled after the IXP1200 [6] and Cisco Toaster [25].

2.4.4 NetBind

NetBind is a binding tool for dynamically constructing data paths in NPUs [4]. Data paths are made up of components performing simple operations on packet streams. NetBind modiﬁes the machine code of executable components in order to combine them into a single executable at run-time. The current implementation of NetBind speciﬁcally targets the IXP1200 network processor, although it could be ported to other architectures in the future.


Chapter 3 The Simulator
The simulator developed for this work was built on the SimpleScalar tool set [3]. SimpleScalar provides tools for developing cycle-accurate hardware simulation software that models real-world architecture. This simulation tool set was chosen because of its prevalence in architectural research. For this work, we modiﬁed an existing simulator with support for multiple processors in order to create a generic network processor simulator modeled after the Intel IXP1200. We chose to model the IXP1200 because it is a member of the commonly used Intel IXP line of NPUs.

3.1 Processing Units

Table 3.1. Processor Parameters

The simulator includes a single main processor and six auxiliary processors, each supporting up to four concurrent threads. This configuration corresponds to the StrongARM core processor and accompanying microengines on the IXP1200. The StrongARM core is represented by an out-of-order processor. The microengines are represented by single-issue in-order processors. Since each microengine must support four threads with zero-overhead context switching [6], the simulator creates one single-issue in-order processor for each microengine thread. When a single-issue in-order processor is created, it is given the number of threads allocated to its physical microengine so that it knows to execute every n cycles, where n is the number of threads on the microengine. The total number of required threads is specified on the command line when the simulator is run; therefore, unused threads are not created.
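The every-n-cycles rule can be illustrated with a minimal sketch in C. The function name thread_executes and the use of a per-microengine slot index are assumptions made for illustration; the simulator's actual scheduling code is not reproduced here and may select the executing thread differently.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the simulator): decides whether the
 * in-order processor that models one microengine thread executes on a
 * given cycle.
 *   slot      - assumed index of the thread within its microengine (0..n-1)
 *   n_threads - number of threads allocated to that microengine (n)
 *   cycle     - global simulation cycle counter
 * Each thread executes once every n_threads cycles, so two threads of the
 * same microengine never execute in the same cycle. */
bool thread_executes(unsigned slot, unsigned n_threads, unsigned long cycle)
{
    if (n_threads == 0 || slot >= n_threads)
        return false;                    /* no such thread on this engine */
    return (cycle % n_threads) == slot;  /* this thread's turn? */
}
```

Under this assumption, a microengine with one thread executes on every cycle, while a microengine with four threads gives each thread every fourth cycle.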

3.2 Memory Structure

The StrongARM and microengines share 8 MBytes of SRAM and 256 MBytes of SDRAM [6]. There is also a scratch memory unit consisting of 1 MByte of SRAM. These memory units are represented in the simulator using a single DRAM unit. Separate DRAM caches back these memory units for the StrongARM and microengines. The StrongARM has a 16 KByte instruction cache and an 8 KByte data cache that are backed by SRAM [6]. Each microengine has a 1 KByte data cache and an unlimited instruction cache. The microengines are given an unlimited instruction cache in order to mimic the behavior of the large number of transfer registers associated with each microengine on the IXP1200. Since the number of simulated registers cannot exceed the number of physical registers on the host architecture, we determined this to be the best option available.

Since the IXP1200 is capable of connecting with any number of network devices through its high-speed 64-bit IX bus interface, the amount of delay incurred to fetch a packet could vary greatly. For the purposes of this simulator, network delay is not important, and it is assumed that the next packet is available as soon as the application is ready to receive it. In order to imitate this behavior, a large chunk of the DRAM unit is allocated as "network" memory and is backed by a no-penalty cache object available to all processors.

The simulator does not provide any built-in memory management; therefore, application developers are responsible for maintaining the memory address space. The simulator assigns address ranges to each of the memory units. SRAM is dedicated to the call stack, and DRAM is divided into ranges for text, global variables, and the heap.

3.3 Methods of Use

The simulator compiles on Linux using GCC 3.2.3 into an executable called sim3ixp1200. The simulator takes a list of arguments that modify architectural defaults and indicate the location of a SimpleScalar native executable and any arguments that should be passed to it. Its use can be expressed as:

    sim3ixp1200 [-h] [sim-args] program [program-args]

The -h option lists available simulator arguments of the form -Parameter:value. These arguments can modify aspects of the simulation architecture including the number of PEs, threads, cache specifications, and memory unit specifications. Default values for each available parameter are based on the IXP1200 architecture. The most important parameter for this work was Threads, which controls the number of microengine threads made available to the SimpleScalar application. Threads can be any value between 0 and 24 inclusive. Zero threads indicates that the microengines will not be used and therefore the application will execute only on the StrongARM processor. When the number of threads is greater than zero, they are allotted to the six possible microengines using a round-robin scheme so that the threads are distributed as evenly as possible. For instance, if 8 threads are requested, then 4 microengines will run 1 thread and 2 microengines will run 2 threads.
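The round-robin allotment can be sketched in a few lines of C. This is an illustration under stated assumptions rather than the simulator's own code: the function name allot_threads is hypothetical, and the choice that the first thread goes to microengine 0 is assumed.

```c
#define NUM_MICROENGINES 6
#define MAX_THREADS      24

/* Hypothetical helper: fills counts[i] with the number of threads placed
 * on microengine i when `threads` threads are dealt out round-robin.
 * Returns 0 on success, or -1 if `threads` lies outside the 0..24 range
 * accepted by the Threads parameter. */
int allot_threads(int threads, int counts[NUM_MICROENGINES])
{
    int i, t;

    if (threads < 0 || threads > MAX_THREADS)
        return -1;

    for (i = 0; i < NUM_MICROENGINES; i++)
        counts[i] = 0;

    /* Deal threads out one at a time; assumed to start at microengine 0. */
    for (t = 0; t < threads; t++)
        counts[t % NUM_MICROENGINES]++;

    return 0;
}
```

With 8 threads this yields two microengines running 2 threads and four running 1 thread, matching the example above. Because each microengine thread executes every n cycles, the two doubly loaded microengines give each of their threads half the cycles of a thread that has a microengine to itself.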

3.4 Application Development

The applications developed for this work were written in C and compiled using a GCC 2.7.2.3 cross-compiler. A cross-compiler runs on one computer architecture and produces code that is executable on another architecture. For this work, the host architecture was Linux/x86 and the target architecture was SimpleScalar PISA (a MIPS-like instruction set).

Since the simulator does not support POSIX threads, developing multi-threaded applications follows a different model. Instead of the main process spawning child threads, the same application code is automatically executed in each simulator thread. In order for the application code to distinguish which thread it is running in, the simulator makes available a function called getcpu() that returns an integer. Despite its name, this function returns the thread identifier, not the CPU identifier. Code that is meant to run in a particular thread must be isolated in an if block that tests the return value from getcpu(). This function incurs a penalty of one cycle, but it is typically called only once, with its value stored in a local variable during the initialization of the application. A global variable called ncpus is automatically made available by the simulator and populated with the number of threads.

It is often necessary in application development to require all threads to reach a particular point before any thread is allowed to proceed. This is accomplished using another function made available by the simulator called barrier(). A call to barrier() requires one cycle for the function call, but induces no penalty while a thread waits.

The simulator reports statistics on the utilization of each hardware unit at the end of each execution. For each PE this includes cycle count, instruction count, and fetch stalls. For each memory unit this includes hits, misses, reads, and writes. In addition, the simulator provides a function called PrintCycleCount() that can be used at any time to print the cycle count of the current thread to standard error and standard output. This function is useful when an application has an initialization process that should not count towards the total cycle count. By making a call to PrintCycleCount() at the beginning and end of a block of code, the total cycle count for that block can be determined by analyzing the output. When the developer requires that the application make some calculation based on cycle count, the function GetCycles(), which returns an integer, can be used. Both of these functions induce a penalty of one cycle for the call, but no penalty for their execution.
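The programming model described in this section can be sketched as follows. The names getcpu(), ncpus, and barrier() come from the simulator, but their exact prototypes are assumptions, and the stand-in definitions exist only so that the sketch compiles on an ordinary host. The names host_thread_id, results, and thread_body, the idea that thread 0 performs initialization, and the strided partitioning of work are all hypothetical.

```c
/* Host-side stand-ins. On the simulator these symbols are supplied by the
 * runtime; the prototypes below are assumed, not taken from its headers. */
int host_thread_id = 0;              /* stand-in state, not a simulator symbol */
int ncpus = 4;                       /* simulator: populated with thread count */
int getcpu(void) { return host_thread_id; }  /* returns the thread identifier */
void barrier(void) { /* simulator: blocks until every thread arrives */ }

#define NUM_ITEMS 16
int results[NUM_ITEMS];              /* shared between all threads */

/* The same body is executed by every simulator thread. */
void thread_body(void)
{
    int id = getcpu();               /* call once and keep in a local */
    int i;

    if (id == 0) {                   /* code isolated to a single thread */
        for (i = 0; i < NUM_ITEMS; i++)
            results[i] = 0;
    }
    barrier();                       /* nobody proceeds until init is done */

    /* Strided partition: thread id handles items id, id + ncpus, ... */
    for (i = id; i < NUM_ITEMS; i += ncpus)
        results[i] = i * i;

    barrier();                       /* all work finished past this point */
}
```

On a host machine the sketch can be exercised by calling thread_body() once for each value of host_thread_id from 0 to ncpus - 1; on the simulator no such loop would be needed, since every thread runs the same code automatically.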


Chapter 4 Benchmark Applications
Previous research in the area of network processors has focused on exploring their performance characteristics by running individual applications in isolation and in a single threaded environment. Network processors are capable of supporting more complex applications that guide packets through a series of applications running in parallel. For the ﬁrst stage of this work we ported three typical network applications to our simulator: MD5, URL-switching, and Advanced Encryption Standard (AES). This process involved modifying memory allocations to use appropriate simulator address space and reorganizing each application to take advantage of multiple threads. For the second stage of this work we combined these three applications into three types of end-to-end applications: shared, static, and dynamic. These distinctions refer to three diﬀerent ways of utilizing the available threads.


4.1 Message Digest

The MD5 algorithm [23] creates a 128-bit signature of an arbitrary length input message. Until recently, it was believed to be infeasible to produce two messages with the same signature or to produce the original message given a signature. However, in March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated [14] that two valid X.509 certiﬁcates [11] could be created with identical MD5 signatures. Although more robust algorithms exist, MD5 is still extensively used in public-key cryptography and in verifying data integrity. Our implementation of MD5 was adopted from the NetBench suite of applications for network processors [18, 16]. The NetBench implementation was designed to process packets in a serial fashion utilizing a single thread. The multi-threaded, multiprocessor nature of NPUs is better utilized by processing packets in parallel. In order to analyze the performance characteristics in this environment, our implementation of MD5 oﬄoads the processing of packets to available microengine threads. In this way, the number of packets processed in parallel is equal to the number of microengine threads. As shown in Figure 4.1, the StrongARM processor is responsible for accepting incoming packets and distributing them to idle microengine threads. Communication between the StrongARM and microengines is done through the use of semaphores. When the StrongARM ﬁnds an idle thread, it copies a pointer to the current packet and the length of the current packet to shared memory locations known by both the StrongARM and the thread. The StrongARM then sets a semaphore that triggers the thread to begin executing. When all packets have been processed, the StrongARM waits for each thread to become idle, then notiﬁes them to exit before exiting itself.


Figure 4.1. MD5 - StrongARM Algorithm

Each microengine thread proceeds as shown in Figure 4.2. It waits until its semaphore has changed, then either exits or copies the current packet to its stack before processing it to generate a 128-bit signature. It then resets its semaphore and returns to waiting.
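The handshake between the StrongARM and the microengine threads can be sketched in C as a set of per-thread mailboxes in shared memory. This is a minimal illustration, not the benchmark source: the structure layout, the semaphore encoding (SEM_IDLE, SEM_WORK, SEM_EXIT), and every function name are assumptions, and the placeholder sum_bytes() stands in for the MD5 computation.

```c
#include <stddef.h>

#define MAX_WORKERS 24

/* Assumed semaphore encoding; the thesis does not give the actual values. */
enum { SEM_IDLE = 0, SEM_WORK = 1, SEM_EXIT = 2 };

/* One mailbox per microengine thread, at a location known to both sides. */
struct mailbox {
    volatile int sem;             /* semaphore polled by the thread */
    const unsigned char *packet;  /* pointer to the current packet */
    size_t length;                /* length of the current packet */
};

struct mailbox boxes[MAX_WORKERS];   /* zero-initialized: all idle */

/* StrongARM side: hand a packet to the first idle thread.
 * Returns the thread index, or -1 if every thread is busy. */
int dispatch_packet(int n_workers, const unsigned char *pkt, size_t len)
{
    int i;
    for (i = 0; i < n_workers && i < MAX_WORKERS; i++) {
        if (boxes[i].sem == SEM_IDLE) {
            boxes[i].packet = pkt;
            boxes[i].length = len;
            boxes[i].sem = SEM_WORK;  /* set last, after the data is ready */
            return i;
        }
    }
    return -1;
}

/* Microengine side: one pass of the wait loop.
 * Returns 1 if a packet was processed, 0 if idle, -1 if told to exit. */
int worker_poll(int id, void (*process)(const unsigned char *, size_t))
{
    struct mailbox *box = &boxes[id];

    if (box->sem == SEM_EXIT)
        return -1;
    if (box->sem != SEM_WORK)
        return 0;
    process(box->packet, box->length); /* the real thread copies the packet
                                          to its stack and computes MD5 */
    box->sem = SEM_IDLE;               /* reset semaphore, back to waiting */
    return 1;
}

/* Placeholder for the MD5 routine: sums the packet bytes. */
unsigned long byte_sum = 0;
void sum_bytes(const unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        byte_sum += p[i];
}
```

On the simulator each thread would call worker_poll() in a loop until it returns -1, while the StrongARM would retry dispatch_packet() until an idle thread is found. Writing the semaphore after the pointer and length keeps a thread from observing a half-filled mailbox, assuming writes become visible in program order.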

4.2 URL-Based Switch

URL-based switching directs network traffic based on the Uniform Resource Locator (URL) found in a packet. Other terms for URL-based switch include Layer 7 switch, content-switch, and web-switch. The purpose of switching based on Layer 7 content is to realize improved performance and reliability of web-based services. A Layer 4 switch located in front of a cluster of servers can control how each Transmission Control Protocol (TCP) connection is established on a per-connection basis. How requests are directed within a connection is out of reach of a Layer 4 switch. Traffic can be managed per request, rather than per connection, by a URL-based switch [12].

Figure 4.2. MD5 - Microengine Algorithm

In order to manage requests, a URL-switch acts as the end point of each TCP connection and establishes its own connections to the servers containing the content requested by the client. It then relays content to the client. In this way, the switch can perform load balancing and fault detection and recovery. For instance, if one server is overloaded or unreachable, the switch can send its request to a different server.

Our URL-based switching algorithm is based on the implementation found in NetBench [18, 16]. The algorithm searches the contents of a packet for a list of matching patterns. Each pattern has an associated destination that can be used to switch the packet or begin another process. The focus of our URL-based switch is the pattern matching algorithm. Unlike our implementation of MD5, our URL-based switch does not utilize parallelism by processing multiple packets at once; instead, it uses multiple threads to process each packet. The data structure used to store patterns is a list of lists. Each element of the primary list is made up of a secondary list and the largest common substring of the patterns in the secondary list. The algorithm proceeds as shown in Figures 4.3 and 4.4.

Figure 4.3. URL - StrongARM Algorithm

Each packet received by the StrongARM is copied to the stack and run through the Internet checksum algorithm to verify its integrity. For each element in the list of largest common substrings, the StrongARM copies the element's secondary list pointer to a shared memory location known by an idle microengine thread. A pointer to the current packet is also copied to shared memory, and then the idle microengine's semaphore is set to notify it to begin executing.

Figure 4.4. URL - Microengine Algorithm

The microengine thread first copies the packet to its stack and then uses a Boyer-Moore search function to determine whether the packet contains the largest common substring. If this test is positive, then the thread proceeds to search for a matching pattern in the secondary list. Otherwise, the microengine resets its semaphore and returns to an idle state. If the thread finds a matching pattern, it sets its semaphore to reflect this before returning to an idle state. The StrongARM continues until it reaches the end of the primary list or until a thread finds a matching pattern; it then processes the next packet.
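The list-of-lists pattern store and the two-stage search performed by a microengine thread can be sketched in C. All type and function names here are hypothetical, the standard library function strstr() is swapped in for the Boyer-Moore search used by the benchmark, and the payload is assumed to be a NUL-terminated string, which real packet data generally is not.

```c
#include <stddef.h>
#include <string.h>

/* One entry of a secondary list: a pattern and its destination. */
struct pattern {
    const char *text;         /* pattern to look for in the packet */
    int destination;          /* where a matching packet is switched */
};

/* One element of the primary list: the largest common substring of a
 * group of patterns, plus the secondary list holding those patterns. */
struct pattern_group {
    const char *common;              /* largest common substring */
    const struct pattern *patterns;  /* secondary list */
    size_t count;                    /* entries in the secondary list */
};

/* Work done by one microengine thread for one primary-list element.
 * Returns the destination of the first matching pattern, or -1 if the
 * common substring or every pattern in the group is absent.
 * strstr() stands in for Boyer-Moore; payload must be NUL-terminated. */
int search_group(const struct pattern_group *g, const char *payload)
{
    size_t i;

    /* Stage 1: cheap filter on the largest common substring. */
    if (strstr(payload, g->common) == NULL)
        return -1;

    /* Stage 2: search the secondary list only when the filter passes. */
    for (i = 0; i < g->count; i++) {
        if (strstr(payload, g->patterns[i].text) != NULL)
            return g->patterns[i].destination;
    }
    return -1;
}
```

For example, a group with common substring "/images/" and patterns "/images/a.png" and "/images/b.png" is skipped entirely for a request for "/index.html", since the first stage fails. Because the StrongARM hands each primary-list element to a different idle thread, several groups can be searched against the same packet at once.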

4.3 Advanced Encryption Standard

AES is an encryption standard adopted by the US government in 2001 [20]. The standard was proposed by Vincent Rijmen and Joan Daemen under the name Rijndael [7]. AES is a block cipher encryption algorithm that uses 128-, 192-, or 256-bit keys. The algorithm is known to perform efficiently in both hardware and software.

Our implementation of AES is based on the Rijndael algorithm found in the MiBench embedded benchmark suite [10, 21]. In much the same way that our MD5 algorithm processes packets in parallel, our AES algorithm offloads the encryption of packets to microengine threads. The encryption is performed using a 256-bit key that is loaded into each thread's stack during startup. This algorithm executes on the simulator in the same manner as MD5 above (Figures 4.1 and 4.2).


Chapter 5 Results
In order to evaluate the effectiveness of NPUs to support multi-threaded end-to-end applications and the effectiveness of various parallelization techniques to take advantage of the NPU architecture, we performed four types of tests: Isolation, Shared, Static, and Dynamic. The Isolation tests establish a baseline and explore application behavior on the multi-threaded NPU architecture. The Shared tests explore how each application is affected by the concurrent execution of other applications. The Static tests reveal characteristics of an end-to-end application and how to best distribute threads. Finally, the Dynamic tests serve to compare an on-demand thread allocation algorithm to statically allocated threads.

5.1 Isolation Tests

The purpose of the Isolation tests is twofold: to establish a baseline for subsequent tests and to explore the effects of multi-threading on the NPU. The Isolation tests consisted of independent tests for each application. For each independent test, the number of microengine threads available to the application was varied between 1 and 24, since the simulator supports up to 24 threads. A data point was also gathered for the serial version of each application, in which no microengine threads were used.

5.1.1 MD5

Figure 5.1. MD5 Isolated Speedup on 1000 Packets

Test results in Figure 5.1 show that parallelization of the MD5 algorithm offers significant speedup compared to its serial counterpart. The data point at zero threads represents the serial version of MD5 executed on the StrongARM processor. The data point at 1 thread represents the multi-threaded version making use of the StrongARM and a single microengine. This case is slower than the serial version because of the overhead involved in communication between the StrongARM and the microengine and because the microengine offers less processing power than the StrongARM. As the number of threads increases, the combined processing power of the microengines outweighs the communication overhead.


The slope of the speedup graph in Figure 5.1 decreases suddenly at 7, 13, and 19 threads. These changes can be attributed to the fact that there are 6 microengines. With up to 6 threads, each microengine is responsible for at most a single thread. From 7 to 12 threads, each microengine is burdened with up to 2 threads. Similarly, as the number of threads increases to 24, each microengine is burdened with 3 and then 4 threads, causing the speedup curve to flatten.
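The breakpoints at 7, 13, and 19 threads follow directly from spreading threads over 6 microengines. The short sketch below assumes threads are distributed as evenly as possible; the function name is illustrative and does not come from the simulator.

```python
import math

MICROENGINES = 6  # number of microengines in the simulated NPU


def max_threads_per_engine(threads: int, engines: int = MICROENGINES) -> int:
    """Largest number of threads any one microengine must run,
    assuming threads are spread as evenly as possible."""
    return math.ceil(threads / engines)


# Thread counts at which the per-microengine load steps up by one.
breakpoints = [
    n for n in range(2, 25)
    if max_threads_per_engine(n) > max_threads_per_engine(n - 1)
]
```

Under this assumption the per-engine load steps up exactly at the thread counts where the slope in Figure 5.1 changes.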

5.1.2 URL

Figure 5.2. URL Isolated Speedup on 100 Packets (non-polling)

Although test results for the parallelization of URL show improvements over the serial version, characteristics of the algorithm limited speedup. As stated in the previous chapter, the URL algorithm is parallelized in such a way that multiple threads work together to process each packet. Each thread is responsible for searching the packet for a particular set of patterns, and the first match preempts further execution. The drawback of this algorithm is that, since only one thread will find a match, the other threads do work that in hindsight is unnecessary.


Figure 5.3. URL Isolated Speedup on 100 Packets (polling)

This in itself would not be detrimental to the application’s performance except that all threads are vying for a limited number of shared resources. We developed two variations of the URL algorithm in an attempt to minimize the cycles spent searching false leads. The first version allows each thread to run to completion after a matching pattern is found. Once a thread reports to the StrongARM that a match has been found, the StrongARM stops spawning new threads and simply waits for the active threads to finish, although their processing is immaterial. In the alternative approach, when a match is found, the StrongARM sets a global flag that is constantly polled by each thread. When a thread detects that the flag has changed, it stops executing.

Although it was expected that the polling version of URL would perform better, it actually performed slightly worse than the non-polling version. As shown in Figures 5.2 and 5.3, the highest speedup attained by non-polling was 1.75 and by polling 1.64. Analysis of the application’s output shows that a matching pattern is found in only about 40% of the trace packets, thus polling is unable to preempt execution 60% of the time. The difference in speedup arises because the polling version still does unnecessary work 60% of the time and because polling itself wastes too many cycles. In both versions of URL, the speedup drops off after reaching a maximum between 4 and 6 threads. This indicates that contention for shared resources becomes a problem after this point.
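The difference between the two URL variants can be illustrated with a small sketch. It is a simplification under stated assumptions, not the thesis implementation: `search_patterns` is a hypothetical name, a `threading.Event` stands in for the global flag in shared memory, and the flag is polled once per pattern rather than constantly.

```python
import threading
from typing import List, Optional, Tuple


def search_patterns(
    packet: str,
    patterns: List[str],
    stop_flag: Optional[threading.Event] = None,
) -> Tuple[Optional[str], int]:
    """Search packet for each pattern; return (match, patterns_checked).

    With stop_flag=None this is the non-polling variant: the thread
    always runs to completion. Otherwise the flag is polled before
    every pattern and the search is abandoned once the flag is set.
    """
    checked = 0
    for pattern in patterns:
        if stop_flag is not None and stop_flag.is_set():
            return None, checked  # another thread already found a match
        checked += 1
        if pattern in packet:
            return pattern, checked
    return None, checked


packet = "GET /index.html HTTP/1.0"
patterns = ["a.gif", "b.gif", "c.gif"]  # made-up patterns, none match

match_found = threading.Event()
match_found.set()  # pretend another thread has just reported a match

non_polling = search_patterns(packet, patterns)
polling = search_patterns(packet, patterns, match_found)
```

When the flag is set, the polling variant stops before checking any pattern, while the non-polling variant checks all three. When no thread ever finds a match, the flag is never set and every poll is pure overhead, which is consistent with the slightly lower speedup reported for the polling version.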

5.1.3 AES

Figure 5.4. AES Isolated Speedup on 100 Packets

Speedup tests on AES show that this algorithm performs poorly when offloaded to the microengines. The AES encryption algorithm requires that each packet be read and processed 16 bytes at a time. State is maintained for the lifetime of each packet in an accumulator that is made up of the encryption key and state variables. In addition, a static lookup table of 8 Kbytes is required. The L1 data cache for the StrongARM is 8 Kbytes compared to 1 Kbyte for the microengines. Due to the limited size of the microengine caches, AES suffers from substantial cache misses. Processing of each packet consumes roughly 1.36 million simulator cycles when encryption is performed on the StrongARM. The same process consumes roughly 11.4 million simulator cycles on a microengine thread when it is the only microengine thread running. This is an increase by a factor of 8.4. In contrast, MD5 consumes roughly 0.518 million cycles on the StrongARM and 0.922 million cycles on a single microengine thread. This results in an increase by a factor of roughly 1.8. Thus, AES requires a substantially higher increase in cycles when moving from the StrongARM to a microengine thread.

Figure 5.4 shows that although performance on the microengine threads is far worse than the serial version, it remains relatively constant as the number of threads increases. Therefore, the poor performance of AES on the microengine threads is primarily a result of processing power and cache size, not of memory contention between threads, which would instead cause the speedup to tail off.

5.1.4 Isolation Analysis

These tests reveal general characteristics of each kernel on both the StrongARM and microengines. MD5 has been shown to offer strong speedup on microengine threads using conventional parallelization. URL, using an alternative approach to multi-threading, has been shown to provide maximum speedup between 4 and 6 threads employing either a polling or non-polling scheme. Finally, AES reveals an algorithm with poor performance on the microengines that cannot be overcome by multi-threading.


5.2 Shared Tests

The purpose of the Shared tests is to determine how sensitive each kernel is to the concurrent execution of the other kernels. For these tests we ran all three kernels on the simulator at the same time. The StrongARM served as the controller, passing incoming packets to available microengine threads. We ran one test for each kernel, in which the number of threads available to the kernel under test was varied while the threads available to the other kernels remained constant. Our baseline for each of these tests was 1 thread for MD5, 4 threads for URL, and 1 thread for AES. This baseline was chosen because running URL with fewer than 4 threads was found to cause a significant bottleneck. The number of threads available to the kernel under examination was increased for each subsequent run. Each kernel processed a separate packet stream until the kernel under test completed the desired number of packets, in this case 50.

Figure 5.5 shows the speedup results from all three tests on the same graph, revealing the relative speedup of each kernel. Clearly, MD5 and AES have much greater speedup than URL, indicating they are less sensitive to the concurrent processing of other kernels. However, it is more interesting to compare the Shared speedup of each kernel with its Isolated speedup. This comparison is covered in the following subsections.

5.2.1 MD5

Figure 5.5. Shared Speedup on 50 Packets

The speedup results of MD5 in the Isolation and Shared tests, shown in Figures 5.1 and 5.5 respectively, show few differences. The slope of each graph is approximately the same and both peak near a speedup of six. This indicates that MD5 is not substantially affected by the concurrent execution of URL and AES. The lightweight nature of MD5 with regard to memory is the most likely explanation for this behavior.

Figure 5.6 compares the MD5 Isolation and Shared tests with regard to the number of cycles consumed by the StrongARM while 50 packets are processed, revealing that more cycles are required to process the same packet stream when MD5 is sharing the resources of the NPU. The horizontal axis corresponds to the number of MD5 threads employed to process the packets, while the vertical axis corresponds to the number of cycles spent processing the packet stream. Because the Shared tests allocate 4 threads to URL and 1 to AES, these threads cause contention for access to shared resources and therefore higher cycle counts than in the Isolation tests.


Figure 5.6. MD5 Isolated vs. Shared Cycles on 50 Packets

5.2.2 URL

Although the Shared speedup of URL shown in Figure 5.5 steadily increases, its maximum of 1.17 with 22 threads does not match the Isolation speedup shown in Figure 5.2, which peaks at 1.75 and degrades to 1.41 with 22 threads. This indicates that URL is affected by the concurrent execution of other applications due to its memory access requirements.

5.2.3 AES

The Shared speedup of AES shown in Figure 5.5 is an order of magnitude greater than the Isolation speedup shown in Figure 5.4. This high speedup is due to the fact that the baseline for this test performed extremely poorly. This can be attributed to two characteristics of the AES kernel. Firstly, as shown in the Isolation tests, AES performs poorly on the microengines due to their lack of processing power and the limited size of their cache. Secondly, since the StrongARM is the controller for all three kernels, it continuously monitors all of the microengine threads and distributes incoming packets as necessary. In the baseline, the StrongARM has to monitor one thread for each kernel, thus only one-third of its time is spent monitoring the AES thread. Therefore, the AES thread occasionally finishes processing a packet and wastes idle cycles waiting for the StrongARM to send it another packet. As more threads are allocated to AES, the StrongARM spends a larger percentage of time monitoring AES threads, thereby increasing throughput.

5.2.4 Shared Analysis

The Shared tests reveal that MD5 and AES are relatively insensitive to the concurrent execution of the other kernels on a single NPU. URL, however, is sensitive, and its speedup suffers when it is run alongside the other kernels.

5.3 Static Tests

The Static tests were designed to reveal characteristics of the end-to-end application, such as the location of bottlenecks and the ideal thread configuration. The testing process was similar to that of the Shared tests. The difference is that, instead of processing independent packet streams, the applications worked together to process a single packet stream. Each incoming packet was processed first by MD5, then by URL, and finally by AES. This scenario represents a possible end-to-end application running on an NPU, as shown in Figure 5.7. The purpose of this application is to distribute sensitive information from a trusted internal network through the Internet to a variety of hosts. Each packet is received by the application from the internal network; the application calculates its MD5 signature, determines its destination based on a deep inspection of the packet, and then encrypts it. Finally, the encrypted packet along with its signature is sent to a host machine, although this step is not included in the simulated application. To complete this scenario, the host machine would decrypt the packet and verify that the contents were not modified in transit by comparing the included signature to a newly generated one. This is also not included in the simulation.

Figure 5.7. End-to-End Application Scenario

Figure 5.8. Optimization with Static Allocation of Threads

For these tests, the number of threads allocated to each stage of the end-to-end application is static for each run. Once again, the baseline test is 1 thread for MD5, 4 threads for URL, and 1 thread for AES. Each subsequent test increases the number of threads by one and attempts to determine the optimal configuration. The optimal configuration is determined by giving the additional thread to each of the applications in turn and observing which configuration yields the best speedup. This configuration is then used as a starting point for the subsequent test.

Figure 5.8 shows the resulting optimal configurations for each number of available threads between 6 and 24. These configurations were found through test runs of 50 packets. MD5 never became a bottleneck, and 1 thread remained sufficient throughout the tests. URL and AES almost evenly split the remaining threads, with a final configuration of 12 threads for AES, 11 for URL, and 1 for MD5. These results show that the demands of AES and URL are similar and parallelization offers increased performance for these applications, while the simplicity of MD5 makes parallelization of it in the context of this end-to-end application unnecessary.

The above discovery reveals an interesting characteristic of this end-to-end application. Although MD5 provided the best speedup in the Isolation tests, parallelizing it in the Static tests resulted in less performance improvement than further parallelization of the other applications. This can be explained by Amdahl’s Law [2], which states that the overall speedup achievable from the improvement of a proportion of the required computation is affected by the size of that proportion. If P is the proportion and S is the speedup of that proportion, Amdahl’s Law states that the overall speedup will be:


\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - P) + \frac{P}{S}} \]

In this end-to-end application, the computation required to perform MD5 is a small proportion of the overall computation. Consequently, overall speedup benefits more from increased parallelization of URL and AES. It is also interesting to note that although AES did not benefit from additional microengines during the Isolation tests (Figure 5.4), in the high-load context of this end-to-end application additional AES threads benefit overall performance. Figure 5.8 also shows that initially more threads were allocated to URL, and after 14 threads more threads were allocated to AES. Since URL is required to finish processing each packet before it can be sent to AES, URL caused more of a bottleneck when it had fewer than 10 threads. After that point, AES required 4 threads to every 1 for URL in order to keep pace.
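The incremental search used to build the Static configurations amounts to a greedy procedure, which can be sketched as follows. This is an illustration only: `measure` stands for a 50-packet simulator run that returns a cycle count, and `toy_cycles`, with its per-packet costs, is an invented stand-in rather than measured data.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def greedy_allocation(
    baseline: Config,
    max_threads: int,
    measure: Callable[[Config], float],
) -> List[Config]:
    """Add one thread at a time to whichever kernel lowers the
    measured cycle count the most; return every chosen configuration."""
    current = dict(baseline)
    chosen = [dict(current)]
    while sum(current.values()) < max_threads:
        candidates = []
        for kernel in current:
            trial = dict(current)
            trial[kernel] += 1  # give the extra thread to this kernel
            candidates.append((measure(trial), kernel))
        _, best_kernel = min(candidates)
        current[best_kernel] += 1
        chosen.append(dict(current))
    return chosen


def toy_cycles(config: Config) -> float:
    """Invented stand-in for a simulator run: a pipeline is paced by
    its slowest stage. The per-packet costs are made-up numbers."""
    cost = {"MD5": 1.0, "URL": 20.0, "AES": 20.0}
    return max(cost[k] / config[k] for k in config)


configs = greedy_allocation({"MD5": 1, "URL": 4, "AES": 1}, 8, toy_cycles)
final = configs[-1]
```

Each added thread costs one trial run per kernel, so the search grows linearly with the number of threads rather than with the number of possible configurations. Being greedy, it is not guaranteed to find the global optimum.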

5.4 Dynamic Tests

The Dynamic tests present an alternative approach to the Static tests. Where the Static tests represent ideal configurations, the Dynamic tests represent realistic configurations. Static allocation of microengine threads is also much less feasible, since all possible configurations must be run in order to determine the best one for the given end-to-end application. This could become an extremely complex and lengthy process. The trade-off with a dynamic heuristic is increased complexity in the logic of the application.

The purpose of these tests was to determine how an on-demand allocation of threads performs against a static approach. The Dynamic tests consist of all three kernels processing the same packet stream in sequence, as in the Static tests, but with threads dynamically allocated based on demand. Once again, the StrongARM serves as the controller and is responsible for allocating threads. Allocation is implemented through the use of queues for each stage of the end-to-end application. Each queue stores pointers to packets that are waiting to be processed by the next stage. The StrongARM detects when a queue has packets and creates threads to process them.
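The queue-based allocation can be sketched as a controller pass over per-stage queues. This is a simplified model under stated assumptions: the function names, the use of integer packet IDs in place of packet pointers, and the policy of serving later stages first are all choices made for this illustration and are not taken from the thesis.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

STAGES = ["MD5", "URL", "AES"]  # pipeline order of the end-to-end application


def dispatch(
    queues: Dict[str, Deque[int]],
    idle_threads: int,
) -> List[Tuple[str, int]]:
    """One controller pass: hand waiting packets to idle threads.

    Returns (stage, packet_id) assignments. Later stages are served
    first so packets already in flight drain before new ones start;
    that ordering is an assumption of this sketch.
    """
    assignments = []
    for stage in reversed(STAGES):
        while queues[stage] and idle_threads > 0:
            assignments.append((stage, queues[stage].popleft()))
            idle_threads -= 1
    return assignments


def complete(queues: Dict[str, Deque[int]], stage: str, packet_id: int) -> None:
    """When a thread finishes a stage, queue the packet for the next one."""
    nxt = STAGES.index(stage) + 1
    if nxt < len(STAGES):
        queues[STAGES[nxt]].append(packet_id)


queues = {stage: deque() for stage in STAGES}
queues["MD5"].extend([0, 1])      # two packets awaiting MD5
queues["URL"].extend([2, 3, 4])   # three packets awaiting URL
first_pass = dispatch(queues, idle_threads=4)
```

Because any idle thread can serve any stage, a backlog at URL draws threads toward URL without a fixed per-kernel assignment.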

Figure 5.9. Dynamic Speedup on 50 Packets

Figure 5.9 shows the speedup of the Dynamic application using as a baseline the Static configuration consisting of 1 MD5, 4 URL, and 1 AES thread. The speedup increases from 4.29 with 6 threads to 4.39 with 24 threads, a substantial increase over the Static baseline. Figure 5.10 shows the difference between the number of cycles required for each of the applications to process the same number of packets. While the Static version spent in the neighborhood of 1.3 billion cycles per 50 packets, the Dynamic version spent closer to 300 million, a ratio of 4.3:1.

Figure 5.10. Static vs. Dynamic Cycles on 50 Packets

This discrepancy can be attributed to cycles wasted on idle threads. With the Static version, each thread is statically assigned to perform either MD5, URL, or AES. Since the URL kernel requires much longer to run than MD5, the queue of packets waiting for URL processing is quickly filled, forcing the MD5 thread to stop processing new packets until URL can reduce the queue. At the same time, when the URL threads were unable to process packets as quickly as the AES threads, some AES threads wasted idle cycles. The Dynamic version did not suffer from these bottleneck issues because idle threads were put to use by whichever kernel required them.

Another benefit of the Dynamic version is that it is able to adjust to changes in load caused by varying packet sizes and payloads. Specifically, since URL performs a thorough string matching on the payload of each packet, the size of the packet has a large effect on the number of cycles required to process it. The Dynamic version is able to minimize bottlenecks in URL due to large packets by putting more threads to work on the bottleneck.


Figure 5.10 also shows that neither the Static nor the Dynamic version of the end-to-end application benefits much from additional threads. The number of cycles remains relatively constant from 6 to 24 threads. The Isolation tests show that between 6 and 24 threads MD5 is the only kernel to experience significant performance improvement. The speedup of URL declines slightly and AES remains relatively constant. Therefore, with the exception of the MD5 kernel, the end-to-end applications experience performance characteristics similar to the Isolation tests. Once again, this can be explained by Amdahl’s Law [2], because MD5 constitutes only a small percentage of the overall computation. Thus, the performance of the end-to-end application is driven by the performance of the URL and AES kernels.
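To make the Amdahl’s Law argument concrete, the following sketch evaluates the formula for two illustrative cases. The fractions and speedups are assumptions chosen only to show the effect; the thesis does not report the share of computation belonging to each kernel.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Illustrative values only; they are not measurements from the thesis.
small_kernel = amdahl_speedup(p=0.05, s=6.0)  # kernel is 5% of the work
large_kernel = amdahl_speedup(p=0.60, s=1.5)  # kernel is 60% of the work
```

With these assumed values, a sixfold speedup of a kernel that makes up 5% of the work improves the whole by about 4%, while a modest 1.5x speedup of a kernel that makes up 60% of the work improves it by 25%.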

5.5 Analysis

We performed four types of tests for our analysis: Isolation, Shared, Static, and Dynamic. The Isolation tests established a baseline and explored kernel behavior on the multi-threaded NPU architecture. The Shared tests explored how each kernel was affected by the concurrent execution of other kernels. The Static tests revealed characteristics of an end-to-end application and how to best distribute threads. Finally, the Dynamic tests served to compare an on-demand thread allocation algorithm to statically allocated threads.

The Isolation tests revealed general characteristics of each kernel on both the StrongARM and microengines. MD5 offered strong speedup on microengine threads using conventional parallelization. URL, using an alternative approach to multi-threading, provided maximum speedup between 4 and 6 threads employing either a polling or non-polling scheme. Finally, AES revealed an algorithm with poor performance on the microengines that could not be overcome by multi-threading.

The Shared tests revealed that MD5 and AES are relatively insensitive to the concurrent execution of the other applications on a single NPU. URL, however, was shown to be sensitive because its speedup suffered when it was run alongside the other kernels.

The Static tests provided a baseline for the Dynamic tests and revealed the optimal thread configurations for running the end-to-end application. Results showed that the demands of AES and URL are similar and parallelization offered increased performance for these applications, while the simplicity of MD5 made parallelization of it in the context of this end-to-end application unnecessary.

As an alternative to statically allocating threads, the Dynamic tests explored the benefits of dynamically allocating threads. Overall, the Dynamic tests required less than 25% as many cycles to process each 50-packet test as their Static counterparts.


Chapter 6 Conclusion
We have presented a network processor simulator, multi-threaded end-to-end benchmark applications, and an analysis of the characteristics of these applications on NPUs. Our first contribution was the creation of a simulator that emulates a generic network processor modeled on the Intel IXP1200. Our simulator fills a gap in existing academic research by supporting multiple processing units. Our second contribution was the construction of multi-threaded, end-to-end application benchmarks. These benchmarks extend the functionality of existing benchmarks based on single-threaded kernels. Our final contribution was an analysis of the characteristics of our benchmarks on our network processor simulator.

Our analysis in Chapter 5 found several interesting results. Firstly, although the MD5 kernel scaled well in the Isolation and Shared tests, parallelization of it in an end-to-end application had little effect due to Amdahl’s Law. Secondly, the Static and Dynamic tests found that the end-to-end application did not gain much performance from the addition of more than 6 threads. Finally, the Dynamic version of the end-to-end application required less than 25% as many cycles to process the same packet stream compared to the Static version.

In an attempt to bridge the gap between the speed of ASIC chips and the flexibility of general purpose processors, NPUs utilize parallel processing, special-purpose hardware, and memory structure, as well as other techniques. While NPUs make it possible to deploy complex end-to-end applications into the network, high speed networks put heavy load on these devices, making application optimization an important area of research. The simulator presented in this paper made possible the development and analysis of two end-to-end application benchmarks as well as the kernels making up these applications. Through the development of these kernels and applications we explored several parallelization techniques. Using our simulator and testing methodology, we unveiled the performance characteristics of these kernels and application benchmarks on a typical NPU.


Chapter 7 Future Work
The simulator developed in this work provides a tool that can be used in a variety of future projects. Thus far, the simulator has been used by Gridley in his Master’s thesis on active network algorithm performance [9] and by Tsudama to test his denial-of-service detection algorithm as part of his Master’s thesis [29]. As future work, several improvements could be made to the existing simulator, including support for dedicated processing chips, larger cycle count capability, and updates necessary to model the current generation of NPUs.

Other future work could include testing the existing end-to-end applications on an updated simulator to determine whether or not the performance problems found in this work have been overcome by the current generation of NPUs. If the same performance problems remain, further investigation into methods of designing parallel applications to avoid bottlenecks on NPUs will be required. Additionally, the parameters of the NPU architecture could be adjusted to determine which changes lead to performance improvements. However, if performance bottlenecks are not found on current NPUs, then larger scale end-to-end applications should be developed to push the performance limits of the architecture and reveal new bottlenecks.

The benchmark suite could be extended by including additional kernels. The end-to-end applications could be extended to include these kernels, or new end-to-end applications could be developed to model other real-world scenarios. Optimization of the current and future kernels and end-to-end applications will continue to be an open area of research.