This architecture consists of a square grid of processor/memory elements.



A single control unit broadcasts instructions which are carried out in lockstep by all the processors, each one using its own data from its own memory. The array processor is well-suited to calculations on matrices.

SIMD (Single Instruction stream, Multiple Data stream): a computer that performs one operation on multiple sets of data. It is typically used to add or multiply eight or more sets of numbers at the same time, for multimedia encoding and rendering as well as scientific applications. Hardware registers are loaded with numbers, and the mathematical operation is performed on all registers simultaneously.

Configuration I consists of N synchronized PEs, all of which are under the control of one CU. Each PEi is essentially an ALU with attached working registers and a local memory PEMi for the storage of distributed data. The CU also has its own main memory for the storage of programs. The function of the CU is to decode all instructions and determine where the decoded instructions should be executed. Scalar or control-type instructions are executed directly inside the CU. Vector instructions are broadcast to the PEs for distributed execution.
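As a rough sketch of this lockstep model (illustrative Python; the PE class and the broadcast routine are assumptions made for exposition, not any real machine's interface), a CU broadcast can be modeled as one decoded instruction that every PE applies to its own local memory:

# Minimal sketch of SIMD lockstep execution: one CU, N PEs, each PE
# owning a private local memory (PEM). All names are illustrative.
N = 8

class PE:
    def __init__(self, i, mem_size=16):
        self.i = i                   # PE index
        self.acc = 0                 # working register (ALU accumulator)
        self.pem = [0] * mem_size    # local memory PEMi

pes = [PE(i) for i in range(N)]

def broadcast(opcode, addr):
    """CU broadcasts one vector instruction; all PEs execute in lockstep."""
    for pe in pes:                   # conceptually simultaneous
        if opcode == "LOAD":
            pe.acc = pe.pem[addr]
        elif opcode == "ADD":
            pe.acc += pe.pem[addr]
        elif opcode == "STORE":
            pe.pem[addr] = pe.acc

# Each PE works on its own data: pem[0] + pem[1] -> pem[2], in parallel.
for pe in pes:
    pe.pem[0], pe.pem[1] = pe.i, 10 * pe.i
broadcast("LOAD", 0)
broadcast("ADD", 1)
broadcast("STORE", 2)
print([pe.pem[2] for pe in pes])     # [0, 11, 22, ..., 77]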

Configuration II differs from Configuration I in two aspects. First, the local memories attached to the PEs are replaced by parallel memory modules shared by all the PEs through an alignment network. Second, the inter-PE permutation network is replaced by this inter-PE memory alignment network, which is again controlled by the CU.

Formally, an SIMD computer is characterized by the following set of parameters:

C = <N, F, I, M>

N = the number of PEs in the system.

F = a set of data-routing functions.

I = the set of machine instructions.

M = the set of masking schemes.

Masking and data routing mechanisms

In an array processor, vector operands can be specified by the registers to be used or by the memory addresses to be referenced. For memory-reference instructions, each PEi accesses its local PEMi, offset by its own index register Ii. The Ii register modifies the global memory address broadcast from the CU. Thus, different locations in different PEMs can be accessed simultaneously with the same global address specified by the CU. The following example shows how indexing can be used to address the local memories in parallel at different local addresses.

Suppose the n × n array A is stored so that element A(i, j) resides at address α + i of PEMj, with base address α = 100. To access the diagonal elements A(j, j), the index register Ij of each PEj is set to j for all j = 0, 1, 2, ..., n-1 in order to convert the global address 100 to local addresses 100 + Ij = 100 + j for each PEMj. Within each PE, there should be a separate memory address register for holding these local addresses. However, if one wishes to address a row of the array A, say the ith row A(i, j) for j = 0, 1, 2, ..., n-1, all the Ij registers will be reset to 0 for all j = 0, 1, 2, ..., n-1 in order to ensure the parallel access of the entire row.
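A minimal sketch of this indexed addressing (illustrative Python; the variable names are hypothetical): every PEMj is accessed at the broadcast global address offset by its own index register Ij.

# Sketch: a global address broadcast by the CU is offset per PE by its
# index register. With Ij = j, global address 100 maps to 100 + j in PEMj.
n = 4
pems = [[0] * 128 for _ in range(n)]      # local memories PEM0..PEM3
index_reg = list(range(n))                # Ij = j (diagonal access)

def parallel_read(global_addr):
    # every PEMj is accessed at its own local address simultaneously
    return [pems[j][global_addr + index_reg[j]] for j in range(n)]

for j in range(n):                        # place A(j, j) at address 100 + j
    pems[j][100 + j] = "A(%d,%d)" % (j, j)
print(parallel_read(100))   # ['A(0,0)', 'A(1,1)', 'A(2,2)', 'A(3,3)']

# Resetting all Ij to 0 makes the same broadcast address select the same
# local address in every PEM, which is how an entire row is fetched.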

Example 5.2

To illustrate the necessity of data routing in an array processor, we show the execution details of the following vector instruction in an array of N PEs. The sum S(k) of the first k components in a vector A is desired for each k from 0 to n-1. Let A = (A0, A1, ..., An-1). We need to compute the following n summations:

S(k) = A0 + A1 + ... + Ak    for k = 0, 1, ..., n-1

These n vector summations can be computed recursively in log2 N routing-and-add steps, illustrated here for N = n = 8 with Ri denoting a working register of PEi that initially holds Ai. In the first step, the contents of Ri are routed to Ri+1 for i = 0 to 6. In the second step, the intermediate sums in Ri are routed to Ri+2 for i = 0 to 5. In the final step, the intermediate sums in Ri are routed to Ri+4 for i = 0 to 3. Consequently, PEk has the final value of S(k) for k = 0, 1, 2, ..., 7.

As far as the data-routing operations are concerned, PE7 is not involved (receiving but not transmitting) in step 1. PE7 and PE6 are not involved in step 2. Also PE7, PE6, PE5, and PE4 are not involved in step 3. These unwanted PEs are masked off during the corresponding steps. During the addition operations, PE0 is disabled in step 1; PE0 and PE1 are made inactive in step 2; and PE0, PE1, PE2, and PE3 are masked off in step 3. The PEs that are masked off in each step depend on the operation (data-routing or arithmetic-addition) to be performed. Therefore, the masking patterns keep changing in the different operation cycles, as demonstrated by the example. Note that the masking and routing operations will be much more complicated when the vector length n > N.
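The routing and masking pattern of Example 5.2 can be sketched as follows (illustrative Python for N = 8; the dictionary of routed values stands in for the inter-PE routing network, and the loop bounds encode the masks):

# Sketch of Example 5.2: prefix sums S(k) on N = 8 PEs by recursive
# doubling. R[i] models register Ri of PEi; masked-off PEs simply
# do not send or do not add in a given step.
N = 8
A = [1, 2, 3, 4, 5, 6, 7, 8]       # sample vector A0..A7
R = A[:]                           # Ri initially holds Ai

step = 1                           # route distance 1, then 2, then 4
while step < N:
    # routing: PEi sends Ri to PE(i+step); PEs with i >= N - step masked off
    routed = {i + step: R[i] for i in range(N - step)}
    # addition: PEs with i < step receive nothing and are masked off
    for i, v in routed.items():
        R[i] += v
    step *= 2

print(R)    # [1, 3, 6, 10, 15, 21, 28, 36]; R[k] now holds S(k)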

Array processors are special-purpose computers for limited scientific applications. The PEs in the array are passive arithmetic units waiting to be called for parallel-computation duties. The permutation network among PEs is under program control from the CU. However, the principles of PE masking, global versus local indexing, and data permutation are not much changed across the different machines.

Inter-PE communications

There are four fundamental decisions in designing an appropriate architecture of an interconnection network for an SIMD machine. The decisions are made between operation modes, control strategies, switching methodologies, and network topologies.

Operation Mode:

Two types of communication can be identified: synchronous and asynchronous.

Control strategy:

The control-setting functions can be managed by a centralized controller or by the individual switching elements. The latter strategy is called distributed control, and the former corresponds to centralized control.

Switching Methodology:

The two major switching methodologies are circuit switching and packet switching.

Network topology:

The topologies can be grouped into two categories: static and dynamic. In a static topology, the dedicated links cannot be reconfigured; the links in a dynamic topology can be reconfigured.

The topological structure of an array processor is mainly characterized by the data-routing network used in interconnecting the processing elements. Such a network can be specified by a set of data-routing functions.

Static networks

Topologies in a static network can be classified according to the dimensions required for layout. Examples of one-dimensional topologies include the linear array. Two-dimensional topologies include the ring, star, tree, mesh, and systolic array. Three-dimensional topologies include the completely connected, chordal ring, 3-cube, and 3-cube-connected-cycle networks.

Dynamic networks

There are two classes of dynamic networks: single-stage and multistage.

Single stage networks

A single-stage network is a switching network with N input selectors (IS) and N output selectors (OS). Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where 1 ≤ D ≤ N and 1 ≤ M ≤ N. A single-stage network with D = M = N is a crossbar switching network. To establish a desired connecting path, different path-control signals are applied to all the IS and OS selectors.

A single-stage network is also called a recirculating network. Data items may have to recirculate through the single stage several times before reaching their final destination. The number of recirculations needed depends on the connectivity of the single-stage network. In general, the higher the hardware connectivity, the fewer the recirculations needed.

Multistage networks

Many stages of interconnected switches form a multistage network. Multistage networks are described by three characterizing features: the switch box, the network topology, and the control structure. Many switch boxes are used in a multistage network. Each box is essentially an interchange device with two inputs and two outputs. The four states of a switch box are: straight, exchange, upper broadcast, and lower broadcast.
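The four switch-box states reduce to a small mapping on an (upper, lower) input pair, as in this illustrative sketch:

# Sketch: a 2-by-2 interchange box and its four control states.
def switch_box(state, upper_in, lower_in):
    if state == "straight":            # upper -> upper, lower -> lower
        return upper_in, lower_in
    if state == "exchange":            # upper -> lower, lower -> upper
        return lower_in, upper_in
    if state == "upper_broadcast":     # upper input copied to both outputs
        return upper_in, upper_in
    if state == "lower_broadcast":     # lower input copied to both outputs
        return lower_in, lower_in
    raise ValueError("unknown state: " + state)

print(switch_box("exchange", "a", "b"))          # ('b', 'a')
print(switch_box("upper_broadcast", "a", "b"))   # ('a', 'a')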

Mesh-Connected Illiac Network

A single-stage recirculating network has been implemented in the Illiac IV array processor with 64 PEs. Each PEi is allowed to send data to PEi+1, PEi-1, PEi+r, and PEi-r, where r = √N.

Formally, the Illiac network is characterized by the following four routing functions:

R+1(i)=(i+1)mod N

R-1(i)=(i-1)mod N

R+r(i)=(i+r)mod N

R-r(i)=(i-r)mod N

A reduced Illiac network with N = 16 and r = 4 is illustrated in the figure. Its four routing functions define the permutations

R+1 = (0 1 2 ... N-1)

R-1 = (N-1 ... 2 1 0)

R+4 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)

R-4 = (12 8 4 0)(13 9 5 1)(14 10 6 2)(15 11 7 3)

The figure shows that four PEs can be reached from any PE in one step, seven more PEs in two steps, and the remaining four PEs in three steps. In general, it takes I steps to route data from PEi to any other PEj in an Illiac network of size N, where I is upper-bounded by I ≤ √N - 1.
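These step counts can be verified with a short breadth-first search over the four routing functions (illustrative Python for the reduced N = 16, r = 4 network):

# Sketch: count how many PEs are first reached at each distance in the
# reduced Illiac network (N = 16, r = 4).
from collections import deque

N, r = 16, 4
routes = [lambda i: (i + 1) % N, lambda i: (i - 1) % N,
          lambda i: (i + r) % N, lambda i: (i - r) % N]

def distances(src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        for f in routes:
            j = f(i)
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

d = distances(0)
for s in (1, 2, 3):
    print(s, sorted(p for p, v in d.items() if v == s))
# distance 1: 4 PEs; distance 2: 7 PEs; distance 3: the remaining 4 PEs.
# The maximum distance is 3 = √N - 1, matching the upper bound on I.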

Cube Interconnection Networks

The cube network can be implemented as a multistage network for SIMD machines. Formally, an n-dimensional cube network of N = 2^n PEs is specified by the following n routing functions. In the three-dimensional case, vertical lines connect vertices whose addresses differ in the most significant bit position, vertices at both ends of the diagonal lines differ in the middle bit position, and horizontal lines connect vertices whose addresses differ in the least significant bit position.

Ci(an-1 ... ai+1 ai ai-1 ... a0) = an-1 ... ai+1 āi ai-1 ... a0    for i = 0, 1, 2, ..., n-1

where āi denotes the complement of bit ai.
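Since complementing bit i of a binary address is a single exclusive-or, each cube routing function reduces to one line of code (illustrative sketch):

# Sketch: cube routing function Ci complements bit i of the PE address.
def cube_route(i, addr):
    return addr ^ (1 << i)             # flip the ith address bit

n = 3                                   # a 3-cube: 8 PEs, addresses 0..7
for i in range(n):
    print(i, [cube_route(i, a) for a in range(2 ** n)])
# C0 pairs vertices differing in the least significant bit;
# C2 pairs those differing in the most significant bit.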

PARALLEL ALGORITHMS FOR ARRAY PROCESSORS

The original motivation for developing SIMD array processors was to perform parallel computations on vector or matrix types of data. Parallel processing algorithms have been developed by many computer scientists for SIMD computers. Important SIMD algorithms can be used to perform matrix multiplication, fast Fourier transform (FFT), matrix transposition, summation of vector elements, matrix inversion, parallel sorting, linear recurrence, and boolean matrix operations, and to solve partial differential equations. We study below several representative SIMD algorithms for matrix multiplication, parallel sorting, and parallel FFT. We shall analyze the speedups of these parallel algorithms over the sequential algorithms on SISD computers. The implementation of these parallel algorithms on SIMD machines is described by concurrent ALGOL. The physical memory allocations and program implementation depend on the specific architecture of a given SIMD machine.

SIMD Matrix Multiplication

Many numerical problems suitable for parallel processing can be formulated as matrix computations. Matrix manipulation is frequently needed in solving linear systems of equations. Important matrix operations include matrix multiplication, L-U decomposition, and matrix inversion. We present below two parallel algorithms for matrix multiplication. The differences between SISD and SIMD matrix algorithms are pointed out in their program structures and speed performances. In general, the inner loop of a multilevel SISD program can be replaced by one or more SIMD vector instructions.

There are n^3 cumulative multiplications to be performed in the matrix product C = A × B of two n × n matrices (Eq. 5.22). A cumulative multiplication refers to the linked multiply-add operation c = c + a × b. The addition is merged into the multiplication because the multiply is equivalent to a multioperand addition. Therefore, we can consider the unit time as the time required to perform one cumulative multiplication, since add and multiply are performed simultaneously.

In a conventional SISD uniprocessor system, the n^3 cumulative multiplications are carried out by a serially coded program with three levels of DO loops corresponding to the three indices to be used. The time complexity of this sequential program is proportional to n^3, as specified in the following SISD algorithm for matrix multiplication.

An O(n^3) algorithm for SISD matrix multiplication

For i = 1 to n Do
  For j = 1 to n Do
    Cij = 0 (initialization)
    For k = 1 to n Do
      Cij = Cij + aik * bkj (scalar additive multiply)
    End of k loop
  End of j loop
End of i loop
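For comparison, the loop nest above in runnable form (plain Python with 0-based indices; a direct transcription, not machine-specific code):

# Sketch: O(n^3) SISD matrix multiply, one cumulative multiplication
# per innermost iteration.
def matmul_sisd(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]   # c = c + a * b
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_sisd(A, B))    # [[19, 22], [43, 50]]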

Now, we want to implement the matrix multiplication on an SIMD computer with n PEs. The algorithm construct depends heavily on the memory allocations of the A, B, and C matrices in the PEMs. Column vectors are stored within the same PEM. This memory allocation scheme allows parallel access of all the elements in each row vector of the matrices. Based on this data distribution, we obtain the following parallel algorithm. The two parallel do operations correspond to the vector load for initialization and the vector multiply for the inner loop of additive multiplications. The time complexity has been reduced to O(n^2). Therefore, the SIMD algorithm is n times faster than the SISD algorithm for matrix multiplication.

An O(n^2) algorithm for SIMD matrix multiplication

For i = 1 to n Do
  Par for k = 1 to n Do
    Cik = 0 (vector load)
  For j = 1 to n Do
    Par for k = 1 to n Do
      Cik = Cik + aij * bjk (vector multiply)
  End of j loop
End of i loop
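A sketch of the SIMD version (illustrative Python): each "Par for k" is modeled as one whole-row vector operation, standing in for n PEs executing in lockstep, so each row of C costs n + 1 parallel steps rather than n^2 scalar steps.

# Sketch: O(n^2) SIMD matrix multiply. Every list comprehension over k
# models one parallel step performed across n PEs.
def matmul_simd(A, B):
    n = len(A)
    C = []
    for i in range(n):
        row = [0] * n                  # Par for k: vector load, one step
        for j in range(n):
            a = A[i][j]                # scalar aij broadcast by the CU
            row = [row[k] + a * B[j][k] for k in range(n)]  # vector multiply
        C.append(row)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_simd(A, B))    # [[19, 22], [43, 50]]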

It should be noted that the vector load operation is performed to initialize the row vectors of matrix C one row at a time. In the vector multiply operation, the same multiplier aij is broadcast from the CU to all PEs to multiply with the n elements of the jth row of B. If the number of PEs is increased to n^2, an O(n log2 n) algorithm can be devised to multiply two n × n matrices A and B. Let n = 2^m. Consider an array processor whose n^2 = 2^(2m) PEs are located at the 2^(2m) vertices of a 2m-cube network. A 2m-cube network can be considered as two (2m-1)-cube networks linked together by 2^(2m-1) extra edges. In the figure, a 4-cube network is constructed from two 3-cube networks by using 8 extra edges between corresponding vertices at the corner positions. For clarity, we simplify the 4-cube drawing by showing only one of the eight fourth-dimension connections. The remaining connections are implied.

Let (p2m-1 p2m-2 ... pm pm-1 ... p1 p0)2 be the PE address in the 2m-cube. We can achieve the O(n log2 n) compute time only if initially the matrix elements are favorably distributed in the PE vertices. The n rows of matrix A are distributed over n distinct PEs whose addresses satisfy the condition

p2m-1 p2m-2 ... pm = pm-1 pm-2 ... p0

as demonstrated in Figure 5.20a for the initial distribution of the four rows of matrix A in a 4 × 4 matrix multiplication (n = 4, m = 2). The four rows of A are then broadcast over the fourth dimension and the front-to-back edges, as marked by the row numbers in Figure 5.20b. The n columns of matrix B (or the n rows of matrix B') are evenly distributed over the PEs of the 2m-cube, as illustrated in Figure 5.20c. The four rows of B' are then broadcast over the front and back faces, as shown in Figure 5.20d. Figure 5.21 shows the combined results of the A and B' broadcasts, with the inner products ready to be computed. The n-way broadcast depicted in Figures 5.20b and 5.20d takes log2 n steps.

Parallel Sorting on Array Processors

Thompson and Kung have developed a parallel algorithm that sorts n^2 elements on a mesh-connected (Illiac-IV-like) processor array in O(n) routing and comparison steps. This shows a speedup of O(log2 n) over the best sorting algorithm, which takes O(n log2 n) steps on a uniprocessor system. We assume an array processor with N = n^2 identical PEs interconnected by a mesh network similar to that of the Illiac IV, except that the PEs at the perimeter have two or three rather than four neighbors. In other words, there are no wraparound connections in this simplified mesh network.

Eliminating the wraparound connections simplifies the array-sorting algorithm. The time complexity of the array-sorting algorithm would be affected by, at most, a factor of two if the wraparound connections were included.

Two time measures are needed to estimate the time complexity of the parallel sorting algorithm. Let tR be the routing time required to move one item from a PE to one of its neighbors, and tC be the comparison time required for one comparison step. Concurrent data routing is allowed. Up to N comparisons may be performed simultaneously. This means that a comparison-interchange step between two items in adjacent PEs can be done in 2tR + tC time units (route left, compare, and route right). A mixture of horizontal and vertical comparison interchanges requires at least 4tR + tC time units.

The sorting problem depends on the indexing scheme on the PEs. The PEs may be indexed by a bijection from {1, 2, ..., n} × {1, 2, ..., n} to {0, 1, ..., N-1}, where N = n^2. The sorting problem can be formulated as moving the jth smallest element to the PE indexed by j, for all j = 0, 1, 2, ..., N-1. Illustrated in the figure are three indexing patterns formed after sorting the given array in part a with respect to three different ways of indexing the PEs. The pattern in part b corresponds to a row-major indexing, part c corresponds to a shuffled row-major indexing, and part d is based on a snake-like row-major indexing. The choice of a particular indexing scheme depends upon how the sorted elements will be used. We are interested in designing sorting algorithms which minimize the total routing and comparison steps.

The longest routing path on the mesh in a sorting process is the transposition of two elements initially loaded at opposite corner PEs, as illustrated in Figure 5.24. This transposition needs at least 4(n-1) routing steps. This means that no algorithm can sort n^2 elements in a time of less than O(n). In other words, an O(n) sorting algorithm is considered optimal on a mesh of n^2 PEs. Before we show one such optimal sorting algorithm on the mesh-connected PEs, let us review Batcher's odd-even merge sort of two sorted sequences on a set of linearly connected PEs, shown in the figure. The shuffle and unshuffle operations can each be implemented with a sequence of interchange operations (marked by the double arrows in the figure). Both the perfect shuffle and its inverse (unshuffle) can be done in k-1 interchanges, or 2(k-1) routing steps, on a linear array of 2k PEs.

Batcher's odd-even merge sort on a linear array has been generalized by Thompson and Kung to a square array of PEs. Let M(j, k) be a sorting algorithm for merging two sorted j-by-k/2 subarrays to form a sorted j-by-k array, where j and k are powers of 2 and k > 1. The snake-like row-major ordering is assumed in all the arrays. In the degenerate case of M(1, 2), a single comparison-interchange step is sufficient to sort two unit subarrays. Given two sorted columns of length j ≥ 2, the M(j, 2) algorithm consists of the following steps:

Example 5.6: The M(j, 2) sorting algorithm

J1: Move all odds to the left column and all evens to the right column in 2tR time.
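For reference, Batcher's odd-even merge itself, which M(j, k) generalizes to two dimensions, can be sketched recursively for two sorted sequences of equal power-of-two length (illustrative Python, not the mesh formulation):

# Sketch of Batcher's odd-even merge: recursively merge the even- and
# odd-indexed subsequences, interleave the results, then do one
# comparison-interchange sweep over adjacent pairs.
def odd_even_merge(a, b):
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    merged = [None] * (len(a) + len(b))
    merged[0::2], merged[1::2] = evens, odds
    for i in range(1, len(merged) - 1, 2):     # comparison-interchange
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

print(odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]))
# [1, 2, 3, 4, 5, 6, 7, 8]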

ASSOCIATIVE ARRAY PROCESSING

In this section, we describe the functional organization of an associative array processor and the various parallel processing functions that can be performed on an associative processor. We classify associative processors based on associative-memory organizations. Finally, we identify the major searching applications of associative memories and associative processors. Associative processors have been built only as special-purpose computers for dedicated applications in the past.

Associative Memory Organizations

Data stored in an associative memory are addressed by their contents. In this sense, associative memories have been known as content-addressable memories, parallel search memories, and multiaccess memories. The major advantage of associative memory over RAM is its capability of performing parallel search and parallel comparison operations. These are needed in many important applications, such as the storage and retrieval of rapidly changing databases, radar-signal tracking, image processing, computer vision, and artificial intelligence. The major shortcoming of associative memory is its much increased hardware cost. At present, the cost of associative memory is much higher than that of RAM.

The structure of an associative memory (AM) is modeled in the figure. The associative memory array consists of n words with m bits per word. Each cell in the array consists of a flip-flop associated with some comparison logic gates for pattern matching and read/write control. A bit slice is a vertical column of the bit cells of all the words at the same position. Each bit cell Bij can be written into, read out, or compared with an external interrogating signal. The comparand register C = (C1, C2, ..., Cm) is used to hold the key operand being searched for. The masking register M = (M1, M2, ..., Mm) is used to enable the bit slices to be involved in the parallel comparison operations across all the words in the associative memory.
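A sketch of the resulting masked parallel comparison (illustrative Python; the word values and register widths are made up for the example):

# Sketch: bit-parallel associative search. A word matches when it equals
# the comparand C in every bit slice enabled by the masking register M.
m = 8                                   # bits per word
words = [0b10110010, 0b10110111, 0b00110010, 0b10100010]   # n = 4 words

C = 0b10110000                          # comparand register (search key)
M = 0b11110000                          # mask: compare the high 4 bits only

# In hardware all n words are compared simultaneously; the loop models that.
tags = [(w & M) == (C & M) for w in words]
print(tags)                             # [True, True, False, False]
print([i for i, t in enumerate(tags) if t])   # indices of matching words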

In practice, most associative memories have the capability of word-parallel operations; that is, all words in the associative memory array are involved in the parallel search operations. This differs drastically from the word-serial operations encountered in RAMs. Based on how the bit slices are involved in the operation, we consider below two different associative memory organizations.

Bit-parallel organization: In a bit-parallel organization, the comparison process is performed in a parallel-by-word and parallel-by-bit fashion. All bit slices which are not masked off by the masking pattern are involved in the comparison process. In this organization, word-match tags for all words are used (Figure 5.34a). Each cross point in the array is a bit cell. Essentially, the entire array of cells is involved in a search operation.

Bit-serial organization: The memory organization in Figure 5.34b operates with one bit slice at a time across all the words. The particular bit slice is selected by an extra logic and control unit. The bit-cell readouts are used in subsequent bit-slice operations. The associative processor STARAN has the bit-serial memory organization, and the PEPE has been installed with the bit-parallel organization.

Associative memories are used mainly for the search and retrieval of non-numeric information. The bit-serial organization requires less hardware but is slower in speed. The bit-parallel organization requires additional word-match detection logic but is faster in speed. We present below an example to illustrate the search operation in a bit-parallel associative memory. Bit-serial associative memory will be presented in