Tuesday, 28 January 2014

While moving from single instruction single data (SISD) to single instruction multiple data (SIMD), multiple instruction single data (MISD), or multiple instruction multiple data (MIMD), almost every layer of the computer system hierarchy is affected.

SISD, SIMD, MISD and MIMD are all variations of the von Neumann concept.

Chipset:

A chipset is a set of electronic components in an integrated circuit that manages the data flow between the processor, memory and peripherals. Because it controls communications between the processor and external devices, the chipset plays a crucial role in determining system performance.

In the above block diagram, we have multiple instructions IS1, IS2, IS3 given to individual control units, which are in turn connected to processing units. There is a latch between each pair of processing units that stores the data produced by the previous stage.

Let's take an example from image processing:

Suppose we have three instructions: decreasing the intensity of the image (IS1), encoding a message in the image (IS2), and then decreasing the size of the image (IS3).

All the instructions are issued to the CUs in parallel. PU1 takes the image data from main memory and processes IS1. The processed data is then stored in the latch provided after PU1. While that data is being saved, the data of the next image starts transferring to PU1, and the data in the first latch is processed by PU2. This continues until all the data has been processed.
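The image example above can be sketched as a small simulation. This is a toy model: the stage functions (intensity decrease, bit-toggle "encoding", halving as a stand-in for resizing) and the pixel-stream representation are my own hypothetical choices, not from the notes; the point is only that each PU works on its latch input while new data enters PU1.

```python
# Toy model of the 3-stage MISD-style pipeline: each PU applies its own
# instruction to the data stream; a latch holds each stage's output.

def decrease_intensity(pixel):   # IS1
    return max(pixel - 10, 0)

def encode_bit(pixel):           # IS2: toggle the lowest bit (toy steganography)
    return pixel ^ 1

def downscale(pixel):            # IS3: halve the value, standing in for resizing
    return pixel // 2

STAGES = [decrease_intensity, encode_bit, downscale]

def run_pipeline(stream):
    latches = [None] * len(STAGES)      # one latch after each PU
    out = []
    stream = list(stream)
    while stream or any(v is not None for v in latches):
        # one clock tick: drain from the last stage backwards so each
        # datum advances exactly one latch per tick
        if latches[-1] is not None:
            out.append(latches[-1])
        for i in range(len(STAGES) - 1, 0, -1):
            latches[i] = STAGES[i](latches[i - 1]) if latches[i - 1] is not None else None
        latches[0] = STAGES[0](stream.pop(0)) if stream else None
    return out

print(run_pipeline([100, 101, 102]))  # → [45, 45, 46]
```

Note how the second image's data enters PU1 on the very tick the first image's data moves to PU2, which is the pipelining behaviour the text describes.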

The above example shows that there is a kind of pipelining being used in data processing.

Systolic arrays : A systolic array is a pipelined network arrangement of processing units called cells. It is a specialized form of parallel computing in which the cells (i.e. processors) compute data and store it independently of each other. Each cell shares its result with its neighbours immediately after processing.

Multiple Instruction Multiple Data :

MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD machines fall into either the shared memory or the distributed memory category.

Shared memory models :

The processors are all connected to globally available memory. The OS usually maintains coherence.

Examples of shared memory multiprocessors are :

1.NUMA (Non uniform memory access) : Under NUMA a processor can access its own local memory faster than non-local memory. The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users.

2.UMA (Uniform Memory Access) : All the processors share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

Distributed memory models :

In distributed memory MIMD machines, each processor has its own individual memory. A processor has no direct knowledge of other processors' memory. For data to be shared, it must be passed from one processor to another as a message.

Friday, 24 January 2014

Pipelining is an implementation technique where multiple instructions are overlapped in execution. The computer pipeline is divided in stages. Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form a pipe - instructions enter at one end, progress through the stages, and exit at the other end.

·F: Fetch

·D: Decode

·C: Calculating the address of operand

·E: Execute

·W: Write back
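The stage sequence above can be visualized with a small helper of my own that prints an ideal, hazard-free space-time chart for the 6-slot pipeline used in these notes (F D C F E W, where the second F is the operand fetch):

```python
# Print an ideal (hazard-free) space-time chart: instruction i enters
# the pipe at cycle i+1, so n instructions need len(STAGES)+n-1 cycles.

STAGES = ["F", "D", "C", "F", "E", "W"]

def chart(n_instr):
    rows = []
    for i in range(n_instr):
        row = ["."] * i + STAGES + ["."] * (n_instr - 1 - i)
        rows.append("Instr %d: %s" % (i + 1, " ".join(row)))
    return rows

for line in chart(3):
    print(line)
```

For 3 instructions the chart spans 6 + (3 − 1) = 8 cycles; the hazards discussed next are exactly the cases where this ideal overlap cannot be sustained.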

Types of hazards:

1. Structural hazards

2. Data hazards

3. Control hazards

Structural Hazards

Structural hazards occur when a certain resource (memory, functional unit) is requested by more than

one instruction at the same time.

Clock cycle →    1  2  3  4  5  6  7  8  9  10

Instr. i         F  D  C  F  E  W
Instr. i+1          F  D  C  F  E  W
Instr. i+2             F  D  C  F  E  W
Instr. i+3                -  F  D  C  F  E  W   (stalled 1 cycle)
Instr. i+4                      F  D  C  F  E  W

The 1st F is for fetching the instruction and the 2nd F is for fetching the operand; instruction i+3 must wait because its instruction fetch clashes with instruction i's operand fetch.

In class we were told that after decode there are 3 execute operations; here the first 2 are given specific tasks (operand-address calculation and operand fetch) to make the example easier to follow.

Penalty: 1 cycle

Data Hazards

Consider two instructions I1 and I2, where I2 enters the pipeline after I1. If at some stage of the pipeline I2 needs a result produced by I1, but that result has not yet been generated, we have a data hazard; I2 is said to be data-dependent on I1, i.e. a data dependency.

I1: MUL R2,R3    (R2 ← R2 * R3)

I2: ADD R1,R2    (R1 ← R1 + R2)

Before executing its F (operand fetch) stage, the ADD instruction is stalled until the MUL instruction has written its result into R2.

Penalty: 2 cycles
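The dependency in the MUL/ADD pair can be checked mechanically. A minimal sketch, using a hypothetical `(opcode, dest, sources)` instruction format of my own:

```python
# Detect a read-after-write (RAW) dependency between two instructions.

def raw_hazard(producer, consumer):
    """True if `consumer` reads a register that `producer` writes."""
    _, dest, _ = producer
    _, _, sources = consumer
    return dest in sources

i1 = ("MUL", "R2", ("R2", "R3"))   # R2 <- R2 * R3
i2 = ("ADD", "R1", ("R1", "R2"))   # R1 <- R1 + R2

print(raw_hazard(i1, i2))  # → True: I2 must wait for I1's write to R2
```

A real pipeline interlock does this comparison in hardware between the destination register of instructions in flight and the source registers of the instruction about to read its operands.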

Control Hazards

This type of hazard is caused by uncertainty about the execution path: branch taken or not taken. It arises when we branch to a new location in the program, invalidating everything we have loaded into the pipeline.

The pipeline stalls until the branch target is known.

Clock cycle →   1  2  3  4  5  6  7  8  9  10

BR TARGET       F  D  C  F  E  W
target                      F  D  C  F  E  W   (cycles 2, 3 and 4 are stalled)

The instruction at the branch target is not fetched, and hence not executed, until the target address is known.

Example of in-order and reordered execution

Z ← X + Y                        C ← A * B

Instr. 1> R1 ← Mem(X)            Instr. 5> R5 ← Mem(A)
Instr. 2> R2 ← Mem(Y)            Instr. 6> R6 ← Mem(B)
Instr. 3> R3 ← R1 + R2           Instr. 7> R7 ← R5 * R6
Instr. 4> Mem(Z) ← R3            Instr. 8> Mem(C) ← R7

Clock cycle →   1  2  3  4  5  6  7  8  9  10

Instr. 1        F  D  I  C  F  E  W
Instr. 2           F  D  I  C  F  E  W
Instr. 3              F  D  I  C  F  E  W
Instr. 4                 F  D  I  C  F  E  W

Now, if the instructions are executed in order, i.e. 1 2 3 4 5 6 7 8, there is stalling (Instr. 3 must wait for R1 and R2, Instr. 4 for R3).

But if they are reordered as 1 2 5 6 3 4 7 8, the stalling can be reduced: the independent loads of A and B fill the slots where Instr. 3 would otherwise be waiting.
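The benefit of reordering can be counted with a toy stall model. The model is my own assumption, not from the notes: an instruction stalls if it needs a result produced fewer than 3 instructions earlier; each instruction is a hypothetical `(dest, sources)` pair, with memory writes having `dest=None`.

```python
# Count stalls for the 8-instruction example under a simple distance rule.

PROGRAM = {
    1: ("R1", ()),            # R1 <- Mem(X)
    2: ("R2", ()),            # R2 <- Mem(Y)
    3: ("R3", ("R1", "R2")),  # R3 <- R1 + R2
    4: (None, ("R3",)),       # Mem(Z) <- R3
    5: ("R5", ()),            # R5 <- Mem(A)
    6: ("R6", ()),            # R6 <- Mem(B)
    7: ("R7", ("R5", "R6")),  # R7 <- R5 * R6
    8: (None, ("R7",)),       # Mem(C) <- R7
}

def count_stalls(order, distance=3):
    stalls = 0
    for pos, ins in enumerate(order):
        _, sources = PROGRAM[ins]
        for src in sources:
            for back in range(1, distance):       # look distance-1 instrs back
                if pos - back >= 0 and PROGRAM[order[pos - back]][0] == src:
                    stalls += 1
                    break
            else:
                continue
            break                                  # at most one stall per instruction
    return stalls

print(count_stalls([1, 2, 3, 4, 5, 6, 7, 8]))  # in-order  → 4
print(count_stalls([1, 2, 5, 6, 3, 4, 7, 8]))  # reordered → 2
```

Under this model, the in-order sequence stalls four times while the reordered sequence stalls only twice, which is the qualitative point the notes make.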

Static pipelining                               Dynamic pipelining

> There is no reordering of instructions.       > There is reordering of instructions.
> If one clock cycle is empty for one           > Empty cycles are not necessary: due to
  instruction, it will also be empty for          reordering, another instruction can be
  the following instructions.                     executed in that slot.

Mechanism for Instruction Pipelining:

Here the use of caching, collision avoidance, multiple functional units, register tagging, and internal forwarding is explained; these mechanisms smooth the pipeline flow and remove bottlenecks.

Prefetch Buffer:

In one access time, one block of memory is loaded into the prefetch buffer. The block access time can be reduced by using a cache or interleaved memory modules.

Types of Prefetch Buffers:

1. Sequential Buffers

2. Target Buffers

3. Loop Buffers

Sequential and Target Buffers:

Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining. Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining. Both buffers operate in FIFO fashion. A conditional branch (such as an if condition) causes both the sequential and the target buffers to fill with instructions; once the branch condition is resolved, instructions are taken from the corresponding buffer and the instructions in the other buffer are discarded. Within each pair, one buffer can be used to load instructions from memory while the other feeds instructions to the pipeline.

Loop Buffer:

This buffer stores the sequential instructions of small loops. Loop buffers are maintained by the fetch stage of the pipeline. Prefetched instructions in the loop body are executed repeatedly until all iterations complete. It works in two steps:

1. It holds the instructions just ahead of the current instruction, which saves instruction fetch time from memory.

2. It recognizes when the target of a branch falls within the loop boundary, so the loop body can be supplied from the buffer instead of from memory.

To resolve data or resource dependences among successive instructions entering the pipeline, a reservation table is used with each functional unit. Operations wait in the table until their dependences are resolved; this removes bottlenecks in the pipeline.

Internal Data Forwarding:

Why do it: The throughput of a pipelined processor can be improved with internal data forwarding between functional units. Moreover, some memory operations can be replaced by register transfer operations.

Types:

1.Store Load forwarding

2.Load Load forwarding

3.Store Store forwarding

Store Load forwarding:

In store-load forwarding, a load (LD R2, M) that immediately follows a store (ST M, R1) to the same location can be replaced by a register move (MOVE R2, R1), since register transfers are faster than memory accesses.
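Store-load forwarding can be sketched as a peephole rewrite over an instruction list. The `(op, dst, src)` tuple format is a hypothetical one of my own, used only to illustrate the transformation:

```python
# Replace a load from an address that was just stored to with a
# register move, since the stored value is still in a register.

def forward_store_load(code):
    out = []
    for ins in code:
        op, dst, src = ins
        if op == "LOAD" and out:
            prev_op, prev_dst, prev_src = out[-1]
            if prev_op == "STORE" and prev_dst == src:
                # the loaded value is still in prev_src: move it instead
                ins = ("MOVE", dst, prev_src)
        out.append(ins)
    return out

code = [("STORE", "M", "R1"),   # Mem(M) <- R1
        ("LOAD",  "R2", "M")]   # R2 <- Mem(M)
print(forward_store_load(code))
# → [('STORE', 'M', 'R1'), ('MOVE', 'R2', 'R1')]
```

In hardware this decision is made dynamically; the compile-time rewrite above just shows why the memory access is redundant.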

Wednesday, 22 January 2014

A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

Example of pipelining :

Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line can have only one of the three steps done at once. After the car has its engine installed, it moves on to having its hood installed, leaving the engine installation facilities available for the next car. The first car then moves on to wheel installation, the second car to hood installation, and a third car begins to have its engine installed. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only one car can be assembled at once would take 105 minutes. On the other hand, using the assembly line, the total time to complete all three is 75 minutes. At this point, additional cars will come off the assembly line at 20 minute increments.
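The arithmetic in the car example checks out and can be written down directly. Once the line is full, cars come off at the rate of the slowest stage, which is why the interval is 20 minutes:

```python
# Sequential vs pipelined completion time for 3 cars with stage times
# 20, 5 and 10 minutes (engine, hood, wheels).

stages = [20, 5, 10]
n_cars = 3

sequential = n_cars * sum(stages)                    # 3 * 35 = 105 minutes

# First car takes sum(stages); after that the slowest stage (the
# bottleneck) releases one car every max(stages) minutes.
pipelined = sum(stages) + (n_cars - 1) * max(stages)  # 35 + 2*20 = 75 minutes

print(sequential, pipelined)  # → 105 75
```

The same bottleneck reasoning explains why pipeline designers try to balance stage delays: the clock period is set by the slowest stage.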

There are two types of pipelining :-

1.Linear pipelining

In this, processing stages are placed in a cascade, so that the stages execute one after the other (each stage works on data handed over by the previous stage) to perform a fixed function over a stream of data.

They are static pipelines because they are used to perform fixed functions.

-> Synchronous linear pipeline

In this, every stage transfers data to the next stage at the same point of time (i.e. at the same clock pulse) using latches and flip-flops. The pipeline stages are combinational logic circuits, so it is desirable to have approximately equal delay in every stage.

-> Asynchronous linear pipeline

In this, it is not necessary that data transfer in every stage happens at the same point of time. Data flow between adjacent stages is controlled by a handshaking protocol.

Used in MPI (message passing interface).

Reservation table :

The reservation table mainly displays the time space flow of data through the pipeline for a function. Different functions in a reservation table follow different paths.

The number of columns in a reservation table specifies the evaluation time of a given function.

Determination of the clock cycle τ of a pipeline :

τ = max{τi} + d = τm + d    (maximum taken over stages i = 1 … k)

or τ ≈ τm if τm ≫ d

where τm = maximum stage delay,
      τi = time delay of stage Si,
      d  = time delay of a latch.

Pipeline frequency, f = 1/τ

Suppose we are not using pipelining, and we have n tasks and k stages.

Then, required clock cycles = nk

Total time required, T1 = nk · τ

By using pipelining,

Required clock cycles = k + (n − 1)

Total time required, Tk = [k + (n − 1)] · τ

Speedup factor, Sk = T1 / Tk = nk / [k + (n − 1)]

Efficiency, Ek = Sk / k = nk / k[k + (n − 1)] = n / [k + (n − 1)]

Throughput, Hk = number of tasks / total time = n / [k + (n − 1)]τ
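The formulas above can be packaged as a small function; the example values (n = 100 tasks, k = 5 stages, τ = 10 ns) are illustrative choices of mine, not from the notes:

```python
# Speedup, efficiency and throughput of a k-stage pipeline over n tasks.

def pipeline_metrics(n, k, tau):
    """n tasks through a k-stage pipeline with clock period tau."""
    t1 = n * k * tau                 # non-pipelined time, T1 = nk*tau
    tk = (k + (n - 1)) * tau         # pipelined time, Tk = (k+n-1)*tau
    speedup    = t1 / tk             # Sk = nk / (k + n - 1)
    efficiency = speedup / k         # Ek = Sk / k
    throughput = n / tk              # Hk = n / ((k + n - 1) * tau)
    return speedup, efficiency, throughput

s, e, h = pipeline_metrics(n=100, k=5, tau=10)
print(round(s, 2), round(e, 2))  # → 4.81 0.96
```

Note that as n grows, Sk approaches k and Ek approaches 1, which is the "potential speedup" claim below.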

The larger the number k of subdivided pipeline stages, the higher the potential speedup.

However, k cannot increase indefinitely due to practical constraints on cost, control complexity and circuitry.

PCR (performance to cost ratio) = f / (C + kh)

(where C = cost of all logic stages, h = cost of each latch, k = number of stages)

Using the PCR, the optimal number of stages can be found:

Optimal number of stages, k_opt = sqrt(t·C / (d·h))

(where t = total flow-through delay, d = latch delay)
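A quick numerical check of the k_opt formula; the parameter values here are made up purely for illustration:

```python
import math

# PCR-optimal stage count: k_opt = sqrt(t*C / (d*h)), where t is the
# total flow-through delay, C the logic cost, d the latch delay and
# h the latch cost.

def optimal_stages(t, C, d, h):
    return math.sqrt(t * C / (d * h))

print(round(optimal_stages(t=128, C=64, d=2, h=4)))  # → 32
```

Intuitively, expensive or slow latches (large h or d) push the optimum toward fewer, coarser stages.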

2.Non linear/Dynamic Pipelining

A non-linear pipeline (also called a dynamic pipeline) can be configured to perform various functions at different times. In a dynamic pipeline there are also feed-forward or feedback connections.

Reservation Table/State Time diagram

The reservation tables of non-linear pipelines are more interesting, as they don't follow a linear pattern. A reservation table displays the time-space flow of data for the evaluation of one or more functions. The checkmarks in each row correspond to the time instants at which that particular stage is used.

There may be multiple checkmarks in a row, indicating repeated use of the same stage in different cycles.

STATE-TIME DIAGRAM

Let the sequence of stages in a task be S1 → S2 → S3 → S2 → S3 → S1 → S3 → S1. Its state-time diagram (reservation table) is:

Cycle →   1    2    3    4    5    6    7    8
S1        X                        X         X
S2             X         X
S3                  X         X         X

X INDICATES THE ACTIVE STAGE IN A GIVEN CLOCK CYCLE

If two or more processes attempt to use the same pipeline stage at the same time, a collision (resource conflict) occurs.

To resolve collisions, some scheduling is done.

Latency

Latency is defined as the number of clock cycles between two initiations of a pipeline, i.e. the number of clock cycles after which the next task can be started.

Latencies that cause collisions are called forbidden latencies.

Forbidden latencies are found by checking the distance between any two checkmarks in the same row of the reservation table.

For example, in the above table the forbidden latencies are 2, 4, 5 and 7.
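This check is easy to automate. A sketch, assuming the reservation table implied by the stage sequence above (S1 used in cycles 1, 6, 8; S2 in cycles 2, 4; S3 in cycles 3, 5, 7):

```python
# A latency is forbidden if it equals the distance between two
# checkmarks in the same row of the reservation table.

TABLE = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}

def forbidden_latencies(table):
    forbidden = set()
    for cycles in table.values():
        for i, a in enumerate(cycles):
            for b in cycles[i + 1:]:
                forbidden.add(b - a)   # distance between two uses of a stage
    return sorted(forbidden)

print(forbidden_latencies(TABLE))  # → [2, 4, 5, 7]
```

Row S1 contributes distances 5, 7 and 2; row S2 contributes 2; row S3 contributes 2 and 4, giving the set {2, 4, 5, 7}.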

Let m be the maximum forbidden latency, p a permissible latency (a value at which no collision occurs), and n the total number of clock pulses (columns) in the reservation table. Then

m ≤ n − 1 and 1 ≤ p ≤ m − 1.

Collision vector

A collision vector is an m-bit binary vector C = (Cm … C2 C1), where Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible.

For example, for the above reservation table the forbidden latencies are 2, 4, 5 and 7, so C = (1011010).

A latency cycle is a latency sequence that repeats the same subsequence indefinitely.

For example, one latency cycle in the above example could be 2, 5, 2, 5, 2, 5, …

This implies that successive initiations of new tasks are separated by 2 and 5 cycles alternately.

Average latency

The average latency is obtained by dividing the sum of the latencies by the number of latencies in the cycle.

The average latency of the cycle 2, 5, 2, 5, 2, 5, … would be (2 + 5)/2 = 3.5.

Constant cycle

Constant cycle is a latency cycle which contains only one latency value.

State Diagrams

State diagrams specify the permissible state transitions among successive initiations.

The next state is obtained by ORing the initial collision vector with the current state shifted right by the chosen latency.
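The transition rule can be sketched with the collision vector from the earlier example (forbidden latencies 2, 4, 5, 7). Bit i−1 of the integer represents latency i; the encoding as a Python int is my own convenience:

```python
# State transitions: after initiating a task at permissible latency p,
# next_state = (state >> p) | C, where C is the initial collision vector.

C = 0b1011010        # forbidden latencies {2, 4, 5, 7}, bit i-1 = latency i
m = 7                # maximum forbidden latency

def permissible(state):
    """Latencies 1..m whose bit is 0 in the current state."""
    return [p for p in range(1, m + 1) if not (state >> (p - 1)) & 1]

def next_state(state, p):
    assert not (state >> (p - 1)) & 1, "latency %d would collide" % p
    return (state >> p) | C

s = C                     # initial state after the first initiation
print(permissible(s))     # → [1, 3, 6] (any p > m is also safe)
print(bin(next_state(s, 3)))
```

Enumerating these transitions from the initial state yields the full state diagram, from which greedy cycles and the minimum-average-latency cycle can be read off.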

NOTE :

Supercomputers in India

According to recent news, India's PARAM is listed among the world's most power-efficient supercomputers.

The Centre for Development of Advanced Computing (C-DAC) said its supercomputer, PARAM Yuva II, was ranked first in India in the Green500 list of supercomputers in the world. PARAM Yuva II is ranked number 9 in the Asia-Pacific region and stands at 44th place worldwide among the most power-efficient systems, as per the list announced on November 20 at the Supercomputing Conference (SC'2013) in Denver, Colorado, USA.

C-DAC is the 2nd organisation in the world to have carried out the level-3 measurement of power versus performance for the Green500 list, which is the most rigorous level of measurement performed for such a ranking.

The project is meant to create and promote healthy competition among the supercomputing initiatives in India and can lead to significant supercomputing advancement in the nation. It lists the top supercomputers in India with regular updates.

Wednesday, 15 January 2014

A Chinese university has built the world's fastest supercomputer, almost doubling the speed of the U.S. machine that previously claimed the top spot and underlining China's rise as a science and technology powerhouse.

FEATURES

->Developed by the National University of Defense Technology in central China's Changsha city. It is capable of sustained computing of 33.86 petaflops (quadrillions of calculations per second).

"Most of the features of the system were developed in China, and they are only using Intel for the main compute part," TOP500 editor Jack Dongarra, who toured the Tianhe-2 facility in May, said in a news release. "That is, the interconnect, operating system, front-end processors and software are mainly Chinese."

->CHINA- A SUPERCOMPUTING POWER

This computer has made China a recognized supercomputing power, leaving everyone else behind.

->HUGE EFFORT

It was developed by a team of 1300 scientists and engineers("such a huge effort").

With 16,000 computer nodes, each comprising two Intel Ivy Bridge Xeon processors and three Xeon Phi chips, it represents the world's largest installation of Ivy Bridge and Xeon Phi chips, counting a total of 3,120,000 cores.

2.MEMORY

Each of the 16,000 nodes possesses 88 gigabytes of memory.

The total CPU plus coprocessor memory is 1,375 TiB.

3.POWER

The system itself draws 17.6 megawatts of power; including external cooling, it draws an aggregate of 24 megawatts.

4.SPACE

The computer complex would occupy 720 square meters of space.

ARE THE TOOLS WHICH MEASURE THE PERFORMANCE OF SUPERCOMPUTERS FULLY RELIABLE?

A team led by a professor from Germany's University of Mannheim compiles the Top500 list twice a year; in the latest list, the five fastest supercomputers remained unchanged compared to the list released in June.

Per the Linpack benchmark test, Intel-powered Tianhe-2 is able to operate at 33.86 petaflop/sec, which is equivalent to 33,863 trillion calculations per second. Its closest competitors were Cray Inc's Titan with 17.59 petaflop/sec and IBM's Sequoia with 17.17 petaflop/sec.

The only change near the top was Switzerland's new Piz Daint supercomputer, which made it to the sixth spot with 6.27 petaflop/sec.

The Linpack benchmark test measures how quickly computers can crack a special type of linear equation to determine its speed. However, the benchmark does not take into consideration factors like the speed with which data can be transferred from one area of the system to another. This factor can influence the real world performance of the device.

"A very simple benchmark, like the Linpack, cannot reflect the reality of how many real applications perform on today's complex computer systems," said Erich Strohmaier. "More representative benchmarks have to be much more complex in their coding, their execution and in how many aspects of their performance need to be recorded and published. This makes understanding their behaviour more difficult."

IBM created five of the 10 fastest supercomputers, and the head of the computational sciences department at IBM's Zurich research lab, Dr Alessandro Curioni, said that the manner in which the list is calculated needs to be updated. He voiced the same concern at a conference in Denver, Colorado, held this week.

"The Top500 has been a very useful tool in the past decades to try to have a single number that could be used to measure the performance and the evolution of high-performance computing," notes Dr Curioni. "[But] today we need a more practical measurement that reflects the real use of these supercomputers based on their most important applications."

Tianhe-2 has been developed by China's National University of Defence Technology (NUDT) and has been set up in the National Supercomputer Center in Guangzhou.