Pipelining

I have just begun reading a PDF on Assembly language. One of the terms it mentioned and encouraged the reader to research further is "pipelining". I looked this up and read a brief description on Wikipedia; a section of the article stated: "In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one."

I am not completely clear on what "so that the output of one element is the input of the next one" means. Does this refer to a section of data being output from one element and then absorbed by another for further processing/transfer, or is it possibly referring to one pipeline element outputting data and then being readied to take in and process new data in an "opposite" direction: one dataset out, one dataset in?

What they want you to do is take the CPU's pipeline into consideration when writing code. On the newer x86 CPUs (486 onward) you can get significant speed boosts by ordering instructions so that stages of the pipeline don't stall waiting for results from other stages.

ex.

xor cx,cx           ; CX = 0
mov something, cx   ; store CX
mov di, whatever    ; load DI
mov ax,[di]         ; DI used as an address right after being loaded

The load of DI and its use in the very next instruction are in close proximity, and that can cause a stall. The code will work; it'll just be somewhat slower than something like this:

mov di, whatever    ; load DI early
xor cx,cx           ; unrelated work fills the gap
mov something, cx
mov ax,[di]         ; DI has had time to become available

Here the CX ops were something that had to happen anyway, and interposing them between the DI load and its use doesn't change the semantics of the code.
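To put rough numbers on the reordering above, here is a toy model in Python. It is only a sketch under loud assumptions: one instruction issues per cycle, and a register can't be used for addressing until `address_delay` cycles after the instruction that wrote it issued. The function name, the tuple layout and the delay of 2 are all made up for illustration; real penalties depend on the specific CPU.

```python
def count_cycles(program, address_delay=2):
    """Count issue cycles for a toy in-order pipeline.

    program: list of (written_register, address_register) tuples, one per
    instruction; either entry may be None.
    Assumption: a register written by an instruction issued at cycle c can
    first be used for addressing at cycle c + address_delay.
    """
    ready_at = {}  # register -> first cycle it may be used as an address
    cycle = 0
    for written, address_reg in program:
        start = cycle
        if address_reg is not None:
            # Stall until the address register is available.
            start = max(start, ready_at.get(address_reg, 0))
        if written is not None:
            ready_at[written] = start + address_delay
        cycle = start + 1
    return cycle


# (written register, register used for addressing)
ORIGINAL = [
    ("cx", None),   # xor cx,cx
    (None, None),   # mov something, cx
    ("di", None),   # mov di, whatever
    ("ax", "di"),   # mov ax,[di]   <- needs DI immediately
]
REORDERED = [
    ("di", None),   # mov di, whatever
    ("cx", None),   # xor cx,cx
    (None, None),   # mov something, cx
    ("ax", "di"),   # mov ax,[di]   <- DI written three instructions ago
]

print(count_cycles(ORIGINAL))   # 5 (one stall cycle)
print(count_cycles(REORDERED))  # 4 (no stall)
```

Same four instructions, same result, one cycle saved in this model just by moving the DI load earlier.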

Another consideration is flag setting and usage. Newer CPUs have branch prediction logic that will try to prefetch cache lines of the most likely path to be taken. If you give the pipeline some warning about the conditions of a conditional branch, that branch can happen faster.

ex.

mov cx,1234
mov dx,4567
add ax,something    ; sets ZF
jnz someplace       ; tests ZF immediately

Setting ZF right before the jump doesn't give the branch prediction much help. If you can interpose some other instructions between the setting of ZF and its use, the branch predictor can do a better job. Naturally, those instructions had better not be ones that whack ZF. The MOVs are fine here, since MOV doesn't touch the flags.

add ax,something    ; sets ZF
mov cx,1234         ; MOV leaves the flags alone
mov dx,4567
jnz someplace       ; ZF was set two instructions ago

Now the pipe and branch predictors can notice that ZF doesn't change between being set and being tested, and they've got two instructions' worth of time to start fetching the code at "someplace" in the background.

These sorts of optimization "rules" have kept changing across different generations of CPUs, so it's hard to say what is "best" anymore unless you have a specific target in mind.

Purple Avenger's reply mentions some good points, but doesn't describe how pipelining actually works. If you think of a basic (naive) design for a processor, you'll probably have it read an instruction from instruction memory (or cache), parse the instruction, read register values from the register file, send those values to an ALU, send the ALU result to a stage for memory I/O, and send the result from there (unchanged if it isn't a memory op) back to the register file to be saved. The total time for all of this can be quite long.
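To tie this back to the original question ("the output of one element is the input of the next one"), here is a minimal Python sketch of stages chained in series. The three stage functions are arbitrary stand-ins, not real CPU stages, and `run_pipeline` is a name invented for this example. On each clock tick every stage takes what the stage before it produced on the previous tick, so data only ever flows one way.

```python
def run_pipeline(stages, inputs):
    """Clock a chain of stages; latches[i] holds the output of stage i.

    Returns (results, trace), where trace[t] is a snapshot of all the
    latches after clock tick t. None marks an empty slot (a bubble).
    """
    latches = [None] * len(stages)
    results = []
    trace = []
    # Trailing bubbles let the last real items drain out of the pipe.
    feed = list(inputs) + [None] * len(stages)
    for item in feed:
        # Whatever sits in the last latch leaves the pipeline.
        if latches[-1] is not None:
            results.append(latches[-1])
        # Update back to front so each stage reads the PREVIOUS stage's
        # output from the previous tick.
        for i in range(len(stages) - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = stages[i](prev) if prev is not None else None
        latches[0] = stages[0](item) if item is not None else None
        trace.append(list(latches))
    return results, trace


# Hypothetical stand-in stages.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, trace = run_pipeline(stages, [1, 2, 3])

print(results)   # [1, 3, 5]
print(trace[2])  # [4, 6, 1] -> three different items in flight at once
```

The third snapshot is the interesting one: the first item is finishing in the last stage while the second and third are still working their way through the earlier stages.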

Now suppose that we know that the ALU takes 1/3 of the total time to handle one instruction (this is a fictional example, btw). If we cut the processor into independent sections as mentioned above, with state registers for passing values between them, we can reduce our clock period to 1/3 of its original length (or triple the frequency), since that's the longest time any single stage needs. Any given instruction will take just as long to compute; however, you can have n instructions in progress at once along the pipeline (where n is the number of pipeline stages). The end result is higher throughput.
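The arithmetic behind that can be sketched in a few lines of Python. The numbers are just as fictional as the example itself: I'm assuming the work splits into three balanced stages of 3 time units each (so the ALU stage is 1/3 of the total of 9), and the function names are invented here.

```python
def unpipelined_time(n_instructions, stage_times):
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * sum(stage_times)


def pipelined_time(n_instructions, stage_times):
    """The clock period is set by the slowest stage. Once the pipe is full,
    one instruction completes per clock."""
    period = max(stage_times)
    n_stages = len(stage_times)
    return (n_stages + n_instructions - 1) * period


stage_times = [3, 3, 3]  # assumed: three balanced stages, 9 time units total

print(pipelined_time(1, stage_times))      # 9   (one instruction is no faster)
print(unpipelined_time(100, stage_times))  # 900
print(pipelined_time(100, stage_times))    # 306
```

A single instruction still takes 9 time units either way, but 100 instructions finish in 306 instead of 900, which approaches the 3x you'd expect from tripling the clock. If the stages aren't balanced, the slowest one sets the clock and the gain shrinks.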

For crazy designs (aka whatever's in your box, most likely), there will be many pipelines and multiple data paths, each of which multiplies computing output as well as design complexity.

There's more info from Wikipedia but I'll admit I didn't actually read through all of that.
