The following is a product proposal describing a port of our optimizer to a new processor (a description of the existing adaptations of our optimization technology is available on request). Please contact us if you would like to discuss porting our optimization technology to a processor of your choice.

Modern DSPs and CPUs require a tight interface between the compiler and the hardware architecture in order to achieve high utilization of the available resources. Compilers have not shown the ability to meet the needs of programmers on critical code. As architectures scale up, they are becoming so complex that human programmers cannot deal with the scheduling and tracking of so many registers and execution units. The result may be an architecture that cannot be programmed to apply most of its resources to real-life algorithms.

The goal of this project is to develop Code Optimizer (dco), an optimizing code system for a Digital Signal Processor or CPU (referred to as the target). This will be a software package specifically designed to optimize the target code by taking full advantage of the options and features provided by the target microprocessor.

The dco will be used to optimize code generated by a compiler. The programmer uses a compiler (C/C++, FORTRAN, etc.) to translate source code into the target's assembly code. This code is then used as the input to dco. The output generated by dco will be highly optimized target assembly code that is logically identical to the original; dco will rearrange the existing code, performing multi-issue optimization, loop unrolling and vectorization, reassigning the available registers, etc. To create the final object file, the generated code should be assembled.

Note that dco will not require preprocessing or any other involvement from the user. It will be fully automated, and it will be possible to incorporate dco into makefiles or other product generation tools.

The use of dco will greatly improve the quality of the generated code. It may therefore prove to be a vital contribution to the production of a winning solution for your Digital Signal Processor or CPU.

dco will be a software package that optimizes target code by taking full advantage of the options and features provided by the target microprocessor. The following is a description of some of the optimization techniques that will be provided by the package.

Auto parallelization: Auto parallelization utilizes the multiple execution units ("cores") found on many modern processors. It is capable of identifying code patterns that are suitable for parallelization and of creating optimized code that will be executed by all the available cores.

Loop unrolling: Loop unrolling duplicates the body of a loop a specified number of times, creating a piece of code with more opportunities for optimization.

Vectorization: Vectorization overlaps epilog code execution for the current loop iteration with prolog code execution for the next loop iteration.

Multi-issuing: Multi-issuing attempts to utilize the Instruction Level Parallelism (ILP) of the target processor, leading to more than one instruction being executed per instruction cycle.

Art substitution: Art substitution replaces parts of the code with logically equivalent code sequences. The generated code is then evaluated in order to find the optimal one.

By default, dco will perform most of the available optimizations. It will be possible to enable or disable any number of the optimization techniques.

dco will operate on basic blocks, although most of the optimizations and scheduling will be done across blocks. A basic block is a sequence of the target's non-branching instructions in which the flow of control enters at the top. dco will treat as a basic block any sequence of instructions preceded and/or followed by a label or a branch instruction.

After scanning the input target assembly source, dco will perform a comprehensive data flow analysis, calculating the resources needed and available for execution of the instructions of the code. It will then extract the basic block(s) to be optimized.

When a basic block is selected, static optimization will be performed and a dynamic memory map will be built. The code generated by the static optimization will then be used to build the data flow graph.

From the data flow graph, dco will perform art substitutions and resource dependency reduction, and will generate patterns of code that are organized into multi-instruction units. dco will attempt to generate the fastest assembly code that is logically identical to the input code.

At this stage it is difficult to come up with reasonable estimates of dco's performance. However, this package would be based on the technology developed and implemented for Intel's i860 RISC processor (dco860), DEC's Alpha AXP RISC processor (ago), Analog Devices' ADSP-2106x (SHARC) family of DSPs (compactor), Analog Devices' TS00x (tigerSHARC) DSP (dco), Freescale's StarCore DSP (sco) and, recently, the x86 family of processors (dco). Modern DSPs/CPUs (as powerful as they are) contain little that hasn't already been successfully handled in the series of optimizers implemented so far (i860, Alpha and SHARC). All of the supported CPUs provide multifunctional units that allow more than one instruction to be executed per cycle (i860: 2; Alpha and SHARC: 4); Alpha, SHARC and x86 support conditional instruction execution; the i860 supports data types contained across two registers; tigerSHARC and x86 support SIMD; etc.

All of the implemented code optimizers achieve significant code improvements on a variety of applications under different optimizing compilers, as summarized in the following table.

The development of the product will be done using our resources. You will provide the target development package (compiler, assembler, linker, documentation, etc.) and a target platform and/or simulator on which to execute code.

Dynamic memory disambiguation is one of the most powerful optimizations of the inner loop body supported by dco. This technique allows memory conflicts in the code to be resolved at the time of program execution (dynamically). To achieve that, dco generates two versions of the code: one assuming that memory conflicts are not resolved, and a second assuming that memory conflicts are resolved (which is usually much more efficient). At run time, depending on the actual setting of the memory pointers, the appropriate code is executed.

As an example, consider the kernel of the linpack benchmark suite (called daxpy):

for ( i = 0; i < n; i++ )
{
    dy[i] = dy[i] + da*dx[i];
}

When compiled by the Alpha AXP compiler, the following code is generated (fully optimized):

$43:
        ldt     $f1, 0($18)
        mult    $f17, $f1, $f1
        ldt     $f10, 0($20)
        addt    $f10, $f1, $f10
        addl    $2, 1, $2
        addq    $18, 8, $18
        cmplt   $2, $16, $1
        stt     $f10, 0($20)
        addq    $20, 8, $20
        bne     $1, $43
$41:

Unrolling this code by a factor of 2 and performing dynamic memory disambiguation, dco produces the following result:

$43:
        lda     $1, 1
        addl    $2, $1, $1
        subl    $1, $16, $1
        blt     $1, .lualpha_41
        .align  4
.lualpha_42:
        ldt     $f1, 0($18)
        mult    $f17, $f1, $f1
        ldt     $f10, 0($20)
        addt    $f10, $f1, $f10
        addl    $2, 1, $2
        addq    $18, 8, $18
        cmplt   $2, $16, $1
        stt     $f10, 0($20)
        addq    $20, 8, $20
        bne     $1, .lualpha_42
        br      $31, .lualpha_45
.lualpha_41:
        lda     $1, 15
        addq    $18, $1, $1
        subq    $1, $20, $1
        bge     $1, .mcalpha_48
        .align  4
.wlalpha_49:
        ldt     $f12, 0($18)
        ldt     $f1, 8($18)
        mult    $f17, $f12, $f12
        mult    $f17, $f1, $f1
        ldt     $f11, 0($20)
        ldt     $f10, 8($20)
        addt    $f11, $f12, $f11
        addt    $f10, $f1, $f10
        addl    $2, 2, $2
        addl    $2, 1, $1
        addq    $18, 16, $18
        subl    $1, $16, $1
        stt     $f11, 0($20)
        stt     $f10, 8($20)
        addq    $20, 16, $20
        blt     $1, .wlalpha_49
        br      $31, .mlalpha_54
.mcalpha_48:
        lda     $1, 1
        addq    $18, $1, $1
        subq    $1, $20, $1
        ble     $1, .mcalpha_55
        .align  4
.wlalpha_56:
        ldt     $f12, 0($18)
        ldt     $f1, 8($18)
        mult    $f17, $f12, $f12
        mult    $f17, $f1, $f1
        ldt     $f11, 0($20)
        ldt     $f10, 8($20)
        addt    $f11, $f12, $f11
        addt    $f10, $f1, $f10
        addl    $2, 2, $2
        addl    $2, 1, $1
        addq    $18, 16, $18
        subl    $1, $16, $1
        stt     $f11, 0($20)
        stt     $f10, 8($20)
        addq    $20, 16, $20
        blt     $1, .wlalpha_56
        br      $31, .mlalpha_54
.mcalpha_55:
        .align  4
.wlalpha_61:
        ldt     $f12, 0($18)
        mult    $f17, $f12, $f12
        ldt     $f11, 0($20)
        addt    $f11, $f12, $f11
        stt     $f11, 0($20)
        ldt     $f1, 8($18)
        mult    $f17, $f1, $f1
        ldt     $f10, 8($20)
        addt    $f10, $f1, $f10
        addl    $2, 2, $2
        addl    $2, 1, $1
        addq    $18, 16, $18
        subl    $1, $16, $1
        stt     $f10, 8($20)
        addq    $20, 16, $20
        blt     $1, .wlalpha_61
        br      $31, .mlalpha_54
.mlalpha_54:
        addl    $2, $31, $1
        subl    $1, $16, $1
        blt     $1, .lualpha_42
.lualpha_45:
$41:

The unrolled loop:

.wlalpha_61:
        ldt     $f12, 0($18)
        mult    $f17, $f12, $f12
        ldt     $f11, 0($20)
        addt    $f11, $f12, $f11
        stt     $f11, 0($20)
        ldt     $f1, 8($18)
        mult    $f17, $f1, $f1
        ldt     $f10, 8($20)
        addt    $f10, $f1, $f10
        addl    $2, 2, $2
        addl    $2, 1, $1
        addq    $18, 16, $18
        subl    $1, $16, $1
        stt     $f10, 8($20)
        addq    $20, 16, $20
        blt     $1, .wlalpha_61

has a memory conflict:

        stt     $f11, 0($20)
        ldt     $f1, 8($18)

The generated code resolves this conflict by producing one version of the loop under the assumption that the memory conflict does not exist (the loop labeled .wlalpha_49; note that all the memory reads (ldt) are performed before the memory writes (stt)) and another version without that assumption (the loop labeled .wlalpha_61; note that the order of the conflicting instructions is preserved: ldt $f1, 8($18) follows stt $f11, 0($20)).

In most (if not all) runs of the linpack benchmark, the code labeled .wlalpha_49 will be executed, bringing the performance improvement over the original code to 40%.

Vectorization is another powerful optimization of the inner loop body. It overlaps epilog code execution for the current loop iteration with prolog code execution for the next loop iteration. Essentially, two consecutive loop iterations are fitted into a block of the size of one loop iteration (of size n): m instructions are chosen from the bottom of the first iteration and combined with n - m instructions from the top of the second iteration. The generated m + (n - m) = n instructions are optimized. This is done for all m from 1 to n - 1, and the best resulting code is chosen. Of course, all checks necessary to preserve the logic of the code are performed by dco and reflected in the resulting code.