High-level synthesis, EDA tools, compilers and more

Some time ago I did a small experiment to automatically generate RTL from GMP (GNU multi-precision integer) programs (as in gmplib.org). These programs use API calls to perform arbitrary-precision integer arithmetic for use in cryptography, in conducting or supporting mathematical proofs, and in scientific computing in general.

I have developed a concept implementation that is part of HercuLeS HLS (http://www.nkavvadias.com/hercules/). Essentially it is a set of additional features to HercuLeS. I actually used the fgmp (public domain) implementation and API, due to GMP's licensing. The extensions (coded within a few days or so) did not yet generate synthesizable code, but I was able to generate simulation HDL (VHDL) with GMP API calls as a proof-of-concept.

In order to support the basic GMP (multi-precision integer) API/DSL for just addition and multiplication, the following additions were needed:

A design approach that is widely used by HLS (high-level synthesis) tools but is not really advertised loud and proud to HLS users is the Finite-State Machine with Datapath, aka FSMD. For instance, the Wikipedia entry on FSMDs is really sketchy. FSMDs are the primary approach for dealing with generic/control-flow dominated codes in an HLS context.

An FSMD is a microarchitectural paradigm for implementing non-programmable/hardwired processors with their control unit and datapath combined/merged. In FSMDs, the datapath actions are embedded within the actual next-state and output logic decoder of your FSM description. From an RTL (Register Transfer Level) abstraction point of view, you can view an FSM as comprising:

a current state logic process for updating state and register storage

a next state logic process for calculating the subsequent state to transition

HercuLeS' FSMDs are based on Prof. Gajski's and Pong P. Chu's work, mostly on some of their books and published papers. When I started my work on HercuLeS, I had borrowed a couple of Gajski's books from the local library and had actually bought two of P.P. Chu's works; the RTL Hardware Design Using VHDL book is highly relevant. Gajski's work on SpecC and the classic TRs (technical reports such as Modeling Custom Hardware in VHDL) from his group were at some point night (by the bed) and day (by the desk) readings…

I believe Vivado HLS (aka AutoESL/xPilot) and the others do the same thing, following a very similar approach, with one key difference in how the actual RTL FSMD code is presented. Their datapath code is implemented with concurrent assignments, and there are lots of control and status signals going in and out of the next-state logic decoder. On the contrary, I prefer datapath actions embedded within the state decoding; this produces slightly slower and marginally larger hardware overall, but the user's intention in the RTL is much clearer and it is easier to grasp and follow.

In an FSMD, the key notion is understanding how the _reg and _next signals work, as they represent a register: its currently accessible value and the value that is going to be written into it. Essentially, _reg and _next are what you would see if probing the register's output and input ports at any time.

If following the basic principles from Pong P. Chu, every register is associated with a _reg and a _next signal. Some advice:

Have a _reg and a _next version of each register, declared as signals in the VHDL code.

In each state, read all needed _reg signals and assign all needed _next ones.

Do not reassign the same _next version of a register within a single FSMD state.

You can totally avoid variables in your code. Not all tools provide equally mature support for synthesizing code with variables.

Operation chaining is possible but requires that you write _next versions and read them in the same state. These then act as plain wires and do not implement registers. Again, you cannot assign the same _next version more than once in the same state.

YARDstick is a design automation tool for custom processor development flows that focuses on the hard part: generating and evaluating application-specific hardware extensions. YARDstick is a powerful building block for ASIP (Application-Specific Instruction-set Processor) development, since it integrates application analysis and ultra-fast algorithms for custom instruction generation and selection with user-defined compiler intermediate representations. It integrates retargetable compiler features for the targeted IRs/architectures.

Features

Important features of YARDstick are as follows:

retargetable to user-defined IRs via a machine description

can be targeted to low-level compiler IRs, assembly-level representations of
virtual machines, or assembly code for existing processors

YARDstick comes along with a cross-platform GUI written in Tcl/Tk 8.5 (http://wiki.tcl.tk).

Aim

The ultimate goal of YARDstick is to liberate the designer’s development infrastructure from compiler and simulator idiosyncrasies. With YARDstick, the ASIP designer is empowered
with the freedom of specifying the target architecture of choice and adding new implementations of analyses and custom instruction generation/selection methods.

Typically, 2x to 15x speedups for benchmark applications (optimized ANSI C source code) can be obtained fully automatically by using YARDstick, depending on the target architecture. Speedups are evaluated against a scalar RISC architecture.

[I was an early PhD student at the time and was quite impressed with what ArchC had to offer; I still am]

What ArchC is all about

ArchC is an architecture description language that exploits the already mature SystemC legacy. It targets the automated generation of a) functional and cycle-accurate instruction set simulators and, at a later stage, b) the entire application development toolchain for a custom processor (compiler, assembler, linker, debugger, etc.).

Simulator generation is at a very good and usable state; however, it needs to be extended towards the system level (bus generation?). Toolset generation, which naturally comes next, requires harder effort and therefore the appropriate support of the open-source community. In my personal opinion, compiler generation is so hard that it can only be possible through funding. Other issues such as automated bus generation are also very HARD and need support.

Strengths of ArchC

Important points regarding ArchC are:

Nice integration with SystemC. The ArchC user codes instruction decoding, hardware resources, etc. in small ArchC files and the corresponding SystemC code is automatically generated, which removes a heavy burden from the user.

Unique approach to the design of a retargetable system call library. This makes it possible to benchmark real-sized applications with realistic demands, i.e. many input arguments. Many of the other simulators can't cope with that.

How has ArchC helped me in my research

Ease of designing variations of a base processor. Most of my papers require tradeoff analysis on alternative architectures.

My "drproc" model is a 5-stage pipelined RISC processor with specialized instructions for data-reuse transformations [work of Catthoor et al., IMEC]. I originally had coded an assembler (~a few days) and the entire processor in plain SystemC (~a month). When I found out about ArchC, it took only 3-4 8-hour days to deliver to my colleagues a much more robust cycle-accurate simulator!

I have run reasonably sized applications (e.g. one SourceForge application of 25K C lines with intensive use of pointers) with no problem on the R3000 and MIPS models of ArchC.

What needs to be done

NOTE: In the most part, the roadmap has been completed!

This work will be incomplete unless the “ArchC roadmap” is fulfilled. While I believe the compiler retargeting is a very hard issue, [addressing UNICAMP team] I am certain that your team is experienced in DSP compilation techniques (I have read some of the papers). Certainly, assembler generation, and bus wrapper generation (e.g. for OCP) plus some more models are awaited by many users, including myself.

This free/demo version has generated substantial user feedback. Dear users, wherever you are located (US, Canada, Japan, P.R. China, Sweden, Germany, UK, India, Brazil; the list is not exhaustive by any means) thank you-thank you-thank you!!!

I have been compiling a short list of (let’s say) the “Top 12” of user feedback patterns. Focusing on the more generic points, I am disclosing the list as food-for-thought and to generate even more feedback!

Transparent interface to the logic synthesis tool and up to downloading the bitstream. (The prototype version works and has been checked for my development boards.)

Plug-in approach to interconnecting legacy reusable designs with HLS-generated designs (“I have this piece of code in idiosyncratic C flavor or assembly or FORTRAN…”). Looks like a call for implementing point solutions for select (aka paying) users.

Pthread or OpenMP frontend support. Explicit parallelism if you please! (My bet is with parallelism extraction but I get this important point)

VHDL frontend support (!) for behavioral VHDL to netlist HDL end-to-end flow. (There exist hard-to-the-core developers — and greatness is with them — that do their algorithmic exploration in behavioral VHDL. I usually do my own in C or VHDL and only lately I have been increasingly using MATLAB and Processing.)

Can you synthesize malloc and free? (Yes I can, and I am improving it, since until now I had been *intercepting* malloc and free and mapping them to a hardware-managed stack.)

There exists a saying about pretty much every new disruptive technology and so it goes for high-level synthesis. Maybe you have heard it before reading these lines; as an ingredient to a pitch, on colleagues’ coffee break, at a press conference or maybe you have found yourself thinking more or less of this: “When I become proficient with this XYZ HLS tool, I will be 10x more productive! I will be able to fast-forward this task, put a smile on my supervisor’s face; earn a good deal of quality time too!”

Well this reads as a dream-come-true, but is it reasonable to believe in it? Is it already possible, or will it become possible? Being an order of magnitude more productive; is it myth, legend or a factual thing?

HLS attempts to eliminate the tedious iterations at the RTL (primarily). This is done by having the developer describe the intent or behavior of the system at an increased level of abstraction. There is no need to reason about each FSM state or exact timing, or to take every stepwise, error-prone decision. Ideally, by giving the algorithm and a set of requirements to the HLS tool, you can effectively get a correct design with the desired QoR. Or get a no-go.

To add some data into the soup, let’s examine the following figure.

For these ten algorithms, ANSI C code is passed to HercuLeS HLS ( http://www.nkavvadias.com/hercules/ ). At the end of the flow, among other things, we get the synthesizable RTL VHDL for the given C code. By measuring the VHDL-to-C ratio of useful lines of code (no comments etc.), a 7.8x to 25.15x variation with an average of about 15 times (15.4x) is obtained.

This is by no means conclusive but it is indicative. HercuLeS takes your code through a series of transformations; from C to N-Address Code (intermediate language) to graph-based representation (in Graphviz) and finally to RTL VHDL. Every step is from a higher abstraction level to a lower one and every step results in a more detailed representation as it moves from software to hardware description.

This approach obviously sounds like the correct one. But bear in mind the potential fallacy: 10x less code to write does not necessarily yield 10x less time to develop it. What must also be accounted for are:

Iterations in developing the C code itself.

Multiplicity in the number of generated RTL designs, as points in the architectural design space. These iterations are sloooow, so you have to keep them a) few, b) short. For this reason, it is much better that they should not happen at the RTL, but rather at the level of an IL (intermediate language) estimator/profiler like aprof: http://www.nkavvadias.com/doc/aprof/aprof-README.html .

RTL design is only a fraction of concept-to-implementation. As cliche as it may read, this is Amdahl alright.

Coding for parallelism, whether implicit or explicit, is challenging in the sense of near-optimal scaling for regular or irregular applications and adequacy of the underlying hardware paradigm.

And the verdict: tenfold increase in productivity is a FACT, and you should be able to experience it, if you do it the RIGHT way!

As a follow-up to a previous blog post on an out-of-the-box Vivado HLS vs HercuLeS comparison, the following table provides updated information on the performance of HercuLeS against Vivado HLS 2013.2 on Virtex-6 and Kintex-7 (XC7K70TFBG676-2 FPGA device).

Better results (lower execution time; smaller area) have been typeset in bold. It can be clearly seen that HercuLeS outperforms Vivado HLS in key benchmarks such as filtering and numerical processing. As expected, on many occasions better speed/performance can be traded off for lower area. With 12 partial wins each, one could call this a tie.

| # | Benchmark | Description | VHLS LUTs | VHLS Regs | VHLS TET (ns) | HercuLeS LUTs | HercuLeS Regs | HercuLeS TET (ns) | Device |
|---|-----------|-------------|----------:|----------:|--------------:|--------------:|--------------:|------------------:|--------|
| 1 | bitrev | Bit reversal | 67 | 39 | 72.0 | 42 | 40 | 11.6 | Virtex-6 |
| 2 | divider | Radix-2 division | 218 | 226 | 63.6 | 318 | 332 | 30.6 | Kintex-7 |
| 3 | edgedet | Edge detection | 246 | 130 | 1636.3 | 680 | 361 | 1606.4 | Virtex-6; 1 BRAM for VHLS |
| 4 | fibo | Fibonacci series | 138 | 131 | 60.2 | 137 | 197 | 102.7 | Virtex-6 |
| 5 | fir | FIR filter | 89 | 114 | 1027.1 | 606 | 540 | 393.8 | Kintex-7 |
| 6 | gcd | Greatest common divisor | 210 | 98 | 35.2 | 128 | 93 | 75.9 | Virtex-6 |
| 7 | icbrt | Cubic root approximation | 239 | 207 | 260.6 | 365 | 201 | 400.5 | Virtex-6 |
| 8 | sieve | Prime sieve of Eratosthenes | 525 | 595 | 6108.4 | 565 | 523 | 3869.5 | Virtex-6; 1 BRAM for VHLS |

NOTES:

TET is Total Execution Time in ns.

VHLS is a shortened form for Vivado HLS.

Vivado HLS 2013.2 was used.

Bold denotes smaller area and lower execution time.

Italic denotes an inconclusive comparison.

For the cases of edgedet and sieve, VHLS identifies a BRAM; HercuLeS does not. In these cases, HercuLeS saves a BRAM while VHLS saves on LUTs and FFs (Registers).

HercuLeS is an extensible high-level synthesis environment for automatically mapping algorithms to hardware. Essentially, HercuLeS translates programs in a typed-assembly language named N-Address Code (NAC) to a collection of Graphviz CDFGs (Control-Data Flow Graphs) which are then synthesized to vendor-independent self-contained RTL VHDL. HercuLeS is also used for push-button synthesis of ANSI C code to VHDL.

Overview

The basic steps in the HercuLeS flow are shown in the figure below. C code is passed to GCC for GIMPLE dump generation, optionally after an external source-level optimizer. The textual GIMPLE is then processed by gimple2nac; alternatively, users can directly supply a NAC translation unit or use their own frontend. Alternative frontends, currently work-in-progress, are being developed for LLVM (C, C++, Objective-C) and domain-specific languages (DSLs).

HercuLeS overview.

Various optimizations can be applied at the NAC level: peephole transformations, if-conversion, and function call insertion to enable IP integration. Heuristic basic block partitioning avoids the introduction of excessive critical paths due to operation chaining. The core of HercuLeS comprises a frontend (nac2cdfg) and a graph-based backend (cdfg2hdl). nac2cdfg is a translator from NAC to flat CDFGs represented in Graphviz. cdfg2hdl is the actual synthesis kernel for automatic FSMD hardware generation from Graphviz CDFGs to VHDL, including self-checking testbench generation.

nac2cdfg is used for parsing, analysis and CDFG extraction from NAC programs. cdfg2hdl maps CDFGs to extended FSMDs (Finite-State Machines with Datapath). An ANSI C backend allows for rapid algorithm prototyping and NAC verification. VHDL code can be simulated with GHDL and Modelsim and synthesized in Xilinx XST and Vivado Design Suite using automatically generated scripts.

Features

HercuLeS supports a wealth of features. The following graphic provides a visualization of the most important existing or planned features.

HercuLeS features.

Third-party IP integration

Third-party hardware IP blocks can be seamlessly integrated into HercuLeS. HercuLeS allows for automatic IP integration, given that the user supplies the corresponding builtin functionalities. The HercuLeS user is able to import and use their own IP through the following process:

Implement IP with expected interface and place in proper subdirectory.

Add corresponding entry in a textual database.

Use TXL transformations for replacing an operator use by a black-box function call via a script.

A list of black box functions is generated.

HercuLeS automatically creates a hierarchical FSMD with the requested callee(s).

The following graphic illustrates the combined TXL/C approach. The first two steps apply preprocessing for splitting local variable declarations and removing those that are redundant or unused. Then, they are localized and subsequently procedure calls to black-box functions are introduced. These routines are the actual builtin functions. If the corresponding builtins are listed in the IP database, an interface-compatible VHDL implementation to HercuLeS caller FSMDs is assumed. Then, cdfg2hdl automatically handles interface generation and component instantiation in the HDL description for the caller FSMD description. In addition, simulation and synthesis scripts already account for the IP HDL files.

Automatic IP integration in HercuLeS.

This approach is also valid for floating-point computation, while both pipelined and multi-cycle third-party components are supported.

Summary

HercuLeS delivers a contemporary HLS environment that can be comfortably used for algorithm acceleration by predominantly software-oriented engineers. In this article, we only covered a high-level overview of how HercuLeS works, walked through existing and planned features, and briefly discussed automatic hardware IP integration.

The latest ITRS tables (for 2011-2012 since the 2013 tables are not yet complete: http://www.itrs.net/reports.html) highlight three points that are important to anyone developing or using HLS/ESL solutions.

The responsibility for power reduction efforts will be shifted to higher abstractions.

The following figure illustrates this expected trend: behavioral and architectural level requirements will rise from a 50% to an 80% contribution to overall system power reduction.

Power reduction effort for the 2020 Digital Platform.

High-level synthesis estimates regarding speed, area, power and cost will need to be extremely accurate in the near future.

The quality of estimations produced during the phase of architectural design space exploration is projected to rise to approximately 100% within less than 10 years! The tradeoff of better-vs-faster estimation will be even more evident.

Expected high-level estimates accuracy for the 2020 Digital Platform.

The 2020 Digital Platform will be fully supported by development toolchains.

This means that both programmable and configurable aspects of the platform (such as hardware accelerators) will be accessible via convenient layers of programmability. The current shift towards parallelism-aware languages including C/OpenMP, OpenCL, CUDA, C++ AMP, and OpenACC is clearly visible and a vibrant reality among programmers. The following figure indicates that within less than 10 years ALL computational platforms, from the HPC realm to the autonomous, omnipresent, embedded systems, will require full support by accessible tools.

I would like to present you with a formula for quotient calculation that works for any constant divisor of the form 2^n-1 (^ denotes exponentiation). For n=2,3,4,5,6,7, these divisors are 3,7,15,31,63,127. I have tested all divisors with n up to 16 and the approach works. My formula works only for dividends x that are in the range [0,(2^(2*n))-2].

The formula is as follows and it is both divisionless and multiplierless (as initially derived):

y = (((x>>n)+x+(1<<n)+1)>>n)-1;

which can also be written as:

y = (x + (x>>n) + 1) >> n;

This is a formula for software integer division, i.e. truncating division that calculates the quotient of the division. Please note that 1<<n is the strength-reduced form for calculating 2^n. It does not require any double-word arithmetic (like with multiplying by the multiplicative inverse and then performing adjustment steps).

E.g. for n = 6, the divisor is 2^n - 1 = 63. The formula will work for x <= 4094 = (2^(2*n)) - 2 = (1<<n)*(1<<n) - 2. It will produce off-by-one errors for certain x > (2^(2*n)) - 2. But it will be correct for the specified "small" x values.

I devised the hack via "algorithmic sketching". I mean that I started from a basic formula with some "holes" in it, where the holes were small constants or operators. Such techniques exist in academia, put more nicely; they are sometimes called algorithmic sketching or combinatorial sketching, and they are relevant to forms of program synthesis. Essentially it is about devising algorithmic sketches (incomplete algorithms with wildcards) and then using a mechanism to fill the gaps.

It seems I can claim the specific hack as far back as May 16, 2011.

Special thanks to everyone who has participated in the following threads. The Taylor series connection is very interesting, as mentioned by Tim Wescott and Tauno Voipio.