Hardware acceleration is advantageous for performance, and practical when the functions are fixed, so updates are not as needed as in software solutions. With the advent of reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since the 2010s, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing control flow.[5][6][7]

Hardware execution units do not, in general, rely on the von Neumann or modified Harvard architectures, and need not perform the instruction fetch and decode steps of an instruction cycle or incur their overhead. If the required calculations are specified in a register-transfer-level (RTL) hardware design, the time and circuit-area costs of the instruction fetch and decode stages can be reclaimed. This saves time, power, and circuit area; the reclaimed resources can be used for increased parallelism, possibly across multiple functions, as well as for increased input/output capability, at the opportunity cost of less general-purpose utility. Additionally, greater RTL customization of hardware designs allows emerging architectures such as in-memory computing, transport-triggered architectures (TTA), and networks-on-chip (NoC) to benefit further from increased locality of data to execution context, thereby reducing computing and communication latency between modules and functional units.

Suppose we wish to compute the sum of 2^20 = 1,048,576 integers. Assuming a large-integer (bignum) type is available that is large enough to hold the sum, this can be done in software with a simple accumulation loop.
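The original listing for this example is not reproduced here; a minimal sketch of such a software loop, in Python with hypothetical illustrative input data, might look like the following (Python integers are arbitrary-precision, so no explicit bignum handling is needed):

```python
# Sequentially sum 2^20 integers, one dependent addition per iteration.
N = 2 ** 20  # 1,048,576 values

# Hypothetical input: for illustration we simply sum 0 .. N-1.
data = range(N)

total = 0
for x in data:   # each addition depends on the previous result
    total += x

# Sanity check against the closed form 0 + 1 + ... + (N-1) = N*(N-1)/2
assert total == N * (N - 1) // 2
```

On a sequential processor, each iteration's addition depends on the previous one, so the loop takes on the order of N dependent additions, in addition to the instruction fetch and decode overhead discussed above.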

Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general purpose algorithms controlled by instruction fetch (for example moving temporary results to and from a register file). Hardware accelerators improve the execution of a specific algorithm by allowing greater concurrency, having specific datapaths for its temporary variables, and reducing the overhead of instruction control in the fetch-decode-execute cycle. Modern processors are multi-core and often feature parallel SIMD units; however hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm which is executed frequently. Depending upon the granularity, hardware acceleration can vary from a small functional unit, to a large functional block (like motion estimation in MPEG-2).
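To make the concurrency benefit concrete, the summation above can be restructured as a pairwise (tree) reduction, the pattern a hardware adder tree implements: all additions within a level are independent, so a circuit with enough adders needs only about log2(N) sequential levels rather than N-1 dependent additions. The following Python sketch merely simulates this structure in software (the function name and shape are illustrative, not from the original article):

```python
def tree_sum(values):
    """Simulate a pairwise reduction; returns (sum, number of levels)."""
    values = list(values)
    levels = 0
    while len(values) > 1:
        # All additions at this level are mutually independent, so
        # hardware could perform them in parallel in one step.
        pairs = [values[i] + values[i + 1]
                 for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # carry an odd leftover element upward
            pairs.append(values[-1])
        values = pairs
        levels += 1
    return values[0], levels

total, depth = tree_sum(range(8))
# 8 inputs require 3 levels of concurrent additions
# instead of 7 sequential dependent additions.
```

Run in software this gains nothing, but it shows why a dedicated datapath with many adders can finish in logarithmic rather than linear time.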