But the obvious answers are: a pipelined Harvard architecture with multiple buses, hardware MAC (multiply-accumulate) units that can do one or more MACs in a single instruction, modulo addressing for circular buffers, minimal use of caches, bit-reversed addressing for FFTs, and hardware looping to reduce loop overhead.
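To make a few of those concrete, here is a rough sketch in plain C of what the hardware is saving you. It emulates in software the circular-buffer wrap, the MAC loop, and bit-reversed indexing that a DSP typically gets from its address generators and loop hardware. The names (`NTAPS`, `fir_state`, `fir_step`, `bit_reverse`) are made up for illustration, and the buffer length is assumed to be a power of two so the wrap can be a simple mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTAPS 4u  /* hypothetical filter length; assumed to be a power of two */

typedef struct {
    int16_t delay[NTAPS];  /* circular delay line */
    size_t  head;          /* index of the newest sample */
} fir_state;

/* Compute one FIR output sample. On a DSP the wrap below is done by the
 * address generator (modulo addressing), each loop body is one MAC
 * instruction, and the loop counter is handled by a hardware repeat.
 * Here all three cost ordinary instructions. */
static int32_t fir_step(fir_state *s, const int16_t coeff[NTAPS], int16_t x)
{
    assert((NTAPS & (NTAPS - 1u)) == 0u);    /* mask trick needs a power of two */

    s->head = (s->head + 1u) & (NTAPS - 1u); /* advance with wrap */
    s->delay[s->head] = x;

    int32_t acc = 0;                         /* wide accumulator */
    size_t idx = s->head;
    for (size_t k = 0; k < NTAPS; ++k) {
        acc += (int32_t)coeff[k] * s->delay[idx];  /* multiply-accumulate */
        idx = (idx + NTAPS - 1u) & (NTAPS - 1u);   /* step back with wrap */
    }
    return acc;
}

/* Reverse the low `bits` bits of i, as used to reorder FFT inputs or
 * outputs. A DSP with bit-reversed addressing produces this index as a
 * side effect of the pointer update instead of running a loop. */
static unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}
```

Feeding an impulse through `fir_step` should read the coefficients back out one per call, which is an easy sanity check. On a general-purpose CPU the masking, index update, and loop branch all compete with the multiply for issue slots; on a DSP they are effectively free, which is the point of the features listed above.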