Wednesday, August 12, 2009

So I've had to write a highly optimized processing block. This seemed pretty daunting, so I chose to ignore performance at first and begin with a very structured hand-written state machine. Once the process was working, I moved on to the optimization.

Taking a well-written state machine and making it fully pipelined is very easy! I can't give a wizard-like step-by-step process for this, but some simple ideas come to mind. If the process can be optimized, that means your states don't have to be mutually exclusive, i.e. some of the processing steps can run in parallel with other steps. To create that situation, take code out of your state machine and put it in separate flag-enabled blocks, then use the flags to control which steps run when. Often you'll find that many of the steps can run without flags of their own, and all that matters is a process start flag or end flag.

I have written a very simple example module to show how to do this. Once read_enable is asserted, the data takes one clock to come in. The hardware_manipulation block takes 2 clocks to process the data and send it out.

When you look at the state machine you will see these steps:

0. request data (read_enable <= 1)
1. wait one clock while data is being retrieved
2. put received data into hardware_manipulation block (data_in <= data)
3. wait one clock while hardware_manipulation block is calculating
4. wait one more clock while hardware_manipulation block is outputting data
5. put hardware manipulated data out to result (result <= data_out) and go back to start
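To make the clock count concrete, here's a rough Python model of those six states (a software sketch, not the actual HDL module). The names run_sequential, samples and manipulate are made up for illustration, and manipulate just stands in for whatever hardware_manipulation really computes.

```python
def run_sequential(samples, manipulate):
    """Step a six-state machine one clock at a time.

    Hypothetical model: `manipulate` stands in for hardware_manipulation.
    Returns (results, clocks_used).
    """
    results = []
    clocks = 0
    for sample in samples:
        read_enable = False
        data = data_in = calculated = data_out = None
        for state in range(6):
            if state == 0:
                read_enable = True                # 0. request data
            elif state == 1:
                read_enable = False
                data = sample                     # 1. data arrives one clock later
            elif state == 2:
                data_in = data                    # 2. feed the block
            elif state == 3:
                calculated = manipulate(data_in)  # 3. block is calculating
            elif state == 4:
                data_out = calculated             # 4. block is outputting
            else:
                results.append(data_out)          # 5. result <= data_out
            clocks += 1                           # every state costs one clock
    return results, clocks


results, clocks = run_sequential([1, 2, 3], lambda x: x * 2)
print(results, clocks)
```

Since only one state is ever active, the total is always 6 clocks times the number of samples.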

This complete process takes 6 clocks per result. In the optimized version each result still takes 6 steps (6 clocks) from request to output, but the process is restarted every 2 clocks, so once the pipeline is full it produces results 3x faster.

When you look at the optimized version you will see these steps:

0. request data (read_enable <= 1)
1. wait one clock while data is being retrieved
2. put received data into hardware_manipulation block (data_in <= data) AND in parallel start from 0 again (i.e. request data again)
3. wait one clock while hardware_manipulation block is calculating
4. wait one more clock while hardware_manipulation block is outputting data
5. put hardware manipulated data out to result (result <= data_out)
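The same idea can be checked with a rough Python model of the timing (again a sketch, not the HDL). run_pipelined, depth and interval are hypothetical names; the model only tracks which item sits in which step, and it applies manipulate when an item reaches step 5 rather than modeling the block's internals. None marks an empty step, so the samples themselves shouldn't be None.

```python
def run_pipelined(samples, manipulate, depth=6, interval=2):
    """Model the pipelined version: a new item enters every `interval` clocks
    and needs `depth` clocks to come out. Returns (results, clocks_used)."""
    slots = [None] * depth      # slots[i] holds the item currently in step i
    results = []
    clocks = 0
    next_index = 0
    while len(results) < len(samples):
        # Start a new item every `interval` clocks while input remains.
        start = clocks % interval == 0 and next_index < len(samples)
        entering = samples[next_index] if start else None
        if start:
            next_index += 1
        # One clock: every item in flight moves on to its next step.
        slots = [entering] + slots[:-1]
        clocks += 1
        # An item in the last step is written out to the result.
        if slots[-1] is not None:
            results.append(manipulate(slots[-1]))
    return results, clocks


results, clocks = run_pipelined([1, 2, 3], lambda x: x * 2)
print(results, clocks)
```

For n items this model needs 6 + 2 * (n - 1) clocks instead of 6 * n, so the speedup approaches 3x as n grows.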

Once the stages get going you will have stage 0 running with stages 2 and 4, and stage 1 running with stages 3 and 5.
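That overlap falls out naturally if you think of the flags as a one-bit shift register: the start flag goes in at stage 0 every 2 clocks and moves down one stage per clock. Here's a small Python sketch of that idea (active_stages, depth and interval are names I made up, and this is a model rather than the module itself).

```python
def active_stages(num_clocks, depth=6, interval=2):
    """Return, for each clock, the list of stages whose enable flag is set."""
    flags = [False] * depth
    schedule = []
    for clock in range(num_clocks):
        start = clock % interval == 0   # restart the process every `interval` clocks
        flags = [start] + flags[:-1]    # shift every flag one stage down the line
        schedule.append([stage for stage, on in enumerate(flags) if on])
    return schedule


for clock, stages in enumerate(active_stages(8)):
    print(clock, stages)
```

From clock 4 onward the active sets alternate between stages 0, 2, 4 and stages 1, 3, 5.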