Ways to speed up code

Are there any ways to speed up code within a while loop? For example, I am running a Kalman filter in a continuous loop, and my rough guess for the loop rate is about 50 Hz. If I wanted the loop to run at 200 Hz, what are some ways to achieve this?

As David said, it would be helpful to see what you're trying to do in the loop. In general, you'll need to convert the loop to assembly to get a 4x improvement or better. If your program is in Spin, it's a little trickier since you can't put assembly code in-line. If your program is in C, the compiler can mix C and assembly easily.

Sometimes you can get a smaller improvement by optimizing the code, eliminating redundant subscripting or other operations, changing the order of calculations, etc. You won't see a 4x change there in most cases.

The C/C++ compiler produces faster and larger code than the Spin compiler. If you're using Spin, you can use "spin2cpp" to translate the Spin parts to C++, then do your tight loop directly in C++ and compile the whole thing using SimpleIDE.

If your program is in Spin you could try spin2cpp to make it run faster.

Thank you for the suggestions. I will post my code when I get home this evening. It is written in C. It sounds like I should convert to assembly for a dramatic increase in speed. I am looking forward to learning how to do this. Is there a way to know for sure the loop time? I know Arduino has millis but haven't seen that function on the Propeller.

David

You can get some performance improvement by selecting the LMM memory model instead of CMM. However, that will make your program a lot bigger so if it's already large it may not fit in hub memory. The other thing you can do is write some of the code using the COG memory model. That code runs directly in COG memory and is nearly as fast as PASM. Finally, you can just write in PASM and get the maximum performance.

Take your inner loop (of the filter) and make a separate function out of it. Compile that using the COG memory model. Compile the other parts of the program using the CMM model. This way you don't need to convert the inner loop code into assembler. Let the compiler do the work.

Mike,

I'm sorry for my ignorance but how would I make a separate function? Your idea sounds great, I just don't know how to implement it.

The biggest performance killer in your loop is probably the use of floating point. If you can change it to fixed point (use integers for everything, perhaps scaled by some amount) the performance will likely improve quite a bit. I'd definitely address that before trying to do anything in another cog.

Also, you are wasting a lot of time addressing individual registers. You could send out the $28 address with bit 6 set to indicate autoincrement and then do three pairs of SPI reads before deselecting. I was a bit confused about your scaling because you have .0175*PI/180 = 0.000305432, but then again I'm not familiar with gyros and what you are trying to achieve.
In Tachyon Forth I use integer maths for speed. The simple way to multiply a number by this value is a mixed multiply and divide with a 64-bit intermediate, using the */ operator in Forth: 305432 1000000000 */ and that only takes 38.4us.

Out of curiosity though I converted part of your code into Tachyon, since I like to see if there is anything I can improve in Tachyon. Here are two subroutines and the start of the inner loop:

The biggest performance killer in your loop is probably the use of floating point. If you can change it to fixed point (use integers for everything, perhaps scaled by some amount) the performance will likely improve quite a bit. I'd definitely address that before trying to do anything in another cog.

This is something I can do relatively easily. I will try this and report back with results. Thanks.

I am relatively new to programming and to the Propeller, so it will take me some time to digest what you have said. One thing I believe you are saying is to reduce the mathematical operations in my loop by hard coding values. This is something I can do pretty easily. The .0175 is a scale factor from the datasheet, and I was converting the data from deg/sec (output by the sensor) to radians/sec. All of the calculations in the filter are done in radians.

As well as using fixed point, simple things like splitting the serial code into a separate COG remove that dead time (and use burst access, as suggested above).
e.g. a read over I2C is not super fast: a burst 100 kHz read of 3x16 bits, with some overhead, can consume ~20% of your targeted 5 ms budget. 400 kHz I2C is better, but still ~5%.

You can also use shift and add, over multiply.
A sensor 'gain' example I find is defined within a band of 3.5(min) < 3.9(typ) < 4.3(max) - roughly 5%, with 10b precision

Taking your example of
.0175*PI/180 = 0.000305432
you can approximate that with fixed point and shifts.
eg if we take
MF=(10000*.0175*pi/180)
MF = 3.0543261 - that does not prune LSBs in fixed point, and takes the 10b values expanding 0~1024 to 0~3072
A simple << 1 and add gives ((2+1)-MF)/MF = -1.77866% gain error, and that could already be good enough, given the sensor is 5% gain accurate.
Or, you can improve more:
((2+1+1/16-1/128)-MF)/MF = 1.182941e-4 (0.0118%), or one part in 8453 gain error, way better than the sensor.
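
For anyone following along in C, here is a minimal sketch of that shift-and-add scaling. The function name is made up for illustration, and it assumes the raw reading is non-negative (right-shifting a negative int is implementation-defined in C, and with GCC it rounds toward negative infinity rather than toward zero).

```c
#include <stdint.h>

/* Scale a raw gyro count by 3.0546875, which approximates
   MF = 10000 * 0.0175 * pi / 180 = 3.0543261 using only adds and shifts.
   The result is the rate in units of 1/10000 rad/s.
   scale_gyro_shift_add is a hypothetical name, not from any library. */
static int32_t scale_gyro_shift_add(int32_t n)
{
    /* n + n + n is 3n; n >> 4 is n/16; n >> 7 is n/128 */
    return (n + n + n) + (n >> 4) - (n >> 7);
}
```

With a full-scale 10-bit reading of 1023 this gives 3125, matching the 3124.575 you would get from the floating point multiply.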

The compiler should be reducing the expression to a constant anyway, but it is still in floating point, which is relatively slow since it has to be done in software, whereas an integer operation using scaling before division is much faster. However, your SPI routines are extremely inefficient and the compiler cannot optimize that for you. You can, though: write %01101000 to address register $28 with the autoincrement bit set (bit 6), and thereafter simply read 2 bytes as X, 2 bytes as Y, 2 bytes as Z, then deselect chip-select. That is much faster, since SPI operations are also done in software. Alternatively, you could have another cog continuously reading the gyro and updating global variables so that your loop doesn't have to wait for SPI.
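
A rough sketch of the bookkeeping for that burst read, in C. The names are hypothetical and the actual SPI transfer is left out, since it depends on which library you use. It also assumes each axis arrives low byte first, which you should check against the datasheet, and it only sets the autoincrement bit described above (any read/write flag your chip needs is not shown).

```c
#include <stdint.h>

#define GYRO_OUT_X_L   0x28  /* first output register, as given in the thread */
#define GYRO_AUTO_INC  0x40  /* bit 6: auto-increment the register address */

/* Build the single address byte for a burst read starting at reg.
   gyro_burst_address is a hypothetical helper name. */
static uint8_t gyro_burst_address(uint8_t reg)
{
    return (uint8_t)(reg | GYRO_AUTO_INC);
}

/* Pull one axis (0 = X, 1 = Y, 2 = Z) out of the six bytes read in a single
   transaction. Assumption: low byte first, then high byte, for each axis. */
static int16_t gyro_axis(const uint8_t raw[6], int axis)
{
    uint16_t word = (uint16_t)(raw[2 * axis] | (raw[2 * axis + 1] << 8));
    return (int16_t)word;
}
```

The idea is one chip-select cycle: send gyro_burst_address(GYRO_OUT_X_L), clock in six bytes, deselect, then call gyro_axis three times.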

I like ersmith's suggestion of removing floating point numbers for two reasons: 1) I agree that it should have a significant effect on performance and 2) it's something you know how to do right away. I'd start there and see how far that gets you.

And build with lmm instead of cmm, that's a really easy one if you are not already.

Don't send 7 bits followed by 1 bit. Just send all 8 bits at once and avoid a lot of overhead related to calling the same function twice.

// Add PWM control for motors
pwm_start(255);

Do this outside of your loop

Don't forget to comment out that last print line in the loop when you're testing the timing. The print function is severely limited by your UART speed.

I'm going to ignore the huge blocks of math - it just isn't my forte. Maybe there is room for improvement, maybe not.

Okay, so you're compiling with lmm, you've fixed the floating point to be fixed point, and you've combined your redundant shift_out calls, and you moved pwm_start, and you commented out the last print line and your math algorithms are as good as they can be.... but it still isn't fast enough. What's next?

These functions from simpletools.h are pretty slow. They're nice and easy to understand, and it's great that they come built in with SimpleIDE, but they are slow. There are faster options out there - namely PropWare - but it will require more effort to convert your code base, so let's save it for a last resort.
I'm guessing based on the code comments that you might be using an L3G gyroscope, so here's PropWare's demo code for communication with an L3G: https://david.zemon.name/PropWare/api-develop/L3G_Demo_8cpp-example.xhtml

The sine table in the Prop comprises 2049 words in ROM, from $E000 to $F001. To use it you have to scale your input to 13 bits, to cover 360 degrees. Use 11 of those bits for the table lookup, and two bits to put it in the correct quadrant. The floating point object uses the table with interpolation for trig functions, and either CORDIC or a Taylor series for the inverse trig functions.
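
Here is a sketch of that quadrant handling in C, under a couple of assumptions: the function names are invented for this example, and the table is passed in as a pointer so the logic can be tried anywhere. On the Propeller you would presumably pass (const uint16_t *)0xE000, but I have not checked how that interacts with each memory model.

```c
#include <stdint.h>

/* Index into the 2049-entry quarter-wave table for a 13-bit angle
   (0..8191 covers 0..360 degrees). Hypothetical helper name. */
static uint32_t sin_table_index(uint32_t angle)
{
    uint32_t offset = angle & 0x7FF;   /* 11 bits within the quadrant */
    if (angle & 0x800)                 /* quadrants 2 and 4 read the table backwards */
        offset = 0x800 - offset;
    return offset;
}

/* Nonzero in quadrants 3 and 4, where sine is negative. */
static int sin_is_negative(uint32_t angle)
{
    return (angle & 0x1000) != 0;
}

/* Look up sine from a table laid out like the ROM one. */
static int32_t sin_lookup(const uint16_t *table, uint32_t angle)
{
    int32_t value = table[sin_table_index(angle)];
    return sin_is_negative(angle) ? -value : value;
}
```

Cosine can use the same routine by adding a quarter turn (0x800) to the angle first.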

If all your values fit in a certain range, they can be expressed as "fixed point" rather than floating point. For example, if you know that x and y are between -32767 and +32767, they can be stored as "16.16 fixed point". Basically you multiply all of the values by 0x00010000. So for example 0.5 is stored as "0.5 * 0x010000", or "0x8000". Then addition and subtraction work the same, but when multiplying you have to shift the final result by 16 (to do "x = a * b" you have to do something like "x = (a * (int64_t)b) >> 16"). This is a lot more work for the programmer, but it is enormously faster on the computer than floating point. Google "fixed point arithmetic" for more details.
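
As a small illustration of the 16.16 idea, something like the following should work. The type and function names here are made up for the example, not taken from any Propeller library.

```c
#include <stdint.h>

typedef int32_t fix16_t;                 /* 16.16 fixed point (hypothetical name) */
#define FIX16_ONE ((fix16_t)0x00010000)  /* the value 1.0 */

/* Convert a whole number to 16.16. */
static fix16_t fix16_from_int(int32_t n)
{
    return n * FIX16_ONE;
}

/* Multiply two 16.16 values: widen to 64 bits, then shift the extra
   16 fractional bits back out. */
static fix16_t fix16_mul(fix16_t a, fix16_t b)
{
    return (fix16_t)(((int64_t)a * b) >> 16);
}
```

For example, fix16_mul(0x8000, fix16_from_int(3)) is 0.5 * 3 and comes out as 0x18000, which is 1.5. Addition and subtraction are just the ordinary + and - on the raw integers.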

The sin/cos table in the Propeller ROM is stored as fixed point, I think with 15 bits after the decimal point.

Another way of speeding up your code might be to get rid of the pause() statements and re-arrange things so that the computations are taking place when you would otherwise be paused. You'll still have to make sure that there is a delay there, perhaps by using waitcnt(). I'm not sure what "pause(1)" does, but if it waits for a millisecond you could replace it with:

uint32_t waituntil = CNT + _clkfreq / 1000; // calculate time we have to wait until
...
// do some calculations here
...
// now pause until at least 1 millisecond has elapsed since the calculations started
while ((int)(CNT - waituntil) < 0)
;
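
The same CNT counter can probably also answer the earlier question about measuring the loop time. The helpers below are only a sketch with made-up names. They take the counter readings and clock frequency as arguments, on the assumption that you read CNT at the top and bottom of the loop and pass in _clkfreq as in the snippet above.

```c
#include <stdint.h>

/* Ticks between two CNT readings. Unsigned subtraction gives the right
   answer even if the 32-bit counter wrapped once in between. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert system-clock ticks to microseconds for a given clock frequency. */
static uint32_t ticks_to_us(uint32_t ticks, uint32_t clkfreq)
{
    return (uint32_t)(((uint64_t)ticks * 1000000u) / clkfreq);
}
```

Usage would look roughly like: uint32_t t0 = CNT; then the loop body; then uint32_t us = ticks_to_us(elapsed_ticks(t0, CNT), _clkfreq);. At an assumed 80 MHz clock, 400000 ticks is 5000 us, which is the 200 Hz target. Print the result outside the timed section so the UART doesn't skew it.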

(1) Try compiling with different options. As I think David said above, LMM is faster than CMM. The -O3 option is faster than the usual -Os option, although it may make even bigger code.

(2) Make sure you use -m32bit-doubles as an option.

(3) Try to avoid computing the same value more than once. Instead of calling cos(theta) in many places, create variables like "float ct, st;". Every time theta changes (not very often in your code), recalculate "ct = cos(theta); st = sin(theta)", and use "ct" and "st" in all the expressions in place of "cos(theta)" and "sin(theta)". At optimization level -O3 the compiler may be able to do this for you, but it doesn't hurt to try to help it out.

I'm just not understanding how I can use ints when many of my values are floating points. When multiplying by 180/pi, for example, it does not work as an int. What am I missing here?

Thanks.

You also need to scale, which is what fixed-point means, and you need to take more care with overflow/underflow.

Taking the example above, you read an integer value 1..1023, and multiply by .0175*PI/180 = 0.000305432
That gives you floating point numbers from 0.000305432 to 0.31245756 at that stage, but many of those digits are a mirage, as the gain error is ~ 5% and the reading quanta is ~ 0.1%

If instead you do this MF=(10000*.0175*pi/180)
Now you create = 3.0543261 to 3124.575, or as integers, that can be 3~3125 That *10000 has moved the decimal point a fixed amount.

Because that scale is a known constant, you can even speed the Multiply here, by doing shifts and adds.
e.g. a very fast (3*N+N/16-N/128) has 0.0118% gain error. Checking with N=1023: (N+N+N+(N>>4)-(N>>7)) = 3125

From there, you can further multiply and divide, & you need to check multiply does not overflow 32b and divides do not truncate too many bits.
That's why it's useful to litmus-check against what you start with, which here is ~10 bit readings.

In some cases, a 32*32 -> 64, then 64/32 mathop is useful. (Peter mentions a 64-bit intermediate above)
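
In C, that multiply-then-divide with a 64-bit intermediate might be sketched as below. The function name is hypothetical. The example scaling (multiply by 305432, divide by 1000) is my own choice so that the result comes out in millionths of a rad/s. It is not the exact pair of constants quoted for Forth above.

```c
#include <stdint.h>

/* Compute value * mul / div using a 64-bit intermediate so the product
   cannot overflow 32 bits. The caller must make sure the final quotient
   fits in 32 bits. mul_div is a hypothetical name. */
static int32_t mul_div(int32_t value, int32_t mul, int32_t div)
{
    return (int32_t)(((int64_t)value * mul) / div);
}
```

For a reading of 1023, mul_div(1023, 305432, 1000) gives 312456, i.e. 0.312456 rad/s, in line with the 0.31245756 from the floating point version. A reading of 30000 would overflow a plain 32-bit multiply but still works here.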

So yes, it is more work, but if you need a significant jump in speed, effort is needed

I see 180/PI and PI/180 in the snippet that you posted above. Am I jumping to the wrong conclusion to think that PI=180 degrees, eh, 180/PI = PI/180 = 1?

Fractions in fixed point are often dealt with as rational values or approximations. For example, the value 0.0175 * 180 can be dealt with ad hoc as 7/400 * 180000, then arranged to do the multiplication before the division and scaled up like that to maintain 3 digits precision.

In Spin, general precision can be had using the ** operator, high multiply, 180000 ** 75161928. That is implicitly 180000 * 75161928/(2^32). The divide by 2^32 comes from an implicit shift right by 32 bits after the multiply into 64 bits.
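
If you are in C rather than Spin, the same high multiply can be written by hand. This is a sketch with an invented function name. The constant 75161928 is 0.0175 * 2^32 rounded to the nearest integer.

```c
#include <stdint.h>

/* Upper 32 bits of the signed 64-bit product, like the ** operator
   described above. mul_high is a hypothetical name. */
static int32_t mul_high(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

mul_high(180000, 75161928) returns 3150, which stands for 3.150, i.e. 0.0175 * 180 with three digits after the decimal point.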

In any case, all the scaling and keeping track of the decimal point can be a bear in a complex problem and will make you appreciate floating point, which takes care of all that (until it doesn't!)

I'm jumping in without reading much ... David (amalfi),
The trick is getting your head around fixed-point numbers. You have to convert everything over. You choose a number of decimal places, or bits, to be used for fractional values.

Convert the constants to suit this: In the case of 180/pi = 57.295779513: If choosing to use 4 decimal places in fixed point decimal then this becomes an integer of 572958, which stands in for 57.2958. If done in, say, 16.16 fixed point binary, it would be $0039 4bb8, which stands in for 57.295776367.

Be wary that you are now effectively using integers, which means all calculations will naturally truncate rather than round. For positives, this rounds toward zero. For negatives, this rounds toward negative infinity I think.
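
A short sketch of both points in C, with invented names. The conversion macro is only meant for positive constants. On the rounding question: in C99 and later, integer division truncates toward zero, while right-shifting a negative value is implementation-defined (GCC does an arithmetic shift, which rounds toward negative infinity).

```c
#include <stdint.h>

/* Convert a positive constant to 16.16, rounding to nearest.
   TO_FIX16 is a hypothetical macro for this example. */
#define TO_FIX16(x) ((int32_t)((x) * 65536.0 + 0.5))

static const int32_t RAD_TO_DEG_FIX16 = TO_FIX16(57.295779513); /* 0x00394BB8 */
static const int32_t RAD_TO_DEG_DEC4  = 572958;  /* 57.2958 with 4 decimal places */

/* Integer division: truncates toward zero for either sign. */
static int32_t div_by_2(int32_t a)
{
    return a / 2;
}

/* Arithmetic right shift (assumption: GCC behaviour): rounds toward
   negative infinity for negative values. */
static int32_t shift_by_1(int32_t a)
{
    return a >> 1;
}
```

So -7 / 2 gives -3 but -7 >> 1 gives -4 with GCC, which matters if you replace divides with shifts on signed sensor data.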

Using the same format as the sine tables is sensible. So, not really your choice now.

PS: At some point you'll likely have dynamic range issues. Cross that hurdle when you get there.

PPS: Lol, I see everyone has answered. I'm gonna leave this one posted anyway.