while I had to mess with RL78-GCC, AVR-GCC and ARM-GCC compiler settings extensively to be able to get it to toggle a GPIO pin efficiently,

Tell us more about the extensive settings needed for avr-gcc when toggling a pin? With nothing but -Os on the command line a write to PIN will be an SBI, and for an older-school AVR that requires an RMW it will be 3 opcodes. How could it be more efficient than this, and that is with nothing more than -Os?? (which most IDEs and makefiles pass as the default anyway)

I'm not necessarily talking about using an IDE anyway. If I sit at a command line and type only "avr-gcc -mmcu=at<something> -Os avr.c -o avr.elf" that will build a program that uses "tight" toggling code, so I'm still a little perplexed by the extensive-settings claim. Perhaps that original "joke" about GCC optimisation wasn't a joke at all and the OP simply does not understand how to drive GCC? While it still cannot claim to be as good as IAR, I would say the code generation is now as "tight" as just about any other choice of C compiler for AVR. Especially with G-J's changes going into v8.x

As for compilers, I would not call AVR-GCC "high quality" — I would call it completely average, when compared to other platforms and compilers I've tested so far. Without the optimizer on, it produces fairly mediocre code (a bit-set operation was compiled into 9 instructions, in my testing), and with the optimizer configured to do anything useful, the code is very difficult to debug — I'm often forced to use assembly breakpoints, as Atmel Studio can't seem to figure out what I'm trying to do.

Clearly rubbish or a joke but maybe not? OP doesn't seem to know how to operate GCC if he thinks it would ever be relevant to run without optimizer!!

Tell us more about the extensive settings needed for avr-gcc when toggling a pin?

With the optimizer off, GCC (on AVR, ARM, or RL78) doesn't seem to know what SFRs are — it treats everything as 16-bit memory, with multiple instructions for indirect accesses. No other compiler I've tested so far does this.

With the optimizer set to -O1 or -Os, AVR-GCC begins treating SFRs as SFRs (with "in" and "out" instructions), but GCC's RL78 backend did not — in fact, it *never* understood SFR memory. I tried many, many settings, to no avail.

With the optimizer turned on at all in AVR-GCC, many of my breakpoints stopped working, and temp variables are immediately compiled out. I tried many different combinations of "Debug level" and "Optimization level" settings in Atmel Studio, but I could never get perfect debugging along with actual register manipulation. Please help if you have suggestions — seriously! Maybe this is more the Atmel Studio debugger's fault than avr-gcc's, but I'm sort of referring to the toolchain as a whole when talking about AVR-GCC (sorry if this offends you). At the end of the day, other compilers/toolchains didn't have this problem.

clawson wrote:

Clearly rubbish or a joke but maybe not? OP doesn't seem to know how to operate GCC if he thinks it would ever be relevant to run without optimizer!!

I have no problem turning on the optimizer, but I need basic breakpoints to work, too. Even when I have the optimizer cranked all the way up in most environments, breakpoints still work (though variable watches don't). I understand if a variable gets optimized out (that's fine), but to get basic breakpoints working, I end up having to read through the assembly listing (which is not as easily-accessible as in other IDEs), and set assembly breakpoints instead of C breakpoints.

None of this is the end of the world. I get that. GCC is perfectly fine — and I will gladly continue using it when working on AVR projects. But the whole point of this project is to compare what's out there, and compared to the other compilers I've tested on these different MCU platforms, I'd call it completely average. That was all I was saying.

If you want me to say something nice about AVR-GCC specifically, I will say that it's much better than the RL78's GCC implementation. Will you get off my back now?

I think you would be blown away by the Silicon Labs EFM8 stuff. Three-stage pipelined architecture, running up to 72 MHz. None of this old-school 12-cycle-machine-clock rubbish; this is a single-cycle machine that, clock-for-clock, matches the TinyAVR closely (not going to say more until I finish testing!)

I have some sitting on the bench back home and one of the first jobs when I get back from holiday is to run them up and do some tests.

They are certainly very different to the 12T 12MHz parts I first used last century. On raw MIPS they will be no slouch when compared to the AVR; I just wonder what they will be like in the real world when the lack of modern 'compiler-friendly' features starts to bite. The early design decisions made by Atmel, as documented in the PDF which gets linked to from time to time, along with their collaboration with the compiler writers, has yielded a very capable 8-bitter which seems remarkably unconstrained, unlike some other chips.

#1 This forum helps those that help themselves

#2 All grounds are not created equal

#3 How have you proved that your chip is running at xxMHz?

#4 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

And it's a problem for a (good) compiler to place a breakpoint if your code doesn't exist any more because it doesn't do anything.

For small test programs, make sure to make "key" variables volatile or something like that; if not, GCC will not generate any code, because the result isn't used for anything!

Thanks for the good tips, but I'm well aware of all of these -- and none of them address what I'm saying: AVR-GCC doesn't use register accesses when the optimizer is off, but as soon as you switch the optimizer to any level at all, breakpoints can start being problematic.

I'm not in front of AVR Studio right now, but I believe I've got a pathological case to illustrate what I'm saying:

while (1) {
    DDRB ^= 1;
}

With the optimizer off, that single line of code will get compiled to a single instruction to write the immediate value "1" to a register (used as the xor argument), followed by 3 or 4 instructions to do an indirect fetch from 16-bit memory, an xor operation, and another 3 or 4 instructions to do an indirect write back to 16-bit memory, followed by a jump. No "out" or "in" instructions will be present, even though we're obviously dealing with SFRs.

Alright, that's crap, so let's turn the optimizer on at any setting -- -Og, -O1, whatever -- and if you recompile, those gross 4-instruction memory fetches turn into a single "in" instruction and a single "out" instruction. Atmel knows this is how you have to use AVR-GCC, so they make -O1 the default option. However, if I try to set a breakpoint on the DDRB toggle line, it will fire ONCE when the program starts, but never fire again. Why? Because the breakpoint is getting set on the single "load value 1" register load that happens outside the loop (since the optimizer is on!).

Again, I know perfectly well how to deal with this. You can go to the assembly view and set the breakpoint on the "in" instruction. But this would be easier if AVR-GCC always used "in" and "out" instructions when doing register operations, even with the optimizer off. No other compiler considers these "optimizations".

That was the only point I was trying to make when I said "I find GCC completely average" when compared to everything else out there. It's certainly not "the best" and it's certainly not "the worst" -- and if you know how it works, you can use it to efficiently generate AVR code; but it takes a bit more "thinking" than other compilers do.

I just wonder what they will be like in the real world when the lack of modern 'compiler-friendly' features starts to bite. The early design decisions made by Atmel, as documented in the PDF which gets linked to from time to time, along with their collaboration with the compiler writers, has yielded a very capable 8-bitter which seems remarkably unconstrained, unlike some other chips.

I think RISC cores with lots of registers were seen as the "modern, compiler-friendly" architecture back when the AVR was designed, but -- and I'm not trying to start a flame war -- more CISC-ish cores seem like they ultimately came out ahead as compilers got more and more advanced. It's much easier to write a compiler for AVR than for 8051; however, there are compilers that work equally well for both of them.

Note that there are a few... uhh... eccentricities that you have to deal with. For performance reasons, Keil passes parameters to functions using predefined registers, not the stack, so Keil will throw a warning if you call a function from within itself (though you can append "reentrant" to the function declaration to force Keil to use a different value-passing strategy that is safe for re-entrant functions).

I think the biggest one for beginners that even modern compilers don't try to solve for you is the RAM vs XRAM thing. A compiler can be instructed to assume all variables go in either RAM or XRAM, but you'll still find yourself specifying this manually. Or you can just put everything in XDATA and not care about squeezing out performance. My 16-bit signed biquad filter performance tests of the Nuvoton N76 (an 8051 derivative) produced, I think, 35 ksps with the buffers in XRAM and 40 ksps with the buffers in RAM. Huge difference, but it's not, like, you know TEN TIMES or something insane like that.

I agree AVR is a good architecture. But for it to be "unconstrained" when compared to, e.g., the EFM8 stuff, I'd like to see it with a 72 MHz core clock and an internal LDO to allow a much smaller process with a 1.8V core. Again, clock for clock, they're about the same -- but on AVR, you hit that 20 MHz speedbump pretty quickly (and that's assuming you want to drop a crystal into your design that could cost nearly half as much as the MCU itself!)

So yeah, there's other things at work than just the core architecture design.

For the 8051, Keil invented a 3-byte pointer so all memory can be reached -- don't they have that any more?

To be ANSI C the code needs to be reentrant; if not it's cheating, and my guess is that you can force GCC to do the same. (I remember some 8051 code where I had to have some duplicate library routines so main and the ISR could do the same things -- that was the BSO compiler.)

A limit of 20 MHz? Then there are all the Xmegas with 32MHz, but I guess that they start at around $2, and they have a DMA controller, faster ADC etc. (take a look at something like an ATXMEGA32E5 for $2.06 @100 at Digikey), and the good thing is it's the same tool. (And it runs 32MHz from the internal clk.)

About a crystal for 20MHz: yes it's sad, the chips used to be able to run 20MHz from a $0.15 crystal, but some of the newer chips don't do that :(

Speed compared between 8051 and AVR: I would say that a (single-clk) 8051 is in general faster (at the same clk speed) if you can stay inside the 256 bytes of RAM, whereas the AVR doesn't have any penalty for more RAM, and there it's normally faster.

But all this said, normally I used to say that if it's only about the price, an AVR is only a contender if you need EEPROM; but for normal small volumes your development speed means more than the chip price.

And the good thing is that the same compiler/tool handles AVRs from 1/2 Kbyte of flash up to 512 Kbyte (perhaps there is a bigger one out now).

For the 8051, Keil invented a 3-byte pointer so all memory can be reached -- don't they have that any more?

Yup, they do. I'm impressed you know that detail! It's called a "generic pointer" — Keil does automatic conversion between pointer memory spaces for you, but it obviously can require extra instructions. Functionally, though, it's completely transparent to the user.

sparrow2 wrote:

To be ANSI C the code needs to be reentrant; if not it's cheating, and my guess is that you can force GCC to do the same.

Yeah, like I said, Keil can generate reentrant-capable functions if you decorate the function declaration appropriately, but if a function doesn't need to be reentrant, you save a few cycles by leaving it at the default (non-reentrant).

sparrow2 wrote:

Then there are all the Xmegas with 32MHz, but I guess that they start at around $2, and they have a DMA controller, faster ADC etc. (take a look at something like an ATXMEGA32E5 for $2.06 @100 at Digikey), and the good thing is it's the same tool. (And it runs 32MHz from the internal clk.)

Yeah! I'll probably buy an Xmega at some point to play with, but not for this review. I'm curious where you think they fit into the world, in 2017, with all the Cortex-M0 stuff Atmel is doing? The SAM D10, a 48 MHz modern part, is significantly cheaper than a mega168pb, and has similar capabilities.

sparrow2 wrote:

Speed compared between 8051 and AVR: I would say that a (single-clk) 8051 is in general faster (at the same clk speed) if you can stay inside the 256 bytes of RAM, whereas the AVR doesn't have any penalty for more RAM, and there it's normally faster.

Yup, you got it. With Silicon Labs' pipelined cores, the number of clock cycles an instruction takes is simply equal to the instruction's length in bytes (conditional branches aside). You have essentially three levels of granularity on the 8051 -- registers, "scratchpad" RAM, and XRAM -- so MOV and math operations can take 1, 2 or 3 clock cycles, depending on what you're operating on (a gross simplification, but a useful way of thinking about things, in my opinion).

Yup, you got it. With Silicon Labs' pipelined cores, the number of clock cycles an instruction takes is simply equal to the instruction's length in bytes (conditional branches aside). You have essentially three levels of granularity on the 8051 -- registers, "scratchpad" RAM, and XRAM -- so MOV and math operations can take 1, 2 or 3 clock cycles, depending on what you're operating on (a gross simplification, but a useful way of thinking about things, in my opinion).

On an 8051 I would divide the internal RAM into two parts: the lower 128 bytes and the high 128 bytes (unless you reserve the high 128 as stack only).

Speed compared between 8051 and AVR: I would say that a (single-clk) 8051 is in general faster (at the same clk speed) if you can stay inside the 256 bytes of RAM, whereas the AVR doesn't have any penalty for more RAM, and there it's normally faster.

Yup, you got it. With Silicon Labs' pipelined cores, the number of clock cycles an instruction takes is simply equal to the instruction's length in bytes (conditional branches aside). You have essentially three levels of granularity on the 8051 -- registers, "scratchpad" RAM, and XRAM -- so MOV and math operations can take 1, 2 or 3 clock cycles, depending on what you're operating on (a gross simplification, but a useful way of thinking about things, in my opinion).

There is some spread in the 'faster' bands.

8051 has boolean opcodes, interrupt priority and register bank switching, and can DJNZ on any DATA memory location -- code that uses those features benefits.

AVR has some 16b-data opcodes and better pointer operations, so code that uses those can look better.

The biggest difference is that AVR tops out at 16-20MHz at 5V, but lower MHz at lower Vcc. 8051s top out at 72MHz (LB1) at 3V, or 25~33MHz at 2.2~5.5V for other vendors.

The SiLabs series have what is effectively a fractional baud UART, even on the smallest parts, so peripherals can make a difference.

jaycarlson wrote:

With Silicon Labs' pipelined cores, the number of clock cycles an instruction takes is simply equal to the number of bytes long the instruction is (minus conditional branches).

Most 1T 8051's have at least some 1-byte 1-cycle opcodes; in the better ones nearly all 1-byte opcodes are 1 cycle.

The new STC8F makes quite a leap, into a 24b opcode fetch, which means all opcodes (1, 2 or 3 bytes) can have a 1-cycle base -- e.g. bit/push/pop & mov dir,dir are now all 1-cycle instructions.

It's more important where Keil places local variables! If it's not a stack it can't be reentrant!

And as I said, Keil's C51 compiler does not generate reentrant functions unless you explicitly ask it to (by decorating the function with the "reentrant" keyword). Locals end up in registers, until Keil runs out of space, and then it starts using RAM.

So I will just say that isn't fair for the AVR, and to be harsh, that is like a C compiler competing with ASM written with C syntax.

I hear you. I wouldn't call it "fair" or "unfair" — just different strengths and weaknesses based on different design choices. I get what you're saying about the "competing with ASM written with C syntax", but it's really just that the developer needs to have a more thorough understanding of the memory model of the platform, which you don't need for AVR. That's what made AVR look very elegant when it was introduced. For what it's worth, I've had to use reentrant functions precisely once in the three or four commercial projects I've done on 8051s, and it's easily accomplished by adding the "reentrant" keyword to your function. Keil will throw a warning (though not an error, oddly!) if you forget this. Annoying, but workable. Generally, you don't need to know what's going on under the hood, unless you really care.

When you declare global variables without decorating them, they'll go in whichever memory space is "default" for your memory model. Keil's "small" model places variables in RAM by default, while the "large" model places variables in XRAM by default, freeing your precious 128 bytes of RAM for locals and other stuff you need to optimize a bit. You can always override where variables are stored with the "xdata" or "data" keywords (horrible, horrible keywords — to this day, I always do a double-take when I get a weird compiler error thrown by a "void myFunction(uint8_t data)" declaration).

This stuff doesn't bother me as much as the 128-byte SFR limit, which is pretty easy to hit on modern MCUs with tons of peripherals. Manufacturers often use paging (sort of like bank-select statements in PIC), but unlike Microchip's XC8 compiler, Keil doesn't automatically generate SFR page select instructions, so if your Timer1 starts acting up when you try to enable Timer5, chances are you forgot to switch pages. That's a huge trap for new guys, and really annoying.

Other manufacturers do all sorts of weird stuff — STC is the biggest offender in the "strange hacks" realm: they maintain Timer0/Timer1 compatibility with classic 8051 MCUs, but they turn them into auto-reload ("period") timers by putting a "hidden" reload register, aliased with the timer's value register, that's only accessible when the timer is in 13-bit mode and stopped. It's clever, but pretty gross. They also quickly ran out of SFRs with their 6-channel 16-bit arbitrary-phase PWM module (which has almost 30 registers for configuration!), so they just gave up and dumped the whole peripheral into the end of XRAM somewhere. Hey, what do you want for a buck?

By the way, these new Tinys are a substantial improvement over the previous-generation ones I was using. PDI (UPDI? What's the difference?) feels like a normal debug interface now — no more switching back and forth between debugWIRE and weird ICSP mode to burn fuses. Also, it's nice to see much of the clock configuration done at run-time instead of with fuses. Really brings the platform in-line with other products.

I didn't realize how different the peripherals are, either. Even the GPIO port structure is completely different. Registers are grouped as offsets from base addresses, just like how most ARM peripherals work, which I've never seen on an 8-bit MCU.

Interesting to see separate SET and CLR registers — I thought AVR always had a set-bit/clear-bit instruction, so I'm not sure why this was done? Anyone have insight into this? By the way, no, these are not preprocessor trickery — these are individual registers.

Sorry if this is super old news to everyone, but I think it's interesting to see such dramatic changes in a family, and I'd love to hear any background information, if anyone has any details?

I don't think SBI & CBI have an RMW issue (such as you'd see in the old PIC ports)... if you SBI a bit, that bit gets set to 1 & the others will be unaffected.

I seem to remember some discussion regarding automatically clearing bits (such as IRQ flags) potentially being an issue, but can't find it now. The new instructions let you set (or clear) multiple bits at once--a great convenience.

...I found this old comparison between pics & avr:

I/O
Separate PORT and PIN registers avoid read-modify-write issues with capacitively loaded pins. (Although has any AVR user never spent time wondering why their input port isn't working because they used PORTx instead of PINx...? ).

I don't think SBI & CBI have an RMW issue (such as you'd see in the old PIC ports)... if you SBI a bit, that bit gets set to 1 & the others will be unaffected. I seem to remember some discussion regarding automatically clearing bits (such as IRQ flags) potentially being an issue, but can't find it now. The new instructions let you set (or clear) multiple bits at once--a great convenience.

I will just quote from a typical AVR datasheet:

Alternatively, ADIF is cleared by writing a logical one to the flag. Beware that if doing a Read-Modify-Write on ADCSRA, a pending interrupt can be disabled. This also applies if the SBI and CBI instructions are used.

(emphasis mine)

This means SBI/CBI read the whole byte, not just the bit that is modified. This can have side effects, normally on interrupt flags.

It's even more insidious than simply the sequence "IN ... ORI ... OUT" being non-atomic. (In fact, the specific example that Boxbourne gave is actually inaccurate, since most interrupt flags actually cannot be cleared by accidentally writing a "0" to them...)

In the AVR architecture, 32 of the I/O registers are directly bit-addressable. You can set and clear bits, as well as do some level of conditional branching, based entirely on the state of individual bits within the first 32 I/O registers. Bit setting and clearing on these 32 registers can happen in a single, atomic instruction, using the SBI and CBI op-codes.

But on most AVR's (the devices which are exceptions to this rule have notes in the Register Summary section of the datasheet), even these so-called bitwise operations actually operate on the whole register, even if only one bit is being changed. In a single instruction cycle (two clocks in this case), the whole I/O register is read into a scratch space, a single bit is modified, and the whole scratch register is written out to the I/O register again.

If at the point that the register was read in, there was an interrupt flag set, then that flag will be copied into the scratch space. Then the single bit will be modified. Then, the whole scratch space, including the interrupt flag, will be copied back into the I/O register. Writing a '1' to an interrupt flag generally causes that flag to be cleared, and thus you lose an interrupt. Even if you did it atomically.

Some AVR's have "fixed" the SBI/CBI op-code so that they truly only operate on single bits without any possibility of affecting the surrounding bits within the register.

This means SBI/CBI read the whole byte, not just the bit that is modified. This can have side effects, normally on interrupt flags.

In modern AVR, SBI/CBI touch only one bit in the target SFR. No other bits are affected. Anything introduced in the last 10 years. Older AVR cores were subject to RMW effects for the CBI/SBI instructions, including (I believe) the m16.

The quoted datasheet excerpt likely is a copy/paste error. I don't know if there's an authoritative list of which AVRs handle CBI/SBI as RMW. If there is one, I'd like to know.

"Experience is what enables you to recognise a mistake the second time you make it."

It does not have an instruction to set or clear an I/O bit; instead it has 32-bit Set/Clear I/O registers, one for each 16-bit port.

If you write a 32-bit value to that register, all the zeros in that value leave the I/O bits unchanged, while the ones either set or reset the corresponding bit on the corresponding output port.

So with a single asm instruction -- a regular port write -- you can set or reset anything from none to all of the bits in the output port without touching bits which should not be touched. Single instruction, same CPU cycle, intrinsic atomicity (is that a word? it sure is a combo of 9 letters).

Neat feature.

I'm not sure what happens when both the set and reset bits are written to. I think it would toggle the output bit, but I'm not sure.

Go on. The ARM7TDMI had steering registers before the millennium. The Xmega has had steering registers since 2007.
Your STM32F103 is vintage 2007. The new Tiny817 has got steering registers.
.
Oh, and the M3 STM32F103 is trashed by the Cortex-M4.
The actual GPIO performance depends on the actual PORT Silicon. The Xmega trashes M0 and a lot of M3.
.
David.

If you go back over several months you'll find various blogs on benchmarking various chip families.


Steering register is the name given to this mechanism. i.e. setting a bit, clearing a bit without the delay and atomicity difficulty of RMW.
My point was that this is nothing new. And the STM32F103 is a mature chip.
Furthermore, the Xmega is a similar vintage with similar mechanism and better GPIO performance.
Of course the 32-bit ARM core has a faster processing throughput.

Triggers a memory about a story floating on (or sunk deep into) the 'net.

It was about a uC occasionally losing bits in its I/O configuration registers. The whole uC (or FPGA?) kept running happily, just some outputs stopped outputting the right signals until the uC got a hardware reset. Then it worked perfectly for a while, so no hardware pins were blown, but the problem kept recurring.

Gotta watch those weak bits... Long ago, during college, a student built a pretty neat "robot" & was giving a demo to the campus reporter, who was taking a bunch of up-close action photos. Apparently the camera flash disrupted a few EPROM program bits, giving the robot a serious case of spasms, nearly tearing itself apart.