define a function Serial.beginPrescaler(int prescaler) that does the same thing as Serial.begin, but with the prescaler for USART as a parameter

My reasons are simple:

Arduino works with an 8-bit RISC, one should use 8-bit variables unless more bits are needed. Since the hardcoded buffer size is 128 bytes anyway, unsigned char is enough to store the head and tail. Immediate gain is half the size of the code and twice the speed.

The compiler compiles % RX_BUFFER_SIZE literally as a module operation (essentially division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed number). 16-bit division on a 8-bit RISC without a division operation is a disaster. The code is hundreds of bytes longer and takes hundreds of cycles. Because this code is in an interrupt routine that gets executed every time a byte is received it is critical to make it optimal. Note: it seems that the compiler does the optimization automatically when you use unsigned char instead of int.

This would be a convenience for advanced users. One can avoid the 32-bit division (another disaster) in the Serial.begin routine and it would be more apparent that the baud rates like 57600 and 115200 are not optimal for 16MHz clock.

I did some testing:

An empty sketch (empty setup() and loop()) with the standard library takes 976 bytes. When you do the first 2 modifications, it's only 852 bytes. That's a difference 124 bytes for everybody, no matter if they're using Serial. (The reason for that is the interrupt routine that is always linked). Also the other functions in Serial are somewhat shorter.

An empty sketch with one call Serial.begin(57600L) takes 1244 bytes. It's only 1052 with the 3rd modification. That's another cca 200 bytes saved.

But the main advance is the performance gain. The interrupt routine is heavily utilized when receiving data and with the current library it takes around 250 cycles only to read one byte. That's 15 microsecs. When reading 10 KB/s, 15% of the processor power is spend on that. With 1Mbaud, the routine is not fast enough to read the incoming bytes and they are dropped. Also the standard practice is to have if (Serial.available() >0) in the loop(), which takes as much as the interrupt. The modified routine happily works with 1Mbaud speeds.

use & RX_BUFFER_MASK instead of % RX_BUFFER_SIZE in the whole file :The compiler compiles % RX_BUFFER_SIZE literally as a module operation (essentially division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed number).

The current scheme allows a user to redefine the buffer size if they're low on ram, without having to understand or worry about having to pick a power of two. I'm not sure it would be a good idea to lose that capability.

I'm very unimpressed that the compiler did not manage to optimize this, even with rx_buffer_head and rx_buffer_tail redfined to UNSIGNED char as per your first suggestion :-(

I did manage to get it to generate andi instructions with some pretty heavy handed casting (that I don't like.) I want to do some more experiments with avoiding expressions that generate intermediate results that the compiler is unable to optimized as unsigned bytes.

// if we should be storing the received character into the location // just before the tail (meaning that the head would advance to the // current location of the tail), we're about to overflow the buffer // and so we don't write the character or advance the head. if (newhead != rx_buffer_tail) { 702: 80 91 07 01 lds r24, 0x0107 706: 98 17 cp r25, r24 708: 31 f0 breq .+12 ; 0x716 <__vector_18+0x38> rx_buffer[rx_buffer_head] = c; 70a: ff 27 eor r31, r31 70c: e3 5f subi r30, 0xF3 ; 243 70e: fe 4f sbci r31, 0xFE ; 254 710: 20 83 st Z, r18 rx_buffer_head = newhead; 712: 90 93 08 01 sts 0x0108, r25 : 720: 0f 90 pop r0 722: 0f be out 0x3f, r0 ; 63 724: 0f 90 pop r0 726: 1f 90 pop r1 728: 18 95 retiWish I understood WTF it's doing saving/restoring R0/R1; looks pretty pointless (makes one of them a known zero, that isn't used. Uses the other to save SREG, but it could have used one of the other registers it had to save anyway. (is this part of the calling sequence definition, that subroutines (except for ISR) can trash r0/r1? That'd make a little more sense, but they still oughta be optimized away!))

Huh. That's neat. The compiler/linker included with Arduino0013 eliminates C functions that aren't called, even if they're part of the same source code as C functions that ARE used, and apparently even with some level of indirection.

In the current case, my little serial test case didn't use serialRead, and neither HardwareSerial::read or serialRead show up in the object file. This held true for other functions too. For example, my sketch got 200 bytes shorter without the call to Serial.begin (which I THINK will leave the serial port at the 19200bps from the bootloader?)

Does anyone have a pointer to the specifics of how this dead function elimination works?

I guess it wouldn't be a big deal to note next to #define RX_BUFFER_SIZE 128 something like// use only 16, 32, 64, 128 or 256 !!!But I agree that % RX_BUFFER_SIZE is fool-proof.

To me, it seems to me that the compiler optimizes % to & when I use unsigned char instead of int. But writing it explicitly would force it.

Pushing and popping R0 and R1 from the stack is weird, especially when it doesn't do the same with the registers that it's actually using in the interrupt.

Why is 1Mbaud a bad idea? It works great for me. Obviously I'm not sending data to Arduino, but from Arduino to PC. I need a real time info about what it is doing, and that produces 50kB/s of data. That's easy to handle by my PC =) But I needed the Arduino's side to work to be able to decode short commands sent from PC. The little hack from my first post just did the trick. I think that everybody would benefit from a smaller faster core library.

To me, it seems to me that the compiler optimizes % to & when I use unsigned char instead of int. But writing it explicitly would force it.

It's rarely a good idea to play games like this: to pretend to know what the compiler will produce in machine-code for a given high-level code, and to then try to write high-level code that will produce a specific machine-code result. Sometimes, the compiler will be better than you.

Write what you intend from a functional point of view, and if you don't like the compiler's optimizations, then write in machine-code or improve the compiler. Source code is for the humans.

@halley: I cannot agree more. But I'm just proposing the simplest fix I can think of to make the core library routine around 5 times faster. For me, % gets compiled as & when I change int to unsigned char. So I would just keep it. But westfw had a problem with that and had to do some typecasting. If someone prefers to write it in assembly instead, sure, go ahead, but that'll be less accessible for even more people.

EDIT: So as westfw proposed, the code that appears to do exactly what is expected is:

I like my second patch better (the one that avoids having the c compiler have to deal with intermediate results. Casts are bad, especially for this sort of purpose, as Halley says. Whereas C's "promote to integer" "issue" is reasonably well known.)

Although, if we're going to change the serial library, I want an optional tx buffer too. It's obnoxious that Serial.print() is as slow as it is (and this isn't the code being slow, it's the wait-for-the-UART algorithm.)

@westfw: Sorry, I wasn't careful enough before. Even for me, % gets optimized to & even without typecasting, but the compiler then works with 16-bit words instead of 8-bits. Your suggestion works the best.

For the TX buffer, the following code works for me (for ATmega168). You can simply replace the serialWrite function in wiring_serial.c:

if ((!(UCSR0A & (1 << UDRE0))) || (tx_buffer_head != tx_buffer_tail)) { // maybe checking if buffer is empty is not necessary, // not sure if there can be a state when the data register empty flag is set // and read here without the interrupt being executed // well, it shouldn't happen, right?

// data register is not empty, use the buffer unsigned char i = tx_buffer_head + 1; i %= TX_BUFFER_SIZE;

// wait until there's a space in the buffer while (i == tx_buffer_tail) ;

tx_buffer[tx_buffer_head] = c; tx_buffer_head = i;

// enable the Data Register Empty Interrupt UCSR0B |= (1 << UDRIE0);

} else { // no need to wait UDR0 = c; }}

It is useful for lower baud rates (lower than 1Mbaud , because the interrupt itself takes 4us to execute, while with 1Mbaud it takes only 10us to send or receive a byte. With 1Mbaud, there might be an interference of the RX and TX interrupts causing the receiver to miss incoming bytes.

Thanks for the code; that'll be helpful.I had in mind modifying wiring_serial.c so that it was "smart" and didn't take up any more space if you had "#define TX_BUFFER_SIZE 0", although I have mixed feelings about conditional compilation. (The compiler will nicely optimize away C code that can never execute:

if (TX_BUFFER_SIZE > 0) { // buffered output code that is not included if // TX_BUFFER_SIZE is a constant zero} else { // Unbuffered output code} But I don't think that works for the ISR (Hmm. Need to check, especially with the newer compiler/linker. Bet it won't eliminate even empty, never-called, ISR routines...)