These measurements were made on a Teensy 2.0. They include the looping and array-write overhead inside Print, plus the burden of a 32-bit variable in the test code that holds the start time. The overhead of actually storing the data in the output device's buffer is not measured, so you should see very similar results on official Arduino and Teensy.

There is another speed optimization to make when you are using divmod10 to convert a long into base-10 digits: only the first few digits of a 10-digit number need full 32-bit precision. Once the high byte is 0 you only need 24-bit precision (saves at least 10 cycles), and once the two highest bytes are 0 you only need a 16-bit divmod10 (saves over a microsecond). Once the three highest bytes are 0, the 8-bit divmod10 takes only a microsecond, and for the very last digit you don't need to do a division at all.

The downside is an increase in code size of about 200 bytes. I've attached a modified Print.cpp including all the assembler macros. My profiling indicates it's 0.5us-1.5us faster per digit on average (which adds up to 7us faster for 5-10 digit numbers). I have not fully tested for accuracy yet.

I have used the ideas here to improve the speed of logging data to SD cards. I wrote functions to print a numerical field to an SD card that are as much as five times faster than the standard Arduino print.

I am now able to log data from four analog pins on an Uno to an SD card at 2000 Hz. That is 8000 ADC values per second. This is near the limit, since each conversion takes about 110 microseconds, which caps the ADC at roughly 9000 conversions per second.
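For anyone who wants to check the headroom, the arithmetic can be written out as below. The constant and function names are made up for illustration; the numbers are the ones quoted in the post, and the 110 microsecond figure is the approximate value given there, not a datasheet number.

```cpp
#include <cstdint>

// Figures quoted in the post above (names are hypothetical).
constexpr uint32_t kChannels = 4;        // analog pins being logged
constexpr uint32_t kSampleRateHz = 2000; // samples per second per pin
constexpr uint32_t kConversionUs = 110;  // approximate time per ADC conversion

// Total ADC conversions requested each second.
constexpr uint32_t conversions_per_second() {
    return kChannels * kSampleRateHz;
}

// Microseconds per second that the ADC spends converting.
constexpr uint32_t adc_busy_us_per_second() {
    return conversions_per_second() * kConversionUs;
}

// Upper bound on conversions per second if the ADC never idles.
constexpr uint32_t max_conversions_per_second() {
    return 1000000u / kConversionUs;
}

static_assert(conversions_per_second() == 8000, "4 pins at 2000 Hz");
static_assert(adc_busy_us_per_second() == 880000, "ADC busy 88% of the time");
static_assert(max_conversions_per_second() == 9090, "ceiling at 110 us each");
```

So the ADC alone is busy about 88% of the time, which leaves roughly 15 microseconds per sample for formatting and writing to the SD card.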