I have built a fully functional ARM7 prototype board based on the
Atmel AT91R40008 processor. Everything works fine, but the performance
of the processor is approximately 1/10th of what it should be. In a
simple in-SRAM memory write test, I first copy my code to SRAM, then
run out of SRAM and write blocks of 32 bytes to consecutive locations
in an unrolled loop, for a total of 9600 bytes (a simple test buffer);
I then repeat this loop 8 times so the scope can get a good lock. The
original C/C++ code and the disassembled ARM code are below for
reference. The key point is that, other than the looping overhead, the
instruction stream should be nothing other than fetch, decode, execute
of store-byte-immediate to internal SRAM, of the form:

STRB Rn,[ip,#dd]

In the worst case this should take 1-3 cycles per operation. I am
scoping this and getting a memory write every 40 - "FORTY" - cycles
approximately! This is bizarre. Of course the external bus interface
settings are irrelevant for the internal bus, and I am not pulling on
the external nWait pin. I hypothesize that the processor is in some
mode after reset and running slower? Maybe it has something to do with
the debug interface; I am not sure. Nothing I have found in all 3000+
pages of ARM docs leads me to any conclusions...

As another brief example, this is the C/C++ code for a max-speed I/O
toggle: I basically have a scope on one of the I/O pins, toggle it in
a loop at max speed, and then look at the waveform:

The remap has been performed and verified; the EBI wait states are
irrelevant for internal access, but of course they are set properly.
The remap operation, everything, is perfect. The code is running out
of SRAM at 0x00000000, it's on the chip, simple as that, and running
10-50 times slower than it should.

The AT91R40008 does not have a programmable PLL. As far as I can tell,
the only way to slow or stretch the clock out is to pull down nWait or
to put the system into debug mode, and I am doing neither... Here's
the actual chip for reference:

Hmm. The Atmel docs do say that byte and word access to the internal
RAM is a single-cycle operation. However, they also mention a mode
that lets you use the internal RAM to test applications that will
eventually go into flash. I wonder if that means that the processor,
when set up that way, also emulates the wait-state settings for the
external bus.

Another question is: if you are running the code in internal
RAM and are reading and storing bytes in internal RAM,
what external signals are you monitoring with the scope?

The EBI bus interface still outputs all the internal activity: the
address bus, control bus, etc. all still do their thing, only the chip
select lines nCS0-nCS3 will become active on an external address.
Also, all internal SRAM accesses are 0 wait state. However, we are
still talking about nearly two orders of magnitude slowdown here. Even
if there were gremlins and the system was talking to a flash or
external memory (which there is none) at 8 wait states plus the 8
data-float cycles, that would be 16 cycles per instruction more or
less; I am talking about 250-500 here, it's truly bizarre.

And conversely, if it involves external SRAM, the ARM core speed is
largely irrelevant once you run it faster than 1/SRAM_access_time.
So for 70ns SRAM you needn't bother trying to exceed 14MHz; for 70ns
16-bit SRAM that drops to 7MHz for STR or STMIA.

EBI setup is probably the one to watch.

As long as the byte lane strobes are wired up, word, half-word, and
byte accesses take the same time to word-wide memory. Indeed, with
narrower memory configurations STRB would be 'faster', since you're
not having to slice the oversized read/store up into multiple
accesses.
Sprow.

Out of curiosity, how does the ARM handle the transfer of a byte
to an odd address in 16-bit memory? Does it shift the byte to
bit positions 8..15, then do the equivalent of loading the
full 16-bit word from memory, moving in the high byte, then storing
the resulting 16-bit word back to memory? Or is there some other
mechanism? The method described would take two memory access
cycles---which could be one clock each, I suppose.

The bus interface does. That is outside the ARM core and varies
from one vendor to another. The most common method is to put
the value on bits 8-15 of the data bus and only assert the write
line for the "high" byte. If I were a betting man, I'd wager
that the value shows up on bits 0-7 of the data bus also, and
the only difference between a byte-write to an even address and
a byte write to an odd address is which of the two byte-write
lines goes active.

IMO, nobody in their right mind would do it that way.

Read the manual for the part in question. It will say exactly
how it's done.

--
Grant Edwards

The ARM shifts the data in the half-word; in the worst case you lose 1
cycle. However, the internal SRAM has no such restriction: it's 1-cycle
access for byte, word, quad. And of course we are running out of SRAM
internally -- this is totally on the chip, nothing but the chip, a 66
MHz clock, and a scope/LA watching everything. The key is the I/O pin
I am toggling with the loop:

while (1)
{
    write(1);
    write(0);
}

This loop, which assembles to 5 instructions, is toggling the I/O at
250-260 clocks per instruction, and of course this is the same for
external memory, EVERYTHING; there is something deeper going on than
the simple explanations. The ICE embedded in the ARM has to have
something to do with this. I have a scary feeling it's clocking the
entire boundary scan around to the JTAG port every cycle; there are
100 pins on the ARM, and I am getting a 100-times slowdown --
coincidence?

Cache would be nice, but there's no cache to be had in this case.
Things get more fun because if you continually cause cache misses, you
end up back at 1/SRAM_access_time while the cache line fills occur;
this is well documented in long boring papers elsewhere, I'm sure!

Moot, but worth saying.
I thought in the original post that you looked at it on a scope?
Sprow.