Building a better memcpy()

I’m wondering if it’s about time we revisited the old idea of creating a general-purpose memory copy SWI which will have its implementation tailored to give the best performance on the host machine. With the number of devices the OS supports (A7000, A7000+, StrongARM RiscPC, Kinetic RIscPC, Iyonix, Raspberry Pi, multiple Cortex-A8 boards, multiple Cortex-A9 boards, Cortex-A15) I’d be surprised if our 23 year old memcpy was optimal for any of them (Or any of the other bespoke memcpy’s that are littered around the code base – DMAManager, MbufManager, buffer manager, Wimp_TransferBlock, …)

Unfortunately for me this is another thing that falls into my “so much to do, so little time” category, so if I were to do it all myself it would probably never get finished. But the good news is that I know there are a few people here who like to dabble in assembler optimisation tasks, so maybe if we were to join forces we’d be able to get somewhere pretty quickly. We might not end up with implementations that are optimal for all targets, but as long as we have at least a couple of implementations which beat the current one I’d class it as a win (and more can easily be added later).

So what are people’s thoughts? Would anyone be interested in a bit of friendly competition to find the best memcpy() implementation? I could knock together a testbed app to rank the different entries in terms of performance (taking into account the host machine type, block size, source/dest alignment, source/dest cacheability), and then we could all enter a routine or two which we’ve tailored towards our own platform of choice. The winner would receive fame and fortunecitation needed by having their code included in the ROM. (We’d also need a volunteer or two to run the testbed app on some of the different hardware targets, since I don’t own all of them myself).

In terms of the actual implementation, I’m thinking the requirements for the routines should be:

This is effectively the same as C memcpy() except it doesn’t return the dest address in r0 (that’s a C-ism which most code won’t be interested in)

May be called from both privileged and unprivileged modes

Must not write outside the destination area (e.g. when dealing with unaligned dest, you can’t read-write-modify the first/last word. This is to make sure the code is IRQ-safe)

Must cope with overlapping regions

The intention will be to make this code available via both a SWI and a direct function call (allowing the SWI despatch overhead to be avoided in time-critical code). Perhaps also including a “return me an optimal routine for these parameters” call so that if you’re going to be repeatedly copying data with similar parameters you can get a routine which will skip the initial call-specific checks (e.g. where it picks the right routine for a given length, cacheability, alignment, etc.). This is the reason for the “may be called from unprivileged modes” requirement.

Since the testbed app will be measuring the performance of the different routines in terms of transfer length, cacheability and alignment, the individual routines which people implement and submit don’t necessarily need to worry about these things themselves. So if you find a great way of copying page-sized, page-aligned blocks then you could submit that and we’d configure the testbed to only test it under those circumstances. Similarly, if your code doesn’t cope with overlapping regions then we could teach the testbed about that. Then once we have all the results collected we can build all the routine-specific rules and the performance info into the memcpy() SWI so that it picks the right routine for each situation (and then re-test using the testbed to make sure the extra overhead doesn’t make things too slow!)

We don’t have any machines with >2GB of RAM to test with (or at least, the logical memory map doesn’t have a >2GB chunk of free space we can use for testing)

If the length is <2GB you can treat it as a signed number, which might make things a bit easier to optimise

If we do end up with a suitable machine in the future then it should be easy enough to retest everything and fix/workaround any routines which don’t work with large lengths (e.g. fix the code, or just split it into two <2GB transfers)

If you specify r2 as a uint32_t and can think of an algorithm that benefits from signed integers you can split a 4GB transfer into 2 2GB transfers in the implementation. That way the user doesn’t have to do the spitting.

Perhaps also including a “return me an optimal routine for these parameters” call so that if you’re going to be repeatedly copying data with similar parameters you can get a routine which will skip the initial call-specific checks

You might also want this API feature to be able to skip any ‘pointless’ sanity checks on the parameters you’ve passed in (e.g. null pointers, zero length, etc.) where you already know that you’ve taken care of that yourself.

One of the requirements I listed is that the routines should be able to cope with overlapping regions, so that we can use it for memmove as well.

Yes, that’s the ideal answer. However, if for some reason you can’t figure out a good overlapping version of your routine but you believe your non-overlapping code is super-amazing, I imagine you could always fall-back in the overlapping case to someone else’s memcpy implementation that can handle it. Just having another data point would be valuable.

1) create a swi that returns the 4 entry points. This is conditionally compiled depending on the machine ie the implementation is different on every machine.
2) Rip the code out of sharedClib and use that for now.
3) sharedClib uses the new swi if it is available otherwise it falls back to its existing code – that way it’s not dependent on the new swi being available.

Now you can change the new swi implementation of any of the routines any time you like without affecting programs which use it – other than speeding them up homefully.

I don’t think a memcpy SWI is a good idea. SWIs run in supervisor mode. A bug in a user mode app that calls memcpy with the wrong pointer will merrily write over I/O devices or whatever it might be. As well as an obvious privilege escalation vulnerability, it’s highly likely to cause pain and suffering to somebody along the way.

I don’t have an objection to calling a SWI that returns a function pointer, but the code must run in the same mode as the caller. That means if it tries to access something it shouldn’t, then it data aborts as would be expected. If the routine has a C function name embedded, that also means that unwinding the C stack will return a sensible backtrace.

is pretty nasty to the pipeline (and cache?), so I wonder what the overhead of this will be? Otherwise we’re just reinventing shared libraries here, except without a linker to fixup the branches at load time.

How about fixing the SharedCLibrary memcpy to Do The Right Thing (since many apps already use that and it’s already dynamic linking of a kind) and simply provide a more friendly API to get the function pointer without having to do a full SCL initialise?

@Jeffrey: Sounds like a good idea…same problem here, hope to find time for that ;-)

Meanwhile there’s an interesting in depth article here about that on the Cortex A8, Link here

It really depends on the specific CPU, I think. In some small tests with the Cortex A15 and A9 I found that especially NEON memory transfers are much faster than ARM transfers when the memory to be copied resides within the 2nd level cache.

Also, FYI, one of my colleagues has been looking at speeding up things like memcpy() using hardware (not on RISC OS). His conclusion is that Unixy-type apps don’t use memcpy() very much, so the performance benefit is limited (pretty much in the noise). RISC OS is a different thing, of course, but not to get hopes up as to how much improvement there will be. OTOH we may be underestimating how badly some bits of RISC OS are coded ;-)

I don’t think a memcpy SWI is a good idea. SWIs run in supervisor mode. A bug in a user mode app that calls memcpy with the wrong pointer will merrily write over I/O devices or whatever it might be. As well as an obvious privilege escalation vulnerability, it’s highly likely to cause pain and suffering to somebody along the way.

Good point.

So we’ll probably want two forms of the “return function pointer” SWI – one which returns a generic memcpy (i.e. which does all the size/alignment checks each time it’s called) and one for returning the optimal memcpy for given parameters.

Doing a double branch is pretty nasty to the pipeline (and cache?), so I wonder what the overhead of this will be? Otherwise we’re just reinventing shared libraries here, except without a linker to fixup the branches at load time.

SCL memcpy() already involves a double branch of exactly that form, because it has to go via the stubs. The only way I can think of avoiding it would be to make memcpy an inline function which loads the real memcpy pointer from a global variable. This inline function could also handle returning the ‘dest’ value, allowing (a) the function pointer to be a pointer directly to the kernel memcpy for platforms which support it (otherwise, SCL memcpy) and (b) the compiler can optimise out the returning of ‘dest’ for the situations where it isn’t needed.

How about fixing the SharedCLibrary memcpy to Do The Right Thing (since many apps already use that and it’s already dynamic linking of a kind) and simply provide a more friendly API to get the function pointer without having to do a full SCL initialise?

Yes, that could work. Although it might be nice if it was in the kernel, simply because there are some cases where the kernel would benefit from using it (rectangle copy, sprite plotting, copying pages when a dynamic area claims a page that’s already in use, etc.)

Also, FYI, one of my colleagues has been looking at speeding up things like memcpy() using hardware (not on RISC OS). His conclusion is that Unixy-type apps don’t use memcpy() very much, so the performance benefit is limited (pretty much in the noise). RISC OS is a different thing, of course, but not to get hopes up as to how much improvement there will be. OTOH we may be underestimating how badly some bits of RISC OS are coded ;-)

Yes, I think most apps realise that not copying data is quicker than copying data. But at OS level where you have to deal with shepherding data into and out of hardware I/O buffers, software acceleration of graphics ops, etc. using a fast memcpy can make a big difference. Although it would obviously be nice if we can also cut down on the places where data is being copied around!

Meanwhile there’s an interesting in depth article here about that on the Cortex A8

I came across that when I was looking to see what else I might do with VFP/NEON apart from floating point. It looks as though usng NEON registers with pre-load would be profitable and you do not have to stack ARM registers. On the other hand, you do have to set up the VFP context, I think.

I am pretty naive in this area, but I used the routine in the PRM offered for file operations when adding memory transfers to Basalt. It does all the housekeeping and seems to run pretty fast compared to Wimp_TransferBlock, which I had thought of using. I have no idea how this compares to memcopy(), but it would look like a good option for all the early processors.

Or any of the other bespoke memcpy’s that are littered around the code base – DMAManager, MbufManager, buffer manager, Wimp_TransferBlock, …

Add to that RamFS and CDFS. Actually, the CDFS case is worth a lingering look – it used to be implemented as CD_ByteCopy as a SWI, but eventually somebody worked out it was slower than just calling with BL (I think the build switch is still in there). Having single stepped through the SWI despatch just the other week it’d have to be a big memcpy() for it to be worth a SWI, but the function pointer return should be fast since that’s no different to branching to the C library.

We don’t have any machines with >2GB of RAM to test with (or at least, the logical memory map doesn’t have a >2GB chunk of free space we can use for testing)

There is, I believe, one big ARMSOC out there that has 4 GiB of RAM.

I think the watch word was logical address space, with things like RamFS peeing logical address space out of the window, address space exhaustion is already here. Some of the newer machines have 40 bit LPAE, but not RISC OS just yet.

I don’t think a memcpy SWI is a good idea. SWIs run in supervisor mode

If the caller’s mode is known (wherever the SPSR is) you could always drop down to user mode, or use LDRT/STRT so that it’d abort as though you had user privileges. The downside to sitting in SVC mode while chewing through memory is no callbacks fire.

His conclusion is that Unixy-type apps don’t use memcpy() very much

Certainly the ROOL (Norcroft) C compiler inlines memcpy pretty often, though I don’t know what its heuristics for choosing that over the library copy are. But there are plenty of other places that RISC OS copies reasonable chunks of memory around in the graphics/file system/network stack.

One slight downside for any modules wanting to try faster memcpy is they’d need to carry at least a simplistic fallback copy if they’re provided as disc loadable modules for versions that don’t implement the proposed API.

In terms of the actual implementation, I’m thinking the requirements for the routines should be

What’s the stance on the following extra rules?

Can the NEON or other coprocessor registers be used?

How would the routine get hold of any context (eg. for workspace)?

Can it be non blocking (eg. if there’s a memory to memory DMA controller or a spare CPU core hanging around)?

Yes, although I haven’t decided yet on the best way of handling the context (two versions of the NEON routines, one which assume a suitable context is set up and uses AAPCS register clobbering, one which sets up the context itself?)

How would the routine get hold of any context (eg. for workspace)?

R12 would be the RISC OS way of doing things, but would break the APCS compatibility. But then again that might not be a major issue, since most people wanting to use it from C will just be able to rely on the builtin memcpy() wrapper

Can it be non blocking (eg. if there’s a memory to memory DMA controller or a spare CPU core hanging around)?

In the long term: yes (extending DMAManager to support memory-to-memory DMA being another thing that’s been on the todo list for a while). But for the moment: No. We need to aim for something memcpy-compatible, so it needs to be blocking and shouldn’t alter IRQ/FIQ state.

I fear that putting all of this criteria into a single function might land us with the SpriteOp of memory copies – a very capable function, but not one that is necessarily optimised for any given case (due to having to cater for so many variables).

That’s basically what we’re aiming for. The generic memcpy will actually branch to one of many different implementations depending on the input parameters (much like the current SCL memcpy). And for the cases where you know your requirements beforehand there’ll be the “give me an optimal memcpy” call which will look at all your parameters and then return you a memcpy that’s optimal for that case (and will most likely do bad things if you change the data alignment etc.)

As has been mentioned, you could do worse than starting by translating the memmove I wrote for Raspbian from the horrid GAS assembly syntax that Linux uses to objasm format. It managed a substantial speed boost over the (also supposedly hand-optimised for ARM11) version they’d been using previously. I’d have done it myself by now if I had the time, too! The main thing you’d need to watch out for is its use of unaligned loads – smaller copies benefit from the fact that Linux always configures them to ARMv7 style. On RISC OS, we’ve gone and made it a user option whether they abort or not and if they don’t, how they work; you can’t even read whether or not they would abort from USR mode so you probably have to avoid them altogether, just in case.

I’d reiterate what’s been said about the fact that you’d need multiple routines. My version isn’t even optimal for ARM11’s in general – it has special hoops it has to jump through to work at its best with the L2 cache in the BCM2835.

I wouldn’t get too hung up about limiting the length to 2GB. If you’re working in assembler (which you surely would be) there’s no good reason not to use the carry flag to do loop termination, which would allow for copies up to 4GB-1.

Is there really a need to specify not to use read-modify-write? That sort of implies the destination pointer is a volatile *, which isn’t the case for C memcpy at least. Though in practice, I’d be rather surprised if the fastest implementation did read-modify-write anyway, as that would mean having to do a cacheline load rather than just go direct into the write buffer (most ARMs don’t seem to have allocate-on-write caches).

Overlapping regions – yes, it’s just the overhead of a few compares and a branch, but if you know you have memcpy semantics rather than memmove (e.g. the pointers were restrict *) then you might as well save those cycles and have separate entry points for the two cases.

You could benefit even with the same CPU for different routines depending upon whether the source is uncached / uncacheable, or most likely in the L2 cache, or most likely in the L1 cache, and orthogonally, whether the destination is uncached or in the L2 or L1 cache, whether the cache is configured writeback or writethrough, and if the memory is bufferable or not (IIRC that makes a big difference to write merging).

Agree with it not being desirable to wrap them all in SWIs – not just about privilege escalation, but also dispatch time. LDRT/STRT versions might be useful, though not that many RISC OS APIs require that pointers specifically be USR-mode-accessible. We do have an advantage over Linux there because SVC-mode code isn’t preemptable so we only need do LDRT/STRT once per page; the other transfers within each page can be done 64 bits per cycle on anything with an internal 64-bit data bus (ARM11 and upwards).

The NEON register file is often worth the effort on CPUs that have it, if only because you free up more integer registers for a more complex preload scheme. The exception would be if the data is in the L1 cache (e.g. if you’re tiling the same small sprite).

It’s also worth noting that on newer ARMs (Cortex-A15 or newer) you start getting automatic background prefetches, which actually means it’s less likely that hand-optimising things like memcpy are actually worth the effort because it’s such a predictable pattern of loads and stores. You might find that the “downwards” case of memmove would still benefit from prefetch though – I’m not sure if the prefetchers are bright enough to handle those well.

Thanks for your thoughts, Ben! Hopefully we can at least get one “better than current norm” implementation for each platform before we get too bogged down with situation-specific variants.

Is there really a need to specify not to use read-modify-write? That sort of implies the destination pointer is a volatile *, which isn’t the case for C memcpy at least.

It’s not about whether the destination buffer is volatile, it’s about whether the area around it is in use by another thread. The C standards seem to be somewhat vague on the rules, but I’d be surprised if it would be considered acceptable for a memcpy implementation to write to arbitrary bytes surrounding the destination buffer (unless it’s an atomic read-modify-write, with the bytes not being modified). After all, where would you draw the line on how much extra memory can be touched? Word boundary so you can use LDR/STR? Doubleword boundary so you can use LDRD/STRD? Quadword boundary so you can use NEON? etc.