I put a bad picture of the small one on the wiki a while back.
If you follow the links for the variant below, there's a 'user guide', a schematic and sample code (not sure for which board).
The code is probably for SPL and uVision5 (??).

I haven't finished it yet, but so far I understand that the job of saving the SP and registers is left to the called routine, not the calling one.
Since, from the loader, the called routine is the startup assembler code that the linker script places at the start, and we already change the SP before calling it, I think we need to save that state manually before the call.
R0-R3 do not need to be preserved by the called routine. So we need to do this from the loader:
1. Save R4-R11 to the stack
2. Save SP to a known IRAM position (just a pointer type variable should work).
3. Save a return address somewhere (exRAM or IRAM?).
4. Change SP
5. Call App
----
6. (Return address pointing to this instruction). Load SP from the variable in step 2.
7. Pop R4-R11 from stack.
8. Reconfigure NVIC VTOR.
9. Enable interrupts

From the app, to return, we have to:
1. Disable interrupts.
2. Read the address saved in step 3 above.
3. Load it into the PC; that will take the PC to step 6.
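
The two lists above are close to what setjmp()/longjmp() already do in C: setjmp() records the callee-saved registers, the SP and a return point, and longjmp() restores them and resumes there. Below is a minimal sketch of that idea, not xLOADER code. All names (loader_ctx, loader_run_app, app_return_to_loader, the demo apps) are hypothetical, and the SP switch, VTOR and interrupt steps are only marked as comments. On the real board loader_ctx and app_status would also have to live in RAM that the app's startup code does not overwrite.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical names; this only illustrates the save/restore idea. */
static jmp_buf loader_ctx;          /* loader steps 1-3: regs, SP, return point */
static bool loader_ctx_valid;       /* true while an app is running */
static volatile int app_status;     /* value handed back by the app */

typedef void (*app_entry_t)(void);

/* App side: the three "return" steps. */
void app_return_to_loader(int status)
{
    assert(loader_ctx_valid);       /* only legal while the loader is waiting */
    /* On target (assumption): disable interrupts here. */
    app_status = status;
    longjmp(loader_ctx, 1);         /* restores SP and registers, jumps to step 6 */
}

/* Loader side: returns the app's status, or -1 if the app simply returned. */
int loader_run_app(app_entry_t entry)
{
    app_status = -1;
    if (setjmp(loader_ctx) == 0) {  /* steps 1-3 */
        loader_ctx_valid = true;
        /* On target (assumption): step 4, switch SP to the app's stack. */
        entry();                    /* step 5 */
    }
    /* Steps 6-7 are done by longjmp(). */
    loader_ctx_valid = false;
    /* On target (assumption): steps 8-9, restore VTOR, enable interrupts. */
    return app_status;
}

/* Two stand-in "apps" for trying the sketch on a PC. */
static void demo_nested(void) { app_return_to_loader(42); }
void demo_app_jumps_back(void) { demo_nested(); }
void demo_app_plain_return(void) { }
```

With these definitions, loader_run_app(demo_app_jumps_back) returns 42 and loader_run_app(demo_app_plain_return) returns -1. Note this does nothing about peripheral state.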

The peripherals' state will be indeterminate at that point, especially the USB peripheral and any other that uses a buffer, since the buffer address was probably changed, along with other configuration registers.

EDIT: The more I think about it, the more I think it may be better to reset the MCU to restart the program in flash when the app is finished, like Roger suggested; otherwise there are too many things to control.
Of course you can make sure the loader and the app do not use the same peripherals, so the loader's peripherals have not changed state on return, but that limits what you can do. It seems better to just reboot and let the loader pick a new app.
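
For the reset route, a Cortex-M core is reset from software by writing the key 0x5FA plus the SYSRESETREQ bit to the SCB AIRCR register at 0xE000ED0C, which is what the CMSIS function NVIC_SystemReset() does. A small sketch of that write follows. The helper names are made up for illustration, and the register is passed in as a pointer so the logic can also be tried on a PC with an ordinary variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AIRCR_VECTKEY        (0x5FAu << 16)  /* write key, bits 31:16 */
#define AIRCR_PRIGROUP_MASK  (0x7u << 8)     /* priority grouping, bits 10:8 */
#define AIRCR_SYSRESETREQ    (1u << 2)       /* system reset request */

/* Hypothetical helper: value to write, keeping the current priority grouping. */
uint32_t aircr_reset_value(uint32_t current_aircr)
{
    return AIRCR_VECTKEY
         | (current_aircr & AIRCR_PRIGROUP_MASK)
         | AIRCR_SYSRESETREQ;
}

/* Hypothetical helper: request the reset through the given AIRCR pointer. */
void request_system_reset(volatile uint32_t *aircr)
{
    assert(aircr != NULL);
    *aircr = aircr_reset_value(*aircr);
    /* On target the caller should spin here until the reset takes effect. */
}
```

On the board the call would be request_system_reset((volatile uint32_t *)0xE000ED0CUL); followed by an endless loop. The loader then starts from flash again with all peripherals in their reset state.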

Last edited by victor_pv on Thu Jan 12, 2017 1:36 pm, edited 2 times in total.

You can make the SdFat buffer much bigger, if you haven't already. It is configurable, and that should speed up reading and writing files, as bigger portions of the FAT can be kept in that buffer.

The jump back from the APP via reset may work. From a long-term perspective I would vote for a mechanism like the one you have described (store/restore the xLOADER's context). The reason is that sometime in the future you may decide a "context switcher" could handle the APPs.
In big OSes the kernel is physically isolated from the user's APPs, so both the kernel context and the APP context are preserved.
We may agree that the only peripheral xLOADER uses is Serial1, which is the xLOADER's console. I already have a CL parser working in the xLOADER; I will update xLOADER soon.