the glue code took most of my time, as i first had to understand what (the hell?) is going on in there. there are some major problems when function calls are made, for example calling something in libc from the "virtual machine". my current solution, which is basically passing an address table around in assembly, may provoke some facepalm-like gestures in certain developers.

mind that this is a soft-float port to ARM, which will run slowly, but on pretty much everything. a VFP version could possibly be branched out of the same build, while FPA and FPE do not make much sense to implement in my opinion, since support for them is minimal (afaik).

only some basic operators and functions are implemented at this point, but the semantics are in place.

i was curious about the performance of software floating point compared to hardware, so i had to run some tests. not having a real ARM device (only a simulator or a VM), the only adequate way to get at least somewhat accurate measurements in my case was to see what happens when x86 handles optimized software floating point and draw some conclusions from that.

instead of looking for the GNU build of their soft-float library i wrote a quick version of floating point addition that takes into consideration everything the FPU might do, such as checking for NaN and infinity and rounding to nearest as the default rounding mode. i used some compensation trickery in the measurement code to cancel out any small deviations caused by compiler optimizations, pipelining or OoO execution (if that is even possible). this is greatly simplified on a single-core x86 with the TSC, if you can get the OS into a passive mode.
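for illustration, a rough sketch of such a soft-float add (my own, not the actual test code; same-sign, normal single-precision operands only, with NaN/infinity special-cased and round-to-nearest-even) could look like this:

```c
#include <stdint.h>
#include <string.h>

/* illustrative sketch, not the original test code: soft-float
   single-precision addition for same-sign, normal operands, with
   NaN/infinity special-cased and round-to-nearest-even rounding. */
static uint32_t f32_bits(float f) { uint32_t u; memcpy(&u, &f, 4); return u; }
static float    f32_val(uint32_t u) { float f; memcpy(&f, &u, 4); return f; }

float soft_fadd(float fa, float fb)
{
    uint32_t a = f32_bits(fa), b = f32_bits(fb);
    uint32_t ea = (a >> 23) & 0xff, eb = (b >> 23) & 0xff;

    if (ea == 0xff) return fa;              /* NaN or infinity propagates */
    if (eb == 0xff) return fb;
    if (ea < eb) { uint32_t t = a; a = b; b = t; t = ea; ea = eb; eb = t; }

    /* implicit leading 1 plus 3 guard/round/sticky bits */
    uint32_t ma = ((a & 0x7fffffu) | 0x800000u) << 3;
    uint32_t mb = ((b & 0x7fffffu) | 0x800000u) << 3;
    uint32_t shift = ea - eb, sticky;
    if (shift >= 27) { sticky = 1; mb = 0; }   /* b entirely below the guard bits */
    else { sticky = (mb & ((1u << shift) - 1)) != 0; mb >>= shift; }
    mb |= sticky;

    uint32_t m = ma + mb, e = ea;              /* same sign: magnitudes add */
    if (m & (1u << 27)) { sticky = m & 1; m = (m >> 1) | sticky; e++; }

    /* round to nearest, ties to even */
    uint32_t lsb = (m >> 3) & 1, g = (m >> 2) & 1, rs = m & 3;
    m >>= 3;
    if (g && (rs || lsb)) m++;
    if (m & (1u << 24)) { m >>= 1; e++; }

    return f32_val((a & 0x80000000u) | (e << 23) | (m & 0x7fffffu));
}
```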

GCC -O3 does a great job optimizing the function into something that might be considered "difficult to follow" x86 assembly (not that x86 normally is), but the performance is excellent. while these numbers will be completely different on ARM CPUs (and overall the code will be much slower), i cannot confirm that hardware floating point arithmetic is thousands of times faster than software, a claim i took from various small articles and more explicit hardware documentation. i would speculate a 10-30 times faster execution for VFP's FADD over an unoptimized software version on ARM.

if someone is interested i can post the test code.

p.s.
i was able to fry something on my MB/AGP port, so currently my graphics card only runs in VGA mode, but i guess i will continue the ARM port slowly once i have a better platform to work on (unfortunately this affects my job-work as well). to my surprise, watching a low-res "modern" video in a native player and low-res flash (e.g. youtube) works ok even without hardware acceleration and high AGP transfer rates.

Very cool! I'm about to push some new EEL changes online, including a bytecode interpreted mode (that is portable)... Now I'm tempted to go find a Raspberry Pi to help port the native ARM version (with FPU I hope?). Sorry if all of our EEL changes cause merge hell :/

no problem,
there isn't much of a trouble merging, really...

the CPU in the Raspberry Pi is a bit outdated - an ARM1176JZF-S - but it has a VFP unit and is good enough for development. i wanted to get soft-float support in because, unlike the x87, which will probably be there for quite some time, ARM might decide to deprecate the VFP unit at some point to save die space (and thus force use of the newer NEON SIMD only, or come up with something else). there are a lot of ARM CPUs that have different floating point logic and are simply not compatible (VFP, NEON, FPA, FPE).

the register exchange (CPU-COP) and overall the instruction sets are pretty straightforward.

i wouldn't consider working on a mobile device, unless it's possible to attach a real monitor, mouse and keyboard to it. also, i don't think serious programmers can be convinced that Android or iOS is better than something like Debian for development.

for the sake of running on a mobile device i did run a previous build of EEL2 on an Android phone, but then the build broke at some point. :\

ah nice, and I imagine you can mark pages as executable when jailbroken, too eh?

Check out ldid (on Cydia), a tool that Jay Freeman (aka saurik) wrote; since Apple introduced their code signing requirements, it is very useful for bypassing them so an iPhone can execute binaries.

Rpetrich made an OS X port as well, so you can also add this step to a desktop build workflow before moving stuff onto a device. This way you can script all the required commands (e.g. make, chmod +x, ldid -S, scp) into a 1-click building/testing cycle.

the glue code needs some more work, but at least it compiles/runs now.

there are some slight differences from x86/ppc, since in all places i directly modify the pc/link register instead of branching ("b"). this should technically be slower, but gives a 32bit jump. the reason was that bx was giving me some strange results (thumb mode) and, on the other hand, gas translated "bl" to something similar, if i recall.

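for illustration, one common way to get a full-range jump by writing pc directly is the "ldr pc, [pc, #-4]" idiom followed by a 32bit literal. a sketch of emitting it (the buffer layout is my assumption, not the actual glue):

```c
#include <stdint.h>

/* sketch: emit "ldr pc, [pc, #-4]" followed by a 32-bit literal --
   a full-range absolute jump, unlike b's limited relative range.
   0xe51ff004 is the standard ARM encoding of that idiom; how the
   glue actually lays out its buffers is an assumption here. */
static void emit_abs_jump(uint32_t *buf, uint32_t target)
{
    buf[0] = 0xe51ff004u;  /* ldr pc, [pc, #-4]: pc reads 8 bytes ahead, so -4 hits buf[1] */
    buf[1] = target;       /* 32-bit absolute destination address */
}
```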

--

Very cool! I'm learning a lot reading this...

Unfortunately I think we'll need to do some more tweaks to the code calling the glue, to support storing the offset elsewhere (in a data block, perhaps), because this code:

Actually (duh!), those jump instructions can be the 26 bit relative versions -- the addresses passed are relative anyway (but they are in bytes rather than dwords, which may need some tweaking). GLUE_MAX_JMPSIZE should be defined to the ~16 million max... I will update the calling code to use a GLUE_JMP_SET_OFFSET(instruction_end_buffer,offset) rather than having it directly replace the address using the GLUE_JMP_TYPE / GLUE_JMP_OFFSET / GLUE_JMP_OFFSET_MASK values (since the latter requires the address to fit in its own int or short).
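For illustration, patching a 24-bit relative branch in place could look roughly like this (a hedged sketch -- the real GLUE_JMP_SET_OFFSET semantics may differ; this assumes the byte offset is measured from the branch instruction's own address):

```c
#include <stdint.h>

/* sketch: rewrite the signed 24-bit word offset of an ARM B/BL
   instruction. byte_offset is relative to the instruction's own
   address; ARM's pc reads 8 bytes ahead, hence the -8 adjustment,
   and the offset field is in words, hence the >> 2. */
static void arm_set_branch_offset(uint32_t *instr, int32_t byte_offset)
{
    int32_t words = (byte_offset - 8) >> 2;  /* pipeline offset, bytes -> words */
    *instr = (*instr & 0xff000000u) | ((uint32_t)words & 0x00ffffffu);
}
```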

I'd imagine that Thumb mode shouldn't even be considered, since RAM use isn't a concern. Also I'd be curious whether loading constants via PC-relative addressing and the associated branch is worthwhile; probably it would make more sense to either a) encode as 4 instructions (ugh), or b) make each codehandle have a table of pointers to load from (provided the count is small enough to be addressable). The latter is something I've considered doing for PPC, too, but it doesn't quite seem worth it as PPC can do constant 32 bit loads in 2 instructions...

yep, no thumb mode. the current scheme also would not work with it very well, since the port depends on 4byte offsets (and is using r8). the mode switching in itself is a bit confusing, complemented by the cpu model naming scheme that arm uses.

as far as i know the pc-relative method of loading is the safest and the only way to load a full 32bit value.
there is also mvn (move + not), which can handle for example:
ldr r0, =0xffffff00
as:
mvn r0, #255
but will not work for 0xfffffe00.
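whether a value (or its complement, for mvn) fits ARM's immediate form - an 8bit value rotated right by an even amount - can be checked like this (illustrative sketch):

```c
#include <stdint.h>

/* sketch: ARM data-processing immediates are an 8-bit value rotated
   right by an even amount. mvn can load v when ~v is encodable:
   ~0xffffff00 = 0xff fits, but ~0xfffffe00 = 0x1ff (9 bits) does not. */
static int arm_imm_encodable(uint32_t v)
{
    for (int rot = 0; rot < 32; rot += 2) {
        /* rotate left by rot undoes a ror by rot */
        uint32_t r = rot ? (v << rot) | (v >> (32 - rot)) : v;
        if (r <= 0xffu) return 1;
    }
    return 0;
}
```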

gcc seems to use it quite a lot even for smaller values. this is a dump of the end of the <main> branch:

the second method you propose is something i've considered as well. there is already an address table dumped into a pool in GLUE_CALL_CODE (but it really should be in c, i think, and passed as an __asm parameter like you do with "consttab"). the table itself is passed to the nseel_asm_... methods to provide some function pointers, because i wasn't able to get the correct addresses for them in any other way. 256 values would hardly be reachable at this point, i think.

this would take loading a full double down to 2 instructions (or ~4 cycles (edit)) instead of 4.

this is a nice hack - using the s suffix and encoding 24 bits - and it can certainly work with this scheme. i think there has to be another instruction when setting the smaller (8bit) portion of the desired 32bit value though, because we cannot use the barrel shifter and set an immediate in one instruction, for example:

while the naive version may suffer from a lack of pipeline optimization opportunities, the previous version has two potential stalls: one at ldr and one at beq.

i've been reading more on how ldr works at the cpu/mpu level and it does depend on a lot of factors. it will normally take 2 cycles, but it can take one cycle if it can be pipelined - in the case where no operation involving the loaded register follows immediately. this is somewhat difficult to achieve if constant loading will be a macro.

using ldr rx, [pc, #n] can be a 2-3 cycle operation, and if a stall occurs it will be caused by the "fetch unit" (fetch-decode stage). if cache performance is a consideration here, it would be interesting to compare the benefits (if any) of using a global pool (its address stored in a register, e.g. GLUE_CALL_CODE) against pc-relative offsetting in terms of caching, mapping, tlb, fetch timing, etc.

but in general, we can still provide a local pool per section, placed outside of the return branch (which will theoretically still take 2-3 cycles):

if performance becomes of greater concern later on, constant definition could become section-specific in an attempt to speed up execution. for example - loading a constant partially, performing some other operation and then finishing loading the constant, which may obfuscate the code a bit.

it has been almost a year and i'm really sorry about that, but due to personal reasons and job work i've abandoned this completely and went on to doing less engaging open source in my spare time... i can assure you though, that if you really want to get your hands dirty you should definitely try something of this magnitude - you will learn a lot.

one potential issue that held me back a little was the recent (at the time) refactoring, which isn't much of a problem, but more of a challenge.

the second issue was that even if targeting ARM is not that bad an idea, usually the mobile vendors (which mostly decide on ARM due to the efficiency of the platform) will apparently impose sandbox limitations that may disable this type of engine completely. so, for example, even if the engine runs on Android it may not work at all on iOS, unless redesigned (of sorts, and if possible). so the thing here is that you may not get a user base for the software you are writing.

unless we are very prescient and ARM suddenly decides to target the desktop, this might end up being only a very nerdy mind-flex for developers, which certainly isn't a bad thing of course :].