Yes, that would be *really* interesting although I wouldn't know where to start

Cool. Can you IM me? I'm on Yahoo as rreyelts.

Quote

I did have a look at a site which explains how to do static binary translation (so the original machine code is translated to Java, with dead flags and such removed), and that seems to me the way to speed up CPU emulation the most, but I'm not sure (not knowing much about dynamic recompilation, if anything)...

I think static retargetting would probably be ideal, but I think it's probably easier to do dynamic recompilation - it may also be more amenable to some of the tricks certain programs play.

Quote

When I look at the profiler output, CPU emulation is quite expensive (although it's only ~5-15% on my PC, but that's a quite fast PC), but video emulation is even more expensive in many games... I guess because of lots of array access.

Hmm... It sounds like you're not offloading any work onto the graphics processor. I guess that makes sense for the earlier programs, but I bet that the burden shifts in later systems/programs where you can discern graphics primitives calls.

Quote

If I knew how to do 'dynarec' or static binary translation, I could have a go at emulating a 68000.

I think static retargetting would probably be ideal, but I think it's probably easier to do dynamic recompilation - it may also be more amenable to some of the tricks certain programs play.

Yeah, I think you're probably right.

Quote

Hmm... It sounds like you're not offloading any work onto the graphics processor. I guess that makes sense for the earlier programs, but I bet that the burden shifts in later systems/programs where you can discern graphics primitives calls.

All graphics emulation is done in software on the CPU. Only the final frame is rendered using OpenGL. I think there's a big speed gain possible as well, although the current method might be more accurate.

Quote

You mean cause it would run too slow otherwise?

Well, look at it this way: the 68k has a *huge* number of instructions, and it would not make sense to do every single one in a huge switch/case. It might make more sense to do it more programmatically, and use dynamic recompilation to get some performance out of it. I've seen one 68k emulator done in Java (used in an Atari ST emulator), and that one was done by generating a Java source file with huge switch/cases. The source of that generated 68k emulator was about 4 MB large...

All graphics emulation is done in software on the CPU. Only the final frame is rendered using OpenGL. I think there's a big speed gain possible as well, although the current method might be more accurate.

The big advantage for graphics would be in emulating hardware sprites with OpenGL quads or something. You could also emulate hardware scrolling registers and that sort of thing. Even hardware playfield layers like the Amiga had.

For 68k emulation the trick would be to decompose the instruction word.. they have a fairly regular structure, so you could do a much smaller switch on the bits that define the instruction type, then extract the register IDs from the other bits etc...
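To make that concrete, here's a minimal Java sketch of decoding by bit fields instead of switching on all 65536 possible instruction words. The ADD case follows the 68000's published encoding (1101 rrr ooo mmm eee); the other groups are elided, and the returned strings are purely illustrative:

```java
// Sketch: decode a 68k instruction word by bit fields.
// Only the ADD group is filled in; the rest is elided.
public class Decode68k {
    public static String decode(int word) {
        int top = (word >> 12) & 0xF;          // top nibble selects the instruction group
        switch (top) {
            case 0xD: {                        // ADD <ea>,Dn / ADD Dn,<ea> group
                int reg  = (word >> 9) & 0x7;  // data register number
                int mode = (word >> 3) & 0x7;  // effective-address mode bits
                int ea   = word & 0x7;         // effective-address register bits
                return "ADD D" + reg + ", mode=" + mode + ", ea=" + ea;
            }
            case 0x9:
                return "SUB ...";
            default:
                return "other group " + top;
        }
    }

    public static void main(String[] args) {
        // 0xD240 is ADD.W D0,D1 on a real 68000
        System.out.println(decode(0xD240));
    }
}
```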

The big advantage for graphics would be in emulating hardware sprites with OpenGL quads or something. You could also emulate hardware scrolling registers and that sort of thing. Even hardware playfield layers like the Amiga had.

Exactly.

Quote

For 68k emulation the trick would be to decompose the instruction word.. they have a fairly regular structure, so you could do a much smaller switch on the bits that define the instruction type, then extract the register IDs from the other bits etc...

The Z80 instruction set is a farking nightmare in that regard. No consistency whatsoever from an opcode point of view. Anyway, I'm still not sure why people don't dispatch via an array of interfaces instead, i.e.:

interface OpcodeHandler {
    public void execute();
}

OpcodeHandler[] opcodeHandlers = new OpcodeHandler[] {
    new Op0Handler(), new Op1Handler(), new Op2Handler(), ...
};

Erik already had to pull all of the inline code out of the switch statements into functions (it performed worse), so I can't see how an array dereference + virtual lookup is going to be so much more expensive than a switch. Especially considering that the dynamic recompiler is going to strive to minimize these loop iterations anyway.
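A fuller, runnable sketch of that dispatch style (hypothetical names, only NOP and the Z80's INC A opcode 0x3C filled in, and a lambda standing in for each handler class):

```java
// Sketch: opcode dispatch via an array of handler objects
// instead of a switch statement.
interface OpcodeHandler {
    void execute(CpuState s);
}

class CpuState {
    int a;   // accumulator
    int pc;  // program counter
}

public class Dispatch {
    static final OpcodeHandler[] HANDLERS = new OpcodeHandler[256];

    static {
        HANDLERS[0x00] = s -> s.pc++;                              // NOP
        HANDLERS[0x3C] = s -> { s.a = (s.a + 1) & 0xFF; s.pc++; }; // INC A
        // ... one handler per opcode
    }

    static void step(CpuState s, byte[] mem) {
        // array dereference + virtual (here: lambda) call per instruction
        HANDLERS[mem[s.pc] & 0xFF].execute(s);
    }

    public static void main(String[] args) {
        CpuState s = new CpuState();
        byte[] mem = { 0x3C, 0x3C, 0x00 };  // INC A; INC A; NOP
        while (s.pc < mem.length) step(s, mem);
        System.out.println(s.a);            // prints 2
    }
}
```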

...so I can't see how an array dereference + virtual lookup is going to be so much more expensive than a switch.

Theoretically, the switch could be implemented internally by the compiler as a lookup anyway. It might even be able to optimize the case for no match so that it overlaps with the time that would have been taken to do the array bounds check in your example.

Practically speaking this is worth profiling in a real-world case on different VMs. And you can of course disassemble the bytecode to see what javac does with it. Maybe it isn't so easy to optimize with the JIT compiler.

Theoretically, the switch could be implemented internally by the compiler as a lookup anyway.

I don't know what you mean. The instruction is tableswitch. It is indeed a table lookup. It's part of the VM spec. The JIT implementation should be straightforward. The point is that you still pay a branch penalty (which is what really hurts modern processors), and you still pay a function call penalty, because you end up outlining all of the code anyway.

One of the greatest benefits of dynarec is that you don't pay a branch penalty per instruction anymore.

Erik already had to pull all of the inline code out of the switch statements into functions (it performed worse)

Yeah, manual inlining was a stupid thing to do, although on the MSVM it did get faster that way, and pulling out the inline code caused a performance hit on that VM.

Quote

For 68k emulation the trick would be to decompose the instruction word.. they have a fairly regular structure, so you could do a much smaller switch on the bits that define the instruction type, then extract the register IDs from the other bits etc...

Yes, that is the smarter but slower way to do it. I started an 68k emulator that way once but at the time I wouldn't be able to get it fast enough on my machine I had then.

Quote

The big advantage for graphics would be in emulating hardware sprites with OpenGL quads or something. You could also emulate hardware scrolling registers and that sort of thing. Even hardware playfield layers like the Amiga had.

Yes, a major speed gain would be possible. The thing is, when you use filtering you can see the edges of the sprites/characters, which doesn't look very good: when two sprites or characters are directly next to each other, the pixels on the outer edge of one quad are not filtered with the adjacent pixels from the other quad, so you see a sharp line between them. So I think for the best-looking results, I should not render the quads directly, but blit them to another texture first (one texture per video layer). It's all doable, but not really trivial to rewrite the current renderer that way. If only bounds-check removal were implemented (in the client VM), that would cause a huge performance boost as well.

I don't know what you mean. The instruction is tableswitch. It is indeed a table lookup.

Well, I don't know much about the VM spec/bytecodes, but I was assuming that a table lookup (by which I meant using the switch value as an index into a table of function pointers) would be a bad idea in some cases, as the switch values could be very sparse and the table size would need to be 4 GB.

Quote

The point is that you still pay a branch penalty (which is what really hurts modern processors), and you still pay a function call penalty, because you end up outlining all of the code anyway.

So you are saying that there are effectively two branches in the switch case? One for the switch decision and another as part of the function call? (The call being unconditional, so not nearly as bad for the processor's instruction pipeline.) So the theory is that the array lookup only has the one branch for the call, which is always taken. If I have that right, I was supposing that the JIT of the 'tableswitch' would end up producing machine code that effectively did the same thing; given the nature of the switch statement, it could be a special case that the JIT compiler was able to convert into the native equivalent of the array of interfaces.

Quote

One of the greatest benefits of dynarec is that you don't pay a branch penalty per instruction anymore.

Yep, obviously JIT compiling is proof that it works. The emulator, after all, is nothing more than a VM whose opcodes (bytecode) are the instruction codes of an actual processor. I guess that makes it an NVM (non-virtual machine)?

The scary thing is dealing with self-modifying code in an environment with dynamic recompilation, a practice that was fairly common in those days. I guess arcade-machine ROM-based programs would have less of that, so you are at an advantage there.

Well, with dynarec you will get rid of loads of branches, and I think HotSpot will have a much easier job removing dead flags and doing other optimizations. I think it will make a difference, although the difference might not be as big as when doing it in C (static binary translation in C with dead-flag removal has been reported to get as much as 25-30x faster than interpretation). Maybe for the Z80 emulation it might not be too important on recent PCs, since a Z80 doesn't run at high clock speeds anyway. But once it works, the same technique can be applied to 68k emulation, and there it will matter.
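To illustrate what dead-flag removal buys, here is a toy Java sketch (hypothetical code, not from the emulator): the same three-instruction block, once interpreted with a flag update after every instruction, and once translated as a unit so that only the last (live) flag update survives and the arithmetic folds together:

```java
// Sketch: dead-flag elimination. In the block A += 1; A += 2; A += 3,
// only the flags produced by the final ADD can ever be observed, so
// the first two flag computations are dead.
public class DeadFlags {
    static int a, flagZ;

    // Interpreter-style: flags recomputed after every instruction.
    static void interpreted() {
        a = (a + 1) & 0xFF; flagZ = (a == 0) ? 1 : 0;
        a = (a + 2) & 0xFF; flagZ = (a == 0) ? 1 : 0;
        a = (a + 3) & 0xFF; flagZ = (a == 0) ? 1 : 0;
    }

    // Translated block: arithmetic folded, dead flag updates removed.
    static void translated() {
        a = (a + 6) & 0xFF;
        flagZ = (a == 0) ? 1 : 0;  // only the live flag update remains
    }

    public static void main(String[] args) {
        a = 250; interpreted(); System.out.println(a + " " + flagZ);  // 0 1
        a = 250; translated();  System.out.println(a + " " + flagZ);  // 0 1
    }
}
```

Both paths produce identical observable state; the translated one simply does far less work per block.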

There are of course more optimizations possible in other areas (mainly rendering), but I'm looking at that as well.

Sounds like it will all get far too complex for a very very minimal gain...

I don't think you know what you're talking about Cas. You translate X number of impossible to optimize interpreter loops into a single optimizable function. That should be a huge win. I could easily see it running 10x faster. As far as it being "far too complex", I already almost have a working prototype ready.

Well, there are arcade machines which run as many as five Z80s simultaneously. If we can keep the system requirements down, that would be welcome. And once we (well, Toby, that is :-)) get it working, the same technique can be applied to faster CPUs, and there it will surely be important.

Immediately after the ROM is loaded, I walk the Z80 machine instructions looking for a window of instructions that is uninterrupted by jumps. Once I have that window established, I generate Java bytecode that is the equivalent to what the interpreter would run, given the same sequence of instructions. I then replace the first machine instruction of that window in ROM with a "dynarecTrap instruction" which has an opcode that falls outside the valid set of Z80 opcodes. The instruction contains a pointer to the Java bytecode I just generated. So, when the interpreter is running, if it sees a dynarecTrap, it runs the corresponding Java bytecode, instead of looping the series of machine instructions it would have normally.
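Here is a rough Java sketch of that trap scheme (all names and the jump predicate are hypothetical stand-ins, and a Runnable stands in for the generated bytecode): find a straight-line window, "compile" it, and patch the window's first word with an out-of-range opcode carrying the block index:

```java
// Sketch: dynarec trap. ROM is held as ints so a value outside the
// valid 0x00-0xFF opcode range can be patched in as a trap.
import java.util.ArrayList;
import java.util.List;

public class DynarecTrap {
    static final int TRAP = 0x100;            // bit 8 set: not a valid Z80 opcode
    static final List<Runnable> BLOCKS = new ArrayList<>();

    // Stand-in predicate: treat JP, RET and CALL as window terminators.
    static boolean isJump(int op) { return op == 0xC3 || op == 0xC9 || op == 0xCD; }

    static void compileWindow(int[] rom, int start) {
        int end = start;
        while (end < rom.length && !isJump(rom[end])) end++;
        final int from = start, to = end;     // [from, to) is jump-free
        // Real scheme: generate Java bytecode here. A Runnable stands in.
        BLOCKS.add(() -> System.out.println("executing block " + from + ".." + to));
        rom[start] = TRAP | (BLOCKS.size() - 1) << 16;  // trap + block index
    }

    static void step(int[] rom, int pc) {
        int op = rom[pc];
        if ((op & TRAP) != 0) {
            BLOCKS.get(op >>> 16).run();      // run the compiled block instead
        }
        // else: normal interpreter dispatch on op
    }

    public static void main(String[] args) {
        int[] rom = { 0x3C, 0x3C, 0xC9 };     // INC A; INC A; RET
        compileWindow(rom, 0);
        step(rom, 0);
    }
}
```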

This should optimize out really well, because it takes a set of instructions that would each normally have required a dynamic branch to execute, and instead places them back to back, where all of a sudden you can perform local optimizations like inlining, dead-flag detection, redundant-instruction collapsing (i.e. multiple updates to the PC or cycle counter), etc...

Maybe you can flag 256-byte or 1 KB pages in memory as being recently written to and, upon attempting to execute code in such a page, run it through the compiler?
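That idea could be sketched like this (page size and all names are hypothetical): set a per-page dirty bit on every write, and check it before executing previously compiled code in that page.

```java
// Sketch: dirty-page tracking for self-modifying code under dynarec.
public class DirtyPages {
    static final int PAGE_SHIFT = 8;                      // 256-byte pages
    final byte[] mem = new byte[0x10000];                 // 64 KB address space
    final boolean[] dirty = new boolean[0x10000 >> PAGE_SHIFT];

    void write(int addr, byte value) {
        mem[addr] = value;
        dirty[addr >> PAGE_SHIFT] = true;                 // flag the whole page
    }

    // Called before running compiled code at pc; true means the caller
    // must throw away the old translation and recompile this page.
    boolean needsRecompile(int pc) {
        int page = pc >> PAGE_SHIFT;
        if (dirty[page]) {
            dirty[page] = false;                          // clear once handled
            return true;
        }
        return false;
    }
}
```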

I think there's some miscommunication.

The point I'm trying to make, is that I can't translate the machine code, unless I can follow the jumps. This is because the bits immediately following a jump may be anything - data/garbage/whatever. Once I hit that, there's no way to continue. I considered trying to statically follow the jumps, but this seems nearly impossible for half the jumps, which do things like jump to an address stored in a register. (You can't even do clever stuff, like look for a CALL before a RET, because the crazy programmer may have done something insane like do a RET without having ever executed a CALL). So the only recourse I have is to track instructions as they are executed.

This is annoying compared to something like a Java class file, where there is no difficulty in understanding what parts are data and what parts are bytecode. It gets even nicer than that, because Java mandates that you do have to be able to statically analyse where a program can jump to.
