So, as I've mentioned in the past, I went down the rabbit hole for a little while to fundamentally reconsider the design of CEN64. I'm getting closer and closer to being able to pop my head out - here's a design document/notes that I've made for myself so far that some developers may appreciate! I've designed just about all of the system that I discuss so far... it's mostly just a matter of answering the remaining questions before really going to town on this thing.

----------------
Introduction
----------------
So, at this time, I've effectively written two CEN64 interpreter cores.
Unfortunately, they're simply not fast enough, even with crazy amounts of
high-end hardware.
To address this issue with the third (and hopefully, final) write of CEN64,
I've researched two possible solutions that should allow CEN64 to run at full
speed on modest hardware:
* Run multiple interpreters in parallel (one for each pipeline, processor,
etc.) and design them such that each one can "commit" and "rollback" a
simulated cycle - effectively, a transactional memory-like approach.
Early analysis of this kind of approach demonstrated a prohibitively high
transactional "abort" rate. I found that, due to the N64's architecture,
there was a lot of aborting and stalling, as components access RDRAM very
frequently, raise interrupts often enough, etc. And, since every piece of
the system is cycle-accurate, commit logs are incredibly expensive to
maintain due to the state of all the latches and everything else that
needs to be preserved.
So, I'm more or less convinced that a transactional-style system that is
capable of leveraging multicore processing isn't the answer.
* Leverage dynamically-recompiled blocks. On a Haswell system, it's not
uncommon to see 3-3.5 IPC when the RDP isn't running. Unfortunately, a lot
of the instructions being executed are very, very predictable
conditional branches (which cannot be omitted, as the cycle-accurate
pipelines still need to catch the uncommon case)!
So, what one could do is to emit and somehow link together small, optimized
cycle-accurate pipeline models that have many conditional checks omitted.
Accuracy is still maintained because oftentimes, it can be determined that
many of the conditional checks are not necessary, depending on the program
counter or some other state of the system. Other common examples of checks
that can be omitted in the VR4300 pipeline alone: the data cache (DC) stage
can simply be made to forward the contents of the latches when the prior
cycle was not a memory instruction. The execute (EX) stage does not have to
check whether a COP (coprocessor unusable) exception should be raised, or
if FPU registers need to be accessed, when the instruction is an ordinary
integer instruction. Virtual address regions (uncached, cached and mapped,
etc.) can be determined ONCE depending on the virtual address of the block.
These are just a handful of examples where checks can be optimized out
simply by using an initial pass that analyzes the state of the system.
And so, this is the route that I've decided to take with the third write
of the CEN64 core.
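As a rough illustration of the check-elision idea (a Python sketch with invented names -- not CEN64 code), a generic stage model checks everything every cycle, while a specialized model, valid only under assumptions proven by an initial analysis pass, skips the check entirely:

```python
# Hypothetical sketch: eliding a per-cycle check when an analysis pass
# has proven it unnecessary for a given block. All names are invented.

def generic_dc_stage(state):
    """Generic DC stage: always checks whether a memory access is pending."""
    if state["prior_was_memory_op"]:
        state["latch"] = state["dcache"][state["addr"]]
    # Otherwise just forward the latch contents unchanged.
    return state["latch"]

def specialized_dc_stage(state):
    """Specialized DC stage, valid only for blocks where the analysis pass
    proved the prior cycle was never a memory instruction: the check is
    elided and the latch is forwarded directly."""
    return state["latch"]

def select_dc_stage(block_has_memory_ops):
    # The initial analysis pass picks the cheapest correct model.
    return generic_dc_stage if block_has_memory_ops else specialized_dc_stage
```

The real gain comes from doing this across many checks at once (COP exceptions, FPU access, TLB region lookups), not just one.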
----------
Design
----------
The heart and brains of the emulator run within a virtual-machine like context.
The goal of the system is to call a thunk and remain inside the context as much
as possible, exiting only to compile or perform some activity that cannot be
performed within the context itself.
Working inside a context gives us free rein over the hardware registers and
enables us to effectively ignore the host's calling convention. This means
that, for example, on x86_64, we can keep the entirety of the RSP's accumulator
registers and flags in native hardware registers, and still have half of our
vector hardware registers to spare!
Dynamically recompiled blocks of code can quickly be allocated and deallocated
using a custom slab-based allocator. Although the allocator has a fixed-size
memory pool and probably has higher overhead than conventional memory-allocation
algorithms, it's significantly faster than all the libc malloc/free implementations
I've tried (10x faster than GNU libc). Moreover, there is no need to worry about
marking pages as executable on every allocation, since the allocator
reuses the same set of pages (which remain executable throughout the execution of
the emulator). Lastly, it coerces the system into using large pages so as to reduce
the number of page faults, even for large amounts of dynamically recompiled
code.
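To make the allocator idea concrete, here is a toy free-list slab allocator in Python. The class name, pool sizes, and bytearray backing are all invented for illustration; the real allocator would carve up an mmap'd pool that stays executable and uses large pages:

```python
# Hypothetical sketch of a fixed-pool slab allocator for JIT blocks.

class SlabAllocator:
    def __init__(self, pool_size, slab_size):
        assert pool_size % slab_size == 0
        self.slab_size = slab_size
        # In the real emulator this pool would be mapped once with execute
        # permissions and large pages; here it is just a plain bytearray.
        self.pool = bytearray(pool_size)
        # Free list of slab indices; alloc/free are O(1) pushes and pops.
        self.free = list(range(pool_size // slab_size))

    def alloc(self):
        if not self.free:
            return None  # fixed-size pool exhausted
        return self.free.pop() * self.slab_size  # byte offset into the pool

    def free_slab(self, offset):
        self.free.append(offset // self.slab_size)
```

Because the pool is allocated once and recycled, no per-allocation page-permission work is ever needed.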
With an execution environment and allocator in place, the design questions that
remain are really the most interesting of the bunch: how does one efficiently
dynamically compile optimized cycle-accurate models (and link them together) by
some means?
To efficiently compile blocks, my current approach relies on the use of several
"templated" cycle-accurate models, each with a hole in the middle of the model for
emulating the execution logic (an FPU add, an integer multiply, etc.). In this
way, optimization of the templates can be done ahead of runtime, and
assembly of a model at runtime is very efficient, since it really only involves
the selection of some templates and data movement. The selection is done simply
by leaving the virtualized context for a brief period, taking the current state of
the system and feeding it to a selection algorithm.
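The template-with-a-hole idea can be sketched like this (Python closures standing in for pre-optimized assembly templates; all names and the state layout are invented):

```python
# Hypothetical sketch of "templated" cycle models with a hole for the
# execution logic: a template fixes the surrounding pipeline behavior,
# and runtime assembly just plugs one operation into the hole.

def make_cycle_model(template, execute_op):
    """Close a pre-optimized template over one execution operation."""
    def model(state):
        return template(state, execute_op)
    return model

def int_template(state, op):
    # Template for plain integer instructions: no FPU access and no
    # coprocessor-unusable check -- those were proven unnecessary.
    state["ex_latch"] = op(state["rs"], state["rt"])
    return state

def fpu_template(state, op):
    # Template for FPU instructions: the usable-bit check stays in.
    if not state["cop1_usable"]:
        raise RuntimeError("coprocessor unusable")
    state["ex_latch"] = op(state["fs"], state["ft"])
    return state

# Runtime "assembly" is just template selection plus data movement:
add_model = make_cycle_model(int_template, lambda a, b: a + b)
```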
With most questions of compilation itself sorted out (if you can even call it
that!), the only real question remaining is this: how does one link together
these models at runtime? Since each model has several cycles of 'assumptions'
baked into it, branching backward and forward really throws a wrench into the
mix, because we need to do something until the pipeline is "primed" and we can
start executing our models again.
One option is to add additional checks to each model (to make them more
generic), but that effectively cancels out the potential gain of the system,
so I'd rather avoid that if possible.
Another option is to compile and store paths for both branch directions along
with the simulated model for that cycle. Indirect branches will be a little
more cumbersome, but will still work as long as there are only a few potential
candidates for branching. In the event that this doesn't end up being the case,
or we have to start emulating a hardware trap/exception, we can run a generic
interpreter for a few cycles/instructions, and then jump back into the compiled
code.
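To illustrate the linking scheme (two compiled successors per block, with a generic-interpreter fallback for indirect branches and traps), here is a hypothetical Python sketch; the structure and names are invented:

```python
# Hypothetical sketch of block linking with an interpreter fallback.

class Block:
    def __init__(self, pc, run, taken=None, not_taken=None):
        self.pc = pc
        self.run = run            # simulate the block; returns branch outcome
        self.taken = taken        # pc of the compiled taken-path successor
        self.not_taken = not_taken  # pc of the compiled fall-through successor

def execute(blocks, interpret, pc, steps):
    """Run `steps` blocks, falling back to `interpret` (a generic per-cycle
    interpreter) whenever no compiled block exists for the target pc."""
    for _ in range(steps):
        block = blocks.get(pc)
        if block is None:
            pc = interpret(pc)    # e.g. indirect-branch or trap target
        else:
            pc = block.taken if block.run() else block.not_taken
    return pc
```

The interpreter fallback is what keeps accuracy intact when the few-candidates assumption for indirect branches breaks down.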

MarathonMan wrote:Run multiple interpreters in parallel (one for each pipeline, processor,
etc.) and design them such that each one can "commit" and "rollback" a
simulated cycle - effectively, a transactional memory-like approach.

Mmmmh, very interesting. Tell me if I'm wrong, but transactional memory in a multi-threaded context implies synchronization. How do you handle this while keeping performance?

MarathonMan wrote:So, I'm more or less convinced that a transactional-style system that is
capable of leveraging multicore processing isn't the answer.

Ok, so you say you DON'T want to do that. lol

MarathonMan wrote:So, what one could do is to emit and somehow link together small, optimized
cycle-accurate pipeline models that have many conditional checks omitted.

Seems to be the end of <90KB build.

I like the idea though. Do you have any examples of some expensive conditions we could skip, and in which cases? Is it impossible to start from the current CEN64 core to implement this?

MarathonMan wrote:Accuracy is still maintained because oftentimes, it can be determined that
many of the conditional checks are not necessary, depending on the program
counter or some other state of the system.

Any approach in mind to handle the "expensive conditions array"? Will you use a map?

Like a bit field: 0100111. Each bit representing a particular condition that needs to be checked (1) or not (0), and you generate your pipeline "queue" from it?
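For what it's worth, that bit-field suggestion could be sketched like this (Python; the condition names and stage strings are invented, not CEN64's):

```python
# Sketch of the bit-field idea: each bit marks one condition the
# generated pipeline must still check for a given block.

CHECK_DC_MEMORY  = 1 << 0   # DC stage: pending memory access?
CHECK_COP_USABLE = 1 << 1   # EX stage: coprocessor-unusable exception?
CHECK_FPU_ACCESS = 1 << 2   # EX stage: FPU register file access?

def build_pipeline(mask):
    """Pick only the per-stage checks this block still needs."""
    stages = []
    if mask & CHECK_DC_MEMORY:
        stages.append("dc_check_memory")
    if mask & CHECK_COP_USABLE:
        stages.append("ex_check_cop")
    if mask & CHECK_FPU_ACCESS:
        stages.append("ex_fpu_access")
    return stages
```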

MarathonMan wrote:Other common examples of checks
that can be omitted in the VR4300 pipeline alone: the data cache (DC) stage
can simply be made to forward the contents of the latches when the prior
cycle was not a memory instruction. The execute (EX) stage does not have to
check whether a COP (coprocessor unusable) exception should be raised, or
if FPU registers need to be accessed, when the instruction is an ordinary
integer instruction.

In this particular example, you would disable EX and COP emulation and, because the behavior is predictable, manually update the EX and COP registers?

MarathonMan wrote:And so, this is the route that I've decided to take with the third write
of the CEN64 core.

Looks like a very good one.

About linking (I'm not sure I totally understand): if your pipeline contains a limited number of stages (DC, EX, COP), you would have something like this:

This way, a "default" pipeline model (the current one) is 0,0,0 and you would have an "optimized" one that could be 3,2,3, each "compiled block" assuming a certain amount of things.

Each of those "paths" (3,2,3) can be mapped to a certain set of initial states:
If the instruction queue is blah, register X is blah, register Y is blah, go for 1,3,2. I have no idea how registers are stored in CEN64, but if registers are aligned, it can be as simple as a checksum:

Narann wrote:I like the idea though. Have you any example of some expensive conditions we could skip and in which case? Is it impossible to start from the current Cen64 core to implement this?

Well, the current core has to check everything, every cycle, without exception. See the "other examples of checks" that I wrote about for some that are very uncommon. In addition to that list, another "uncommon" one is determining which register file to read the source registers from (when forwarding is not used) - does RS/RT reference the integer RF, or is it VS/VT referencing the COP1 RF? Most of the time, we're executing an integer instruction!

Narann wrote:Any approach to handle "expensive conditions array"? Will you use a map?

Basically, arrays of pre-generated assembly code, one for each "kind" of cycle. Size shouldn't be a problem, because most of the models, or "kinds of cycles," will have assumptions removed and be shorter as a result. If one case is too large, then I can just use a more generic one in its place.

As for selecting which array: when an (uncompiled) cycle is first encountered, an "interpreter"-like cycle will run that also notes all of the assumptions and selects the correct code path at the end of the cycle (using a regular index, I suppose). No map or anything like that should be necessary.
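That first-encounter selection scheme might look something like this sketch (Python; the classification bits and model table are invented stand-ins for the pre-generated assembly arrays):

```python
# Hypothetical sketch: on first encounter, an interpreter-style cycle
# records its assumptions and folds them into a plain index into an
# array of pre-generated models; later visits dispatch by index.

NUM_KINDS = 4
models = [lambda k=k: f"model_{k}" for k in range(NUM_KINDS)]
compiled = {}   # pc -> model index

def classify(is_memory_op, is_fpu_op):
    # Fold the observed assumptions into a regular index (no map needed).
    return (is_memory_op << 1) | is_fpu_op

def run_cycle(pc, is_memory_op, is_fpu_op):
    if pc not in compiled:
        # First encounter: interpret once while noting assumptions.
        compiled[pc] = classify(is_memory_op, is_fpu_op)
    return models[compiled[pc]]()
```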

Narann wrote:In this particular example, you would disable EX and COP emulation and, because the behavior is predictive, manually update EX and COP registers?

Most of the time, COP registers don't need to be updated at all -- the only real exception being COP0::Count. EX logic would work as normal, but the fact that an exception doesn't need to be checked for saves the CPU a good number of instructions of potential work (otherwise, it has to check whether COP0::Status has the usable bit set for the relevant coprocessor, which mode the CPU is in according to COP0::Status, etc.).

Narann wrote:About linking (I'm not very sure to totally understand), if your pipeline contain a limited number of stage (DC, EX, COP) you would have something like this:

This way, a "default" pipeline model (the current one) is 0,0,0 and you would have an "optimized" one that could be 3,2,3, each "compiled block" assuming a certain amount of things.

Each of those "path" (3,2,3) can be mapped to a certain amount of initial states:
If instruction queue is blah, register X is blah, register Y is blah, go for 1,3,2. I have no idea how registers are stored in Cen64 but if registers are aligned, it can be as simple as a checksum:

Unfortunately, the pipeline will always have 5 stages! And most of the time, they do useful work.

If I were to separate the function for each part of each stage (I think this is what you mean), there would be a lot of branching, which would add a lot of performance overhead for stages like WB (unless the branches are predictable... but I don't think they will be), which normally just copies the result from the latch into the RF using the index given by the latch. Only very infrequently does it "slip" because of a COP0 restriction.

If you think about it though, this is what I'm doing on the component level: where instead of a tuple being used to express the kind of cycle within a component, the tuple is used to express the state of the current component (RSP, VR4300, etc.).

I hope that makes sense, because I definitely didn't word that well at all.

It looks like a good approach for a good performance increase, MarathonMan. But it looks quite complicated to get it working. I'm just thinking this: maybe it would be easier to implement full recompilation of the whole ROM before running a game, doing this recompilation in a way that the code returns to the idle loop each time a cycle ends (or every N cycles, depending on the cycle granularity that you're using, which I don't know). The only times when the recompiler would be invoked would be when loading a new ROM and when a game loads microcode into the RDP.

asiga wrote:maybe it would be easier to implement full recompilation of the whole ROM before running a game, doing this recompilation in a way that the code returns to the idle loop each time a cycle ends

I wonder how much time it would take the compiler to recompile the whole ROM.

But I like the idea. It would simplify the whole design: One code to recompile, one code to run recompiled code.

The problem with a static compiler is that it would only work for certain games. A lot of carts can, and do, move code around, uncompress code, etc. ... this is why it has to be done dynamically. If it weren't for that, I'd totally agree that ahead-of-time compilation would be the way to go.

I've told a few people about where I'm planning to go with the project, but I still get someone asking me what's up every once in a while. So, I'll just leave this here so I don't have to repeat myself.

Most of my time lately has been spent working on a new concept (more on this later). In the meantime, I've been spending some time on and off optimizing the current core to see how much I could squeeze out of it. In doing so, I've completely made up my mind: an interpreter-based CEN64 is simply intractable as far as achieving realtime performance -- even on very high-end hardware. And mind you, this is with an instruction-level accurate RDP! Please, someone prove me wrong!

I've tried quite a few things to address this:

I had some success (~5% VI/s increase) in vectorizing the RDP by taking advantage of locations where RGBA or STWZ pixels and spans are processed in parallel, etc.

I split the RCP and VR4300 into two threads, while sacrificing some accuracy in the process. Visually, this appears to have been a pretty big win and not a whole lot of accuracy is sacrificed. Still too slow, though.

I have tried splitting the RCP further, into an RSP and RDP thread and had limited success. Lots of titles, including Super Smash Bros., will run at 60VI/s -- but overall, there is a huge hit to accuracy and things in general just don't work.

I've profiled things to death. It's very evident that many ROMs will target one component of the system. So, if you want good coverage across the whole ROM set, everything (VR4300, RSP, RDP) needs to be really optimized.

Now going off that last bullet: in speaking to developers in the community, it seems that everyone else has come up with the same findings. Lots of exciting work is going on to address it. My personal approach will be to take the design I originally posted about in this thread and give it a twist.

I was kind of uncertain about some parts of the design as I began to move forward. At one point, it hit me: all of the CPUs (VR4300, RSP, RDP) inside the N64 use an in-order architecture. I realized that I could leverage this inherent property to my advantage as far as performance is concerned. The idea is this: when a CPU stalls, the pipeline freezes until a condition is met (whether it be waiting on an RDRAM access, or for that multicycle FPU operation to complete). When this occurs, I can simply use coroutines to eject from the current location and branch to the next JIT block. On the next cycle for the stalled component, instead of re-entering a quasi-generic pipeline model like I do now, I can branch directly back into the spot (rather, into the coroutine) where the stall occurred and perform an extremely lightweight check (and continue if needed).
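The stall/eject idea can be sketched with Python generators standing in for the coroutines (all names are invented; the real design would use native-code coroutines inside the VM, not Python):

```python
# Hypothetical sketch: a component yields when it stalls, and the
# scheduler resumes it exactly at the stall point each cycle with a
# lightweight readiness check.

def vr4300(shared):
    shared["acc"] = 1
    while not shared["rdram_ready"]:     # stall: eject until the wait ends
        yield "stalled-on-rdram"
    shared["acc"] += shared["rdram_value"]
    yield "done"

def run(shared, cycles, ready_at):
    cpu = vr4300(shared)
    states = []
    for c in range(cycles):
        if c == ready_at:
            shared["rdram_ready"] = True  # e.g. the RDRAM access completes
        states.append(next(cpu))          # resume directly at the stall point
    return states
```

Because the generator resumes mid-function, no quasi-generic pipeline model needs to be re-entered after a stall.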

It all gets better, though. Since CEN64 will eventually have its own VM, I effectively control the calling convention and can play all sorts of games with the register allocator to "teach" it about these eject points. I can also do things like statically reserve some registers (say, 8 XMM registers on x86_64 for RSP accumulators and clips), so that a very minimal amount of spilling and reloading is incurred at many of these points. I've also thought about statically reserving one or two registers per component, so that each CPU core has one or two registers devoted to it, to further reduce spills and reloads. The opportunities are endless.

I think with all these ideas mashed together, a cycle-accurate dynamic recompiler should be tractable (however ugly).

I have begun writing a compiler to move forward with all these ideas. The compiler will compile the initial interpreter, and its guts will double as a JIT infrastructure. The interpreter will profile for hot sections of code and flag them for compilation. The JIT compilers (running in separate threads, so as not to impact emulation) will pick up these hints and compile code for the hot spots.
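The profile-and-flag loop described here could look something like this sketch (Python; `HOT_THRESHOLD` and the queue-based handoff are assumptions for illustration, not CEN64 specifics):

```python
# Hypothetical sketch of interpreter-side profiling: count executions
# per block and flag hot ones onto a queue that JIT threads drain.
from queue import Queue

HOT_THRESHOLD = 50
counts = {}
jit_queue = Queue()   # drained by background compiler threads

def profile(pc):
    counts[pc] = counts.get(pc, 0) + 1
    if counts[pc] == HOT_THRESHOLD:
        jit_queue.put(pc)  # hint for a JIT compiler thread; fires once per pc
```

Keeping compilation on separate threads means the interpreter never blocks waiting for code generation.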

Hopefully, I'll get it right this time. I'm very excited and optimistic that this will finally give me the headroom I need to add all the checks cycle-accuracy requires while delivering 60VI/s.

Thanks for keeping us posted. This looks quite ugly, as you say, but it seems very promising in terms of efficiency. I don't think this has ever been tried on the N64 scene.

MarathonMan wrote:The interpreter will profile for hot sections of code and flag them for compilation.

I wonder about the cost of this during gameplay. I wonder if it wouldn't be even better to allow the emulator to save already-known hot sections/recompiled sections (as a cache file), so it doesn't have to hang each time you restart a game.

I'm very excited by what you say, and I suggest keeping all this information about the design. The point is that this will maybe be code that is hard to grasp, so any hint on how the whole thing works is interesting for newcomers (or future contributors).

A JIT cache that persists between executions is something I've thought about eventually adding as an incremental thing. In the meantime, small blobs of code that get run several times over, like the RSP ucodes, the main engine and libultra code, etc., will immediately benefit even without a JIT cache that lives between runs.

Snowstorm64 wrote:This sounds promising! But I wonder... what about the actual core? Can it be recycled or at least are some parts of it? It would be sad to see all this work on the actual core wasted...

I'll be able to reuse most of it... I just have to transcribe it into the new language. I think the idea of what I have hasn't come through very clearly yet, so let me explain a little further.

The compiler inside CEN64 will be very different than anything you'd typically see in an HLE. Most HLE compilers are not compilers at all - they're just binary translators. They take an input language (MIPS binary) and convert it into an output language (x86 binary). They often use neat tricks like register caching during the conversion process to make the output code cleaner.

CEN64, on the other hand, will feature a full-blown, optimizing, source-to-binary compiler and managed runtime. There are a few reasons for this madness:

One reason being that, because of the nature of a compiler, it is likely that I will encounter a lot of bugs along the way, which makes writing a cycle-accurate emulator that much more difficult. I'm designing the CEN64 compiler so I can run it in a standalone mode and perform regression testing outside of the emulator. Especially as I begin developing passes and getting into more of the compiler backend, this will help catch a lot of bugs and allow me to pinpoint things quicker, since I can just run a regression suite as a git post-commit hook.

Secondly, and probably of more importance, is that I need to interleave simulation code alongside the translated MIPS instructions. The simulation code that will get woven in with the instructions will vary greatly in size and complexity depending on which parts of the CPU pipeline the MIPS code is exercising. Trying to interleave simulation code alongside the output of a binary translator would be nothing short of a complete disaster and would certainly result in suboptimal code being generated. A compiler, on the other hand, can programmatically take in all this input and optimize across all of it in a generic way. Once I spot additional opportunities for optimization, I don't need to dig around in the generated recompiler code... I can just write another compiler pass that further optimizes the graphs, perhaps taking some assumptions as input.
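As a rough sketch of what such an assumption-driven pass might look like (Python, with invented op names and assumption labels -- not the actual CEN64 IR):

```python
# Hypothetical sketch of an optimization pass: given an IR as a list of
# (op, args) nodes and a set of assumptions, drop the checks the
# assumptions make redundant -- instead of patching generated binary.

def elide_checks_pass(ir, assumptions):
    out = []
    for op, args in ir:
        if op == "check_cop_usable" and "integer_only" in assumptions:
            continue  # proven unnecessary for this block
        if op == "check_tlb" and "unmapped_region" in assumptions:
            continue  # virtual-address region determined once per block
        out.append((op, args))
    return out
```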

Lastly, looking forward, keeping the compiler quasi-generic and extensible means that there is a greater degree of reusability. There is nothing that ties the compiler to strictly N64 emulation (rather, all of the specific logic is deferred to compiler passes or the runtime). If and when the day comes, this could become a framework for either myself or others to build very fast and efficient cycle-accurate simulators.

Anyways, with all that out of the way, I can reveal a bit about the compiler as it stands today so you can see how translating from the C interpreters will be quite easy.

That's it. The language has variables, function calls, binary expressions, conditional branching, and some niceties like comments and include directives. Nothing super fancy here. But if you look at vr4300/pipeline.c, you will see that, quite frankly, that's all that code is really doing. Adding anything else, such as pointers, would only serve to complicate the compilation and optimization process.

It then lowers the graph to x86_64 binary code optimized for your CPU. The days of separate SSE2/SSSE3/SSE4.1/AVX/native builds are gone... there will be only one portable binary that is capable of generating optimized code for the host CPU.

So hopefully now, you can see that I plan to reuse almost all of the logic in the interpreter cores. I just have to transcribe it into the new language.

After I get the interpreters running, I can then use them to actively profile for hot sections of MIPS code and flag them for compilation. A JIT thread will pick up these hints and compile very optimized code for those segments using the existing compiler infrastructure (the interpreters will continue to be used for everything else - no need to compile the world).

After reading your post multiple times, I think that's a genius way to achieve accurate emulation while delivering full speed. The only thing that concerns me is that it looks too awesome to be true... are there any downsides (other than translating the code into the new language)?!
I hope you'll finally manage to make the dream of perfect emulation come true this time.

Narann wrote:One that could change the face of CAE (Cycle Accurate Emulation, let's create a new acronym).

Yes, cycle-accuracy for all! I'm not certain that it will have enough oomph for embedded use cases (at least for N64 - I could see SNES being a thing), but time will tell.

Snowstorm64 wrote:are there any downsides (other than translating the code in the new language)?!

Sure, I think there are some.

Firstly, I'm not going to beat around the bush: the language wasn't designed around elegance -- it's a really ugly language at best and has very limited use cases. One of the big reasons for why we emulate things is preservation, and it kinda sucks that something like this will be used to preserve what the hardware does.

Secondly, the design implies some kind of 'lock-in' to an environment that can support all the requirements of the compiler and runtime. Any architecture or OS that wishes to benefit from the emulation must port everything (in the case of a different architecture, this means writing a new backend). This may not seem like a huge deal from the outset, but it does prevent the emulator from running on things like iOS (which does not permit programmers to acquire executable pages).

Is it a permissions limitation or an OS limitation? I mean, if you root it, can you do it? And if so, would you break something? If the answer is yes, then no, you should not bother with software-limited hardware. This low-level stuff is never "nice" (is there any emulator dynarec working on more than one architecture that is nicely coded?). From what I've seen, documentation (comments, a big overview) is often the only way to go. Try to make the global code nice and confine what must be dirty to some specific, separated locations.

MarathonMan wrote:Firstly, I'm not going to beat around the bush: the language wasn't designed around elegance -- it's a really ugly language at best and has very limited use cases. One of the big reasons for why we emulate things is preservation, and it kinda sucks that something like this will be used to preserve what the hardware does.

Secondly, the design implies some kind of 'lock-in' to an environment that can support all the requirements of the compiler and runtime. Any architecture or OS that wishes to benefit from the emulation must port everything (in the case of a different architecture, this means writing a new backend). This may not seem like a huge deal from the outset, but it does prevent the emulator from running on things like iOS (which does not permit programmers to acquire executable pages).

If I understood what you mean, then... well, to be fair, it's not like the actual CEN64 code is readable/portable enough anyway, because of the heavy use of SSE/AVX intrinsics... theoretically, ANSI C-compliant code would be ideal, but because we need to achieve cycle-accurate emulation at an acceptable speed (and also think of code portability!), the use of SSE intrinsics is a necessary evil. So if you have to write the new core as you have just described in order to deliver perfect CAE at good speed... just do it! Maybe one day we'll have computers powerful enough to do perfect CAE without any of those compromises, but for now, let's preserve everything before all the N64 consoles begin to die.

I can attest to that via the most modern use of Grade A Nintendium - the Wii remote; I had one a couple years ago that seems to have randomly died on me even though the remote itself looks perfectly fine (I opened it up and everything - it looks pristine).

Whenever I pick this up, I always end up doing something other than what I planned. Ugh.

Most recently, I have bolstered the semantic analysis, which is quite boring, if I do say so myself. There is now type checking (TODO: implicit casting), improved variable parsing and handling, and more. Of course, because I love to over-optimize things, the memory requirements are further reduced and the compiler itself got a little speed boost.

Hopefully there's just one last minor thing blocking me from really getting into the meat of SSA construction. Once SSA construction is finished, I should have enough pieces in place to start compiling and executing generic programs. After that, it's just optimization passes and further improvements to semantic analysis (probably along with new language constructs at some point in time...)
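For readers unfamiliar with SSA construction, here is a minimal Python sketch of the renaming step for straight-line code (no phi nodes; purely illustrative and unrelated to the actual compiler's internals):

```python
# Minimal SSA renaming for straight-line code: each new definition of a
# variable gets a fresh version number, and uses refer to the latest one.

def to_ssa(instrs):
    """instrs: list of (dest, op, src1, src2) tuples."""
    version = {}

    def use(v):
        return f"{v}{version.get(v, 0)}"

    def define(v):
        version[v] = version.get(v, 0) + 1
        return f"{v}{version[v]}"

    out = []
    for dest, op, a, b in instrs:
        a, b = use(a), use(b)   # rename uses before the new definition
        out.append((define(dest), op, a, b))
    return out
```

The hard part of real SSA construction is placing phi nodes at control-flow joins, which this sketch deliberately omits.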

Am I right in thinking that this language/compiler has, at its foundation, similar principles to Java/C#, but is very specialized and more low-level, to be able to apply tricks like queuing faster versions of code blocks when conditions allow for them? Or are there too many optimizations or differences in general, making the comparison just silly?