(Hope this is okay to post here, also posted to NESdev for more feedback.)

I know I'm a bit of an outlier here with using cooperative-threading to design emulators, and that most will use state machines that run each chip the minimum amount of time, but I ran into a rather interesting challenge with threads, and was curious if anyone here had some input.

The Sega Genesis + Sega CD turns out to be a bit of a nightmare of a design to emulate.

You have two 68Ks, a Z80, a VDP graphics chip, PSG audio, YM2612 audio. And tons of components in the Sega CD that I basically roll into its 68K: a graphics rotozoom ASIC, CD controller, CD driver, CD-DA playback, etc. Then there are controllers that may need threads of their own, cartridges as well, etc.

The way my scheduler works is, every time a thread is running, and it does something that takes time, I increment that thread's current clock cycle. There's some extra math in there to scale the value so that each thread can run at different clock rates.
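To illustrate the idea, here's a minimal sketch of a per-thread clock with frequency scaling (the `Thread` class and its member names are hypothetical, not my actual code; the real scaling I use is discussed later in the thread):

```cpp
#include <cstdint>

// Each thread keeps its clock in a shared time unit, so clocks running at
// different frequencies can be compared directly: each step is scaled by a
// per-thread multiplier derived from a common base rate.
struct Thread {
    uint64_t clock  = 0;  // current time, in shared units
    uint64_t scalar = 0;  // shared units per native clock cycle

    void setFrequency(uint64_t frequency, uint64_t baseRate) {
        // baseRate is a common multiple of all device frequencies
        scalar = baseRate / frequency;
    }

    void step(uint64_t cycles) {
        clock += cycles * scalar;  // advance in shared units
    }
};
```

With this, an 8 MHz chip stepping 16 cycles and a 4 MHz chip stepping 8 cycles land on the same clock value, since both represent the same amount of real time.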

Many chips share RAM and resources, and so you need to make sure the current thread doesn't perform an operation that could affect the other chips while it's ahead in time. So every time the CPU accesses a PSG register, it has to catch up the Z80 *and* the PSG, since the Z80 can also access the PSG. The same for the YM2612. In fact, the Z80 can access almost the entire bus of the 68K, and vice versa, so it ends up being a lot of synchronization.

Once every emulated frame, I find the smallest clock value of all the threads, and then subtract that value from every thread's clock value, which prevents the clock counters from ever overflowing.
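That normalization step might look like this (a sketch over a plain vector of clock values; my real scheduler tracks Thread objects):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Once per emulated frame: subtract the smallest clock value from every
// thread's clock, so the counters never grow without bound and overflow.
void normalize(std::vector<uint64_t>& clocks) {
    if (clocks.empty()) return;
    uint64_t minimum = *std::min_element(clocks.begin(), clocks.end());
    for (auto& clock : clocks) clock -= minimum;
}
```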

So notice how the YM2612 can't modify the PSG. As a result, it doesn't need to sync the PSG right now and can happily keep running even if it's ahead of the PSG, so long as it doesn't run so long that the PSG runs out of buffered audio samples.

The design works well, but with one glaring flaw: let's say the APU runs for a few clock cycles, and calls step. It is currently behind the CPU, but it is ahead of the YM2612. So it switches to it. The YM2612 runs for quite a while (my core is sample-based, so it eats 144 clock cycles per sample), and now it's quite a ways ahead of the CPU, so then it switches over to the CPU. BUT!! The CPU is still behind the APU, and now the CPU goes on to perform a write that affects the APU.

The naive solution to that is: every time any thread is about to do something that would influence another processor, we scan through every single thread, and find the thread with the smallest clock value, and if that's not the current thread, switch to it.

But that means we have to loop over a list of threads, compare their clock values, and keep track of which thread has the smallest value. And when we're the YM2612, that means synchronizing the PSG potentially, even though we usually don't care about that. And synchronizations are *painful*, so this approach ends up costing me about 20% more performance and I absolutely cannot spare a hit that big anymore.
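The scan itself is trivial; it's the synchronization it forces that hurts. Something like (hypothetical names):

```cpp
#include <cstdint>
#include <vector>

// The naive check: before any cross-chip operation, scan every thread for
// the smallest clock value; if the current thread doesn't hold it, switch.
struct SchedThread {
    uint64_t clock = 0;  // plus stack frame, resume(), etc. in the real thing
};

SchedThread* findMostBehind(std::vector<SchedThread*>& threads) {
    SchedThread* behind = threads.front();
    for (auto* thread : threads) {
        if (thread->clock < behind->clock) behind = thread;
    }
    return behind;
}
// usage: if(findMostBehind(all) != current) switchTo(findMostBehind(all));
```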

My initial thought was something like a binary min-heap array for the threads, so I can find min(list) in O(1) time, but that doesn't solve the problem with synchronizing threads that aren't important.
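For completeness, a min-heap version could look like this (`std::priority_queue` with `std::greater` is a min-heap; when a thread's clock advances you'd push a fresh entry and lazily discard stale ones on pop). As noted, it only speeds up finding the minimum; it doesn't avoid synchronizing threads we don't care about:

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// (clock value, thread id) pairs, ordered so the smallest clock is on top
using Entry = std::pair<uint64_t, int>;
using MinHeap =
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>>;
```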

I can also unroll the synchronization logic, for instance in the YM2612's case.

It's an annoying increase of code that I can hide in a library function, and it mostly works ... until you have three threads. Or four. Or, in the case of the Genesis CPU, the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.

So right now, the naive approach seems to be the most likely to work. But as it's way too slow, I was curious if anyone had better ideas?

I don't have time to do this justice right now, but I'll get back to you on this soon. My problem here will be reducing my comments from book length. When I designed Exodus, I deliberately took different approaches on a lot of things, with the threading and execution models being no exception. I can share with you a bit of a postmortem on the relevant aspects, what worked and what didn't, and what I plan to change in the future. There are a few things in there that I think will be directly relevant for you. You might also be able to offer me a few tips based on what's worked for you in the past.

One thing to remember is that nobody has really pulled this off yet. As you point out, the Mega Drive + MegaCD is a bit of a monster. Now throw the 32x on top (remember, there are MegaCD32x games). The only emulators that do that right now totally approximate the timing, and/or have game specific hacks. Cycle accurate emulation with this many active devices is hard, as you know. I tried to design Exodus to scale to this system and beyond. Some things worked, others not so much. I'll cover that when I get a chance. I'm away for the week, but I'll find some time to share my thoughts next week.

Hey cool, take your time if you're busy. Whenever you're up for it, I'd love to have a more thorough discussion.

I remember comparing my cooperative approach to your preemptive approach in the past, but comparing the SNES to the Genesis was always rather imprecise anyway. And throwing in the Sega CD (and heaven forbid, the CD-32X) as well makes it all the more interesting.

I agree this is definitely not a solved problem by any means. Our forebears had the same concerns, _Demo_ with the SNES/SA-1, and Steve Snake with the MD/Mega CD.

I thought of another way to describe the issue my old scheduler was running into more succinctly:

Thread A is about to do something, and needs threads B and C to be caught up before proceeding. A jumps to (not calls; i.e., this is asynchronous, not synchronous) thread B, because B is behind in time. B runs for a bit, and B needs to do something where thread D is caught up first, so B jumps to D this time. D does something, but only needs A to be caught up, which it is, so it jumps back to A. But the problem is that B only ran for a very short while, and so when D jumps back to A, A resumes execution even though B may not have caught up yet.

Another way to solve this could be to turn this:
* if(threadA.clock() >= threadB.clock()) threadB.resume();
Into this:
* while(threadA.clock() >= threadB.clock()) threadB.resume();

That should all but guarantee us that B is caught up before proceeding.
EDIT: and it does. Fun. Well, at least I ended up doing a lot of code cleanup and boilerplate removal.

Thanks for bringing this subject up, it's really interesting. My main point is that I need more time to think about all this, because it brings a lot of knowledge together. But so far my developing thoughts are:

General theory says you should make a distinction between a thread and a workload: threads should be ONLY execution units, completely decoupled from data, so any thread could be executing any workload interchangeably. Apart from watchdog threads, you should have as many threads as physical execution units in your system.

CPUs are designed to work alone with memory and to maximize its usage (read RAM or cache, whichever applies), because if you put more than one CPU on the same bus, performance drops below the 50% mark. So all sorts of protocols are devised to communicate with other devices, and this gives us the opportunity to use those protocols as stop and start points for the workloads.

I think it would also be a great exercise to write the single-threaded version first, or at least a sketch, so as not to get too lost in the details of the multithreaded implementation, and to understand exactly what we want to accomplish.

the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.

A workload for the controllers seems a bit overdone, is there any special reason for that?

And finally, this kind of development deserves a rant about current CPU design, and an overview of the future, so we don't get too obsessed with the multithreaded implementation; probably in the next post.

General theory says you should make a distinction between a thread and a workload:

I'm really an oddball here. My threads are actual cooperative threads, each with their own stack frames.

But there's no pre-emption. One thread runs until it needs state from another thread that's behind, and then it will jump to the other thread, swapping out the stack frame and all.

The result is an emulator that is *almost* devoid of state machines. I only really draw the line at things like flash chips. In an ideal world with unlimited CPU power, I'd absolutely have those be threaded as well, but ... a core goal is that all my emulators have to run full speed on my box (even if my box is a very expensive one.)

To the real host CPU though, my emulators run on a single actual thread, on a single actual CPU core (well, unless you're FreeBSD and like to constantly migrate threads between CPUs >_>)

A workload for the controllers seems a bit overdone, is there any special reason for that?

So far, I use it for latching the Super Scope and Justifier light guns at the right point in time instead of once per frame, and I also use it for the Genesis Fighting Pad timeout.

I have used it in the past and likely will again for UARTs (eg asynchronous serial bit-banging protocols like modems.)

Ordinary controllers in the current WIP are not threaded.

And finally, this kind of development deserves a rant about current CPU design, and an overview of the future, so we don't get too obsessed with the multithreaded implementation; probably in the next post.

The problem with single-threaded emulators is that single-thread performance has stagnated over the last decade or so. Other than tiny improvements in IPC efficiency, virtually all work has gone toward adding more and more threads in parallel. We've seen a huge shift in game programming in the same decade, from single-threaded games to ones that make full use of all threads. Emulators must do the same to improve accuracy while still performing well on current systems.

I have never done low-level programming and I'm completely out of my depth here, but one time I contemplated, just for the sake of argument, how emulating multiple devices down to the cycle could work... It's very possible that I'm completely wrong here, but couldn't you keep an array of variables in a core thread that would tell which part of the hardware is busy and what cycle each chip is at, and use that as a basis for timing and cross-chip access?

For example, you have a chip running at 8 MHz, which means it executes 8 million cycles a second, so every time it does something, you increment an "elapsed time" variable for that device by (cycles the operation takes) * (1 / clock speed). So an operation that takes 16 cycles at 8 MHz would increment the elapsed time by 0.000002. If it locks Bus A, you again note that in a "global" variable next to the "elapsed time". Then you proceed to the next device with the lowest "elapsed time" count, and do an operation there. If this next device wants to access Bus A, then we will know that it can't, because we can check the variable that tells us whether it is accessed or not, and then we can execute whatever the chip does in this case (like check the bus every cycle, or take priority over the other device, or whatever).

Of course for this you also need a specific order as to which chip takes priority to which bus / memory, which then gets more complex because there might be exceptions (like, chip C has last access to everything, except memory A, where it takes priority over everything)… and so on.
This of course assumes that you know exactly how many cycles each operation on a chip takes, otherwise you get a huge mess until you at least approximate the proper cycle lengths, which can take months of testing or even trial and error.

Every few seconds of execution, we can compare the device clocks and shave off a set value from each of them, to prevent our time counters from getting too large and overflowing. Or perhaps whenever we hit the one-second mark for elapsed time on one device, we check all devices' clocks to see if they are above 1 second, and if they are, we subtract 1 second from all of them (that way we only need to check all values as many times per emulated second as we have devices, which is still faster than doing it every time the time counter advances, and more reliable than doing it at random intervals).

I feel that this would impose a significant amount of wasted CPU time, since every emulated operation hangs on how fast we can access our time counter. At worst we'd be bottlenecked by memory speed; at best, by shared CPU cache speed (L3 is shared across all cores in multicore x86 CPUs, is it not?). So in hindsight, this approach would be more suitable for a single-threaded emulator, because there's extremely little that each thread could do on its own before the need arises to look into our time counter. Now that I think about it, I suppose this is why spreading workloads over multiple CPU threads is a challenge entirely of its own.

Please consider that I never wrote anything nearly as complex and timing sensitive as an emulator, and this may be decades old information for those of you who did. This was just my own, personal brainstorming without even looking up any reading material on how to implement software emulation, let alone on how to do multi-core scheduling. It's probably really useless to you guys.

What you're describing is pretty close to what I'm doing, with a few key differences.

Instead of a global array of timestamps, each class that inherits from Thread gets its own timestamp and support functions. There's a scheduler class that automatically keeps track of every thread in existence.

You set a frequency for the thread, and it's a fixed-point version of what you mentioned: I allow up to one second of time to pass before the counters would overflow a 128-bit integer, and so I scale the chip frequency to 2^127. That gives us timing accuracy that's far beyond the most accurate timekeeping clocks in the world (64-bit integers and 2^63 would be significantly less.) Something more precise than attoseconds. If we emulated systems with only one oscillator, this would be entirely unnecessary and 100% timing accuracy would be trivial. But the more oscillators you throw in, the more difficult it gets without doing this scaling.
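The scaling boils down to something like this (a sketch assuming GCC/Clang's built-in `__int128`; the actual implementation details of my 128-bit arithmetic differ):

```cpp
#include <cstdint>

using uint128 = unsigned __int128;  // GCC/Clang extension

// Shared time units advanced per native clock cycle, scaled so that one
// emulated second spans 2^127 units regardless of the chip's frequency.
uint128 scalarFor(uint64_t frequency) {
    return ((uint128)1 << 127) / frequency;
}
```

One second of execution on any chip then advances its clock by roughly 2^127 units, so clocks derived from unrelated oscillators remain directly comparable with sub-attosecond resolution.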

Every emulated frame, I subtract the smallest timestamp from every thread, so that they never overflow.

My original proposal in this thread was basically to just force-sync whatever is the furthest behind always, but that was incurring a lot of unnecessary speed penalties. About 20-40% depending on the emulator.

The optimization I'm using is: the Genesis YM2612 needs to synchronize to the CPU and APU, but doesn't care if it's ahead of the PSG, VDP, Mega CD, etc since those can't access the YM2612 anyway. By only syncing what we need, we reduce the number of context switches by a lot.
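A sketch of what that looks like (hypothetical names; my real threads carry stack frames and go through the scheduler, but the clock comparison is the same idea):

```cpp
#include <cstdint>
#include <vector>

// Each thread lists only the threads that can affect it. Before a
// sensitive operation, it catches up just those, leaving unrelated
// threads (like the PSG, from the YM2612's point of view) alone.
struct SyncThread {
    uint64_t clock = 0;
    std::vector<SyncThread*> dependencies;  // threads that can affect us

    void run() { clock += 1; /* ...emulate until the next sync point... */ }

    void synchronize() {
        for (auto* other : dependencies) {
            while (other->clock < clock) other->run();
        }
    }
};
```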

... but it's still really painful. Preemptive thread synchronizations may be worse, but cooperative thread switching still absolutely murders modern CPUs. They *really* don't like their stack frames being changed.

About the timing of the whole machine, what I would do is use the least common multiple of every clock involved as the master clock, and then adjust for every device through division/multiplication. Which is what I do here.

Sorry, I meant to come back to this much sooner. A lot of good stuff has already been discussed so far. My reply is going to be a bit long, so apologies in advance. What started out as a simple reply has morphed into a kind of full introduction to the problem for the uninitiated. If you've already written an emulator, there's a lot of stuff here you can probably skip, but for those who haven't I'm going to share a definition of the problem, to make sure anyone who might read this is on the same page, as I doubt anyone who hasn't tried to write an emulator really appreciates all the difficulties here. I don't think what I'm going to say will contain any earth shattering revelations, but I'll try and define the problem clearly, then cover the things I learned when working on Exodus.

In real hardware, the systems we're trying to emulate (usually) have multiple "active" devices, and they run in a truly parallel fashion. The timing of interactions that occur in this system is inherent. If you have two devices, both running off a shared/related clock source, both with fixed, known timing characteristics, they'll interact in a predictable way. They may interact millions of times a second, with their operations interleaving in a consistent fashion. Emulating a single active device is relatively easy. Emulating multiple active devices at once, and getting absolutely every interaction between them in the right order with the correct relative timing, is much, much harder.

It's worth noting at this point that, historically, most emulators haven't really tried to solve this problem. Genecyst could run Mega Drive games at full speed on a Pentium 133. How could it do this? One reason was that, like most emulators after it, it didn't attempt to get perfect timing accuracy. Although a bit of a simplification, it would basically just say "run the 68000 for X amount of time, then run the Z80 for X amount of time, then run the 68000 for another X, etc". The timing accuracy was limited to saying the two devices are "within X of each other" at any point in time. As it happens, that's enough to get a lot of code working. Why? Because the timing requirements often depend on the actual code being run within the system you're emulating. You can write code that'll work even if the timing of the system is massively off. You can also write code that requires bleeding-edge timing cases to be emulated perfectly, or it all falls down in a massive heap. If you want to write an emulator that plays some (or even most) games on a given system with the apparently correct result as far as the player can see, and you don't care about more than that, you can do some really simple, naive things and get what you want. If you want to write an emulator that runs all code exactly and precisely in the same way as the original system, such that nobody can write any code that behaves differently on your emulator than on the real hardware (intentionally or unintentionally), you have to go all in and fully emulate the timing interactions faithfully.

So that's the problem, and why we want to solve it. I've said emulating at this level is harder. It's harder for two main reasons:
1. Obtaining precise information on the timing of operations the devices perform is difficult, as the details are often not documented, or where they are, the documentation is usually insufficient and/or incorrect.
2. The very nature of modern CPU architecture is ill-suited to running code which can emulate these systems at high speed.

Point 1 is well understood, and it can be solved over time with hardware testing and analysis. Point 2 is the issue being discussed here though. We want to be able to emulate a system as fast as possible, but trying to solve the timing problem creates challenges for performance.

Unfortunately, CPUs in general are poorly suited to truly concurrent work. A traditional processor wants to run a single stream of sequential instructions. Ideally, it wants that stream of instructions to spend most of its time in relatively the same areas of nicely co-located code, taking the same branches almost every time, working with data that sits close together in memory. That's what CPUs are great at doing at high speed: spinning around the same area of code, working with the same, or at least largely sequential, blocks of data. Where you start to stray from that ideal, you incur performance penalties. Now let's look at what's involved in emulating multiple active devices with perfect timing accuracy. We can't, on a single core, run code for more than one device at the same time. This means we need to interleave them, running (for example) one step from device A, then one from device B, then one from device C, then one from device A again, and so on. For devices clocked in the MHz range, we might have to loop over this sequence millions of times a second in order to emulate a system in real time. Unfortunately, this means we don't get to spend much time spinning around in a small pool of nicely co-located code working with the same data block. This means performance penalties (cache misses in particular). We also have extra instructions being run to compare the relative time of each active device, meaning fewer cycles are being spent actually emulating functionality. Our Genecyst example had the same kind of loop, but it ran hundreds of steps from each device before switching. This was better suited to getting higher performance out of the CPU. We don't have that luxury; we need to interleave the steps at a much finer level, so we inherently take a performance hit here with our single-threaded model.

Ok, so we take a performance hit, but does it matter in reality? It depends on the system you're emulating. Ideally, we'd have a single super-fast core (100GHz+ please) and run everything single threaded. That's the easiest option by far. We may still run slower than Genecyst on the same system, but if we can run anything 20 times faster than real-time, who cares right? While that's probably true, the problem is, we don't have a 100GHz core, and we can't expect one to appear anytime soon. Single-thread CPU performance pretty well flatlined 15 years back. We've hit physical limits that to date haven't been overcome, and gains year on year have been extremely modest in this area. If your system is simple enough, you might be able to get away with a single threaded approach for everything, but the more active devices you have and/or the smaller their individual "steps" in time (IE, how fast they're clocked), the more time you take. Adding more features (IE, debugger options) or more accurate emulation (IE, obscure edge cases) often doesn't come free either, and you take a bit of a performance hit, which can start to add up as byuu mentioned. Consider that the MegaDrive+MegaCD+32x combo has 5 cpus, 2 graphics chips, 3 sound chips, plus other specialist hardware. Also consider that it can easily take hundreds or thousands of cycles to emulate in software what can be done in one cycle by a specialist device in hardware. Now throw in trying to get that perfect timing accuracy, interleaving steps from each device millions of times a second. If you're trying to make all this run full speed on a 3-4GHz core single threaded, it's becoming a bit of a squeeze. The way I see it, there are two fundamental approaches to this problem:
1. Try and make a more efficient single-threaded (cooperative-threaded) model
2. Try and use a multi-threaded (preemptive-threaded) model

Byuu is shooting for option 1 here, and is having problems with getting everything done in real-time on one thread. I went for option 2 in Exodus, and had different problems, which I'll go into below. Ultimately I think if you want a model that scales to more complex systems like this, you really have no choice but to look at multi-threaded models, the trick is figuring out what an efficient multi-threaded model looks like for emulation purposes. I think you can combine the best points of both models into a kind of middle-ground, which is where I see the solution here, and I'll try and outline by the end of this post.

Now a discussion on multithreaded programming on modern CPUs in general. All modern CPUs have more than one core. It's worth noting, the only reason this is the case is because we stopped finding easy ways to make a single CPU core faster. Intel would be pumping out a single core 100GHz monster today if they knew how to build it, and we'd all be better off with it. Instead, we're stuck with multiple slower cores glued together. A multi-core CPU is well suited to running independent blocks of code, with independent blocks of data, in parallel with unknown/variable timing. Two cores can interact, however there's (relatively extreme) overhead for synchronization between the cores. Ideally two cores are occupied with completely separate non-related code and data. If you want to use multiple cores concurrently in a single program, the key to performance is to require interaction (IE, timing synchronization or data sharing) between those cores as little as possible, doing as much work as possible between interactions.

Now consider what we're trying to do in emulation and how we can make it multithreaded. First let's look at the obvious model that doesn't work. A naive approach would be to take each "active" device and run them in parallel with one device per core. Rather than that loop we talked about where we run "one step from device A, then one from device B, then one from device C", we instead create three threads, one for devices A, B, and C respectively. In those threads, we run steps for each device independently. Right away we have problems. How do we keep them in sync with each other? In the real hardware, the devices run concurrently, but the timing is fixed and inherent. With a multi-core CPU, we can (potentially) emulate the devices in parallel, but we can't reproduce the timing inherently; we need to engineer it in code. Each thread for each device would therefore need to check the execution progress of the others before executing each step, to ensure that no device gets ahead of another device. In that model, where you have N devices, you'd potentially need to check the progress of N-1 devices each step, and block/wait on them until they'd caught up to where you are before advancing. As discussed above, modern CPUs are bad at doing this: they want you to synchronize between threads as little as possible, running large amounts of code between each interaction. We have the reverse, with millions of interactions per second, and (usually) small amounts of code between each interaction. The CPU doesn't like this. This approach is very slow, and it actually gets worse the more devices you add. You're going to run significantly slower in this model than you would have using the simple single-threaded model we started with.

So what do we do? At this point, it's worth recognizing that the thing that hurts performance most for both our single-threaded or our multi-threaded models is the same: synchronization. Every time we need to switch to or wait for another device, it hurts performance. Our primary goal in either case then is to minimize the points at which we need to synchronize with other executing devices. To do this, we need to take a closer look at when synchronization is actually required, so we can try and minimize the points at which we bring things in sync with each other. To detail this, I define a few "types" of dependencies between devices:

Two-way dependencies:
A two-way dependency is where two active devices need to be kept in perfect sync. These kinds of dependencies really suck for emulation, but there's not much you can do about them. In the Mega Drive for example, the 68000 and Z80 can both take ownership of each other's buses, suspending the other CPU until the operation is complete. This is done simply by trying to read or write sections of memory. In both cases, these events occur unpredictably without the involvement of the suspended CPU. This is an example of a worst-case, unsolvable two-way dependency. You can never run the 68000 ahead of the Z80: if the Z80 attempts to access the 68000 bus, but you've already advanced the 68000 a few clock cycles ahead and it's gone and performed a bus operation that never should have occurred because it was supposed to be suspended, you've already advanced with incorrect timing. The same is true in reverse: if the Z80 is further ahead when the 68000 takes bus ownership, you'll already be off with your timing. The only real solution I know of here is to run the devices in lock-step, synchronizing timing after each indivisible execution step where external changes are either made or become visible which can affect the execution of the device. In Exodus currently, where a two-way dependency like this exists, I advance both devices together on a single thread in lock-step, IE, I fall back to a single-threaded execution model between these devices.
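The lock-step fallback could be sketched like this (hypothetical names, not actual Exodus code): always advance whichever device is furthest behind by one indivisible step, so neither device can ever observe the other with incorrect timing.

```cpp
#include <cstdint>

struct Device {
    uint64_t clock = 0;
    uint64_t stepLength = 1;  // cycles per indivisible execution step
    void step() { clock += stepLength; /* ...emulate one step... */ }
};

// Advance both devices to `until`, one indivisible step at a time,
// always stepping the device that is currently behind.
void runLockStep(Device& a, Device& b, uint64_t until) {
    while (a.clock < until || b.clock < until) {
        if (a.clock <= b.clock) a.step();
        else                    b.step();
    }
}
```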

One-way dependencies:
These are better. With a one-way dependency, a given device is like a "slave" of another. In this scenario, it should be possible to advance the "master" device much further ahead, and only synchronize when it performs an operation where the slave could affect the result. The sub-CPU in the MegaCD is an example of this I believe. To my understanding, there's nothing the sub-CPU can do which will have any effect on the main 68000, until the main 68000 itself attempts to access memory or shared registers to which the sub-CPU has access. In this case, you can afford to run the main 68000 significantly ahead of the sub-CPU, however the sub-CPU still needs to ensure it never advances beyond the main 68000, as the main 68000 is capable of doing things like obtaining bus ownership that directly affect the sub-CPU. Depending on the code, these kinds of interactions may occur frequently or not at all, but in either case you shouldn't have to synchronize nearly as much as you would with a two-way dependency. This is a case where you should be able to safely let the main CPU run thousands of cycles ahead, then bring the sub-CPU forward thousands of cycles too in one go. This kind of dependency allows you to use two different execution threads for the two devices, but depending on the code being run, it may not perform any better, or may perform more slowly in extreme circumstances.

Shared passive device dependencies:
These are good ones, easy to manage. An example might be dual-port RAM, or some kind of specialist device that is mapped into multiple buses. If multiple devices can interact with this passive device (IE, by reading or writing particular addresses), but they don't know or care what the device is doing unless they "check" on it by accessing its mapped memory range or the like, it's a passive dependency. You don't need any synchronization at all if only one device can access this kind of passive device (IE, if there's only one bus), but if a second device can also interact with it and there's no unsolicited access, it's enough to simply synchronize the connected devices when they attempt to access the shared passive device, IE, by making sure other devices are up to or ahead of the point in time when the access is being made.

Interrupt dependencies:
These need a special note, as there's some important considerations around them. In the Mega Drive, both the VDP and YM2612 can trigger interrupts. Without these interrupts, the YM2612 and VDP (ignoring DMA for a second) are effectively passive devices. They don't make unsolicited interactions with other devices in the system, they just do their own thing and produce their own private, closed output (analog audio and video in particular). Interrupts are the exception though, they're raised by these otherwise passive devices and directly affect the subsequent operation of other devices. They are however, very predictable. In both cases, these interrupts are effectively raised on timers. Without any subsequent external state changes, given the current internal state of the VDP and YM2612, you can predict precisely when the next interrupts will be raised. In Exodus, I actually signal future interrupts ahead of time. The VDP actually flags a "future" interrupt to the 68000 if one can occur based on the current state (IE, the next VINT), and the YM2612 will do the same for the Z80. When the 68000 and Z80 execute, they "see" these pending interrupts in a buffer, but they ignore them until they reach that point in time during their execution. If the internal state of the VDP or YM2612 is changed prior to the time those interrupts would have been raised, and that affects whether that interrupt would occur, my execution model allows these "future" interrupts to be revoked, possibly with updated future interrupts then issued. This model allows me to treat the VDP and YM2612 as purely passive devices, even though they can raise interrupts, as the devices affected by them know what they will do before they do it. This means the 68000 never has to synchronize with the VDP at all unless it's directly interacting with it, and the same is true with the Z80 and the YM2612. 
Although I've focused on interrupts here, bus requests or any other kind of interaction can use the same approach, as long as the future access is predictable in advance. This allows me to handle VDP external DMA operations in a similar manner.
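A minimal sketch of the future-interrupt idea (names and structure are illustrative, not Exodus's actual implementation): the passive device posts a predicted fire time, the CPU polls its buffer against its own clock, and a register write that changes the prediction can revoke and reissue the entry:

```cpp
#include <cstdint>
#include <optional>

// A predicted interrupt: when it will assert, and which line/vector.
struct FutureInterrupt {
  uint64_t fireTime;  // absolute time the interrupt will assert
  unsigned vector;    // which interrupt line/vector
};

struct InterruptBuffer {
  std::optional<FutureInterrupt> pending;

  // The passive device posts its prediction ahead of time...
  void post(uint64_t fireTime, unsigned vector) {
    pending = FutureInterrupt{fireTime, vector};
  }
  // ...and revokes it if a state change invalidates the prediction.
  void revoke() { pending.reset(); }

  // The CPU polls this as it executes: it "sees" the future interrupt,
  // but ignores it until its own clock reaches fireTime.
  std::optional<unsigned> check(uint64_t cpuClock) {
    if (pending && cpuClock >= pending->fireTime) {
      unsigned vector = pending->vector;
      pending.reset();
      return vector;
    }
    return std::nullopt;
  }
};

// Example predictor: a timer that overflows every 'period' cycles can name
// its next overflow time exactly, with no synchronization from the CPU side.
uint64_t predictNextOverflow(uint64_t now, uint64_t lastReload, uint64_t period) {
  uint64_t elapsed = now - lastReload;
  return now + (period - elapsed % period);
}
```

The key property is that the consumer never blocks on the producer: the prediction is data, not a synchronization point.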

Private output:
This is important, because it's a note about what isn't a dependency at all. I'm going to focus on the PSG. Let's look at what the PSG actually involves in terms of interaction. It has a set of internal write-only registers, no interrupts, and no observable output from the CPU perspective at all. If you removed it entirely, the CPU wouldn't know, as long as the memory write doesn't fail. Of course, changing PSG registers affects the sound output, but does that mean you need to block, for example, the 68000 CPU emulation while you generate output samples for the PSG? No, it doesn't. You should never have to synchronize with the PSG, as long as you entirely separate out the task of producing the sound output.

Instead of generating audio samples on port writes, have a separate worker thread for the PSG that's independent of the main execution thread(s) for the system. Keep a buffer which records register state changes and the times they occur, where register writes during system execution simply add entries to this buffer. At a time of your choosing, you dispatch an accumulated block of time which you've finished executing to the worker thread, which then generates the actual output samples, applying the accumulated register state changes at the appropriate times. If you design this kind of buffer well, it can not only be lock free, but also almost entirely free of memory allocation and memory barriers, giving minimal overhead.
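Such a buffer can be sketched as a single-producer/single-consumer ring of timestamped writes (sizes and names here are illustrative). Fixed storage keeps it allocation-free, and acquire/release ordering on two indices is the only barrier cost:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// One recorded PSG port write: when it happened, and what was written.
struct RegisterWrite {
  uint64_t time;  // when the write occurred, in PSG clocks
  uint8_t data;   // the value written to the PSG port
};

template<size_t N>  // N must be a power of two
struct WriteQueue {
  RegisterWrite slots[N];
  std::atomic<size_t> head{0};  // advanced by the consumer (worker thread)
  std::atomic<size_t> tail{0};  // advanced by the producer (execution thread)

  // Called from the execution thread on a PSG port write.
  bool push(RegisterWrite w) {
    size_t t = tail.load(std::memory_order_relaxed);
    if (t - head.load(std::memory_order_acquire) == N) return false;  // full
    slots[t % N] = w;
    tail.store(t + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  // Called from the PSG worker thread while generating samples; it applies
  // each write when its sample cursor reaches the recorded time.
  bool pop(RegisterWrite& out) {
    size_t h = head.load(std::memory_order_relaxed);
    if (h == tail.load(std::memory_order_acquire)) return false;  // empty
    out = slots[h % N];
    head.store(h + 1, std::memory_order_release);  // free the slot for reuse
    return true;
  }
};
```

With exactly one producer and one consumer, no compare-and-swap is needed; monotonic counters and two release/acquire pairs are enough.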

On the same issue, let's now look at the YM2612. It has interrupts, and it has readable output. This requires at least some synchronization to be performed if the status register is read. Does it require the entire analog audio output to be produced though? No; although some obscure test register modes might add some requirements there, in general it just requires the "busy" flag and timer overflow flags to be correct. Now let's look at the VDP. It does have an external DMA mode, where it can take bus ownership and perform external reads itself. This however only occurs on demand, in direct response to register writes. It does have a readable status register, and readable data too. These will certainly require a level of processing to be performed in order to calculate the correct results. Do we need to actually produce the full video output in order to calculate this information though? No, we don't. You can use the same technique of buffering register changes and the times they occur, and perform the actual output generation in a separate worker thread that "floats" independently from the execution process and has minimal synchronization requirements.
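To show how little a status read actually needs, here's a sketch (the bit layout follows the commonly documented YM2612 status byte — bit 0 timer A overflow, bit 1 timer B overflow, bit 7 busy — but the structure is illustrative, and real timer semantics such as flag latching are simplified): the flags can be derived from elapsed time alone, with no synthesis pipeline involved:

```cpp
#include <cstdint>

// Derive the externally visible YM2612 status flags from timestamps,
// rather than running the audio core up to the read.
struct Ym2612Status {
  uint64_t timerALoaded = 0;  // time timer A was last started
  uint64_t timerBLoaded = 0;
  uint64_t timerAPeriod = 0;  // cycles until overflow; 0 = stopped
  uint64_t timerBPeriod = 0;
  uint64_t busyUntil = 0;     // end of the post-write busy window

  uint8_t read(uint64_t now) const {
    uint8_t status = 0;
    if (timerAPeriod && now - timerALoaded >= timerAPeriod) status |= 0x01;
    if (timerBPeriod && now - timerBLoaded >= timerBPeriod) status |= 0x02;
    if (now < busyUntil) status |= 0x80;
    return status;
  }
};
```

The full sample generation still happens, but on the worker thread; only this tiny calculation sits on the execution path.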

Getting to the key point summary now, I think there are two main points to maximizing performance for emulation on complex systems:
1. Minimize synchronization
2. Push work off the critical path

On point 1, I outlined some tips above around recognizing some common types of dependencies, and thinking about how to limit when you synchronize devices. You want to go fine-grained with this, and look at everything that can be read, written, or triggered when interacting with other devices. How/when/what you synchronize may vary based on the specific addresses that are written, and even on the specific data being written, where that data affects subsequent operations. If you reason about the results, you can identify the weakest possible level of synchronization that's required to still guarantee perfect timing. This is what you want to aim for. The cooperative-threading model byuu uses is a great example of this (Sidenote: do you think the upcoming C++ coroutine work has application to this issue in your model, byuu?). A spin on this model can also be used in a multithreaded model, where some devices may be time-sharing on the same thread, and some may be running independently on another thread. Anything that shaves time off synchronization is a good thing.

For point 2, pushing work off the critical path, I'll speak a bit more about it. The critical path is a simple concept, and it has essentially the same meaning/implications in both scheduling and multithreaded programming. Here's a Gantt chart I've shamelessly ripped for illustration purposes:
Those items in red are our critical path: things that need to happen sequentially, and which together form the chain that takes the longest time. In the case of the Mega Drive, that 68000/Z80 combo is going to be on your critical path (unless perhaps you're doing a dynarec or some other kind of reinterpretation/HLE, but I don't know that those techniques combine well with more precise, timing-accurate emulation). The speed at which you can emulate those two devices together with perfect timing accuracy will be the biggest decider in how fast your emulator runs.

It's important to note that, with multiple cores available, it's actually possible to take more cycles (longer CPU time) and make your code theoretically slower, but if it saves time on your critical path it can result in your emulator running faster in the real world (wall clock time). This is because idle CPU cores are effectively a wasted resource. If you can make use of them in otherwise dead time while you're waiting for your critical path items to complete, it's effectively free processing. Your goal is not only to optimize the tasks on the critical path to run faster, it's also to shift as much work off the critical path to background threads as possible. Minimizing synchronisation points, such as through the cooperative-threaded scheduler byuu has employed, optimizes the critical path. My example above of buffering register writes to the PSG and generating samples on a background thread pushes work off the critical path. In my view, the best model is to combine both techniques.

Exodus does have another lever it can pull, which is a bit more extreme. I built Exodus to support "rollbacks", which allows the system to actually get execution wrong the first time around, roll back to an earlier state, then execute again with advance knowledge about future ordering/timing requirements. This isn't something you want to rely on for normal execution, as it results in major performance penalties: all the work you throw away is dead time, and there's some overhead for implementing it. I use it mostly to support debug features, such as breakpoints, watchpoints, or other conditions which actually want to halt the emulation environment. In these cases the wasted time isn't important, as we're stopping anyway, and it allows me to ensure that the time when we want to stop is known, so no devices advance past this point and everything can be perfectly in sync when the system stops. The same feature could also be used to handle really obscure edge-case timing issues which are unlikely to occur (or never occur) in normal games released for the system, such as crazy test register features, but which would be difficult to support, or would incur significant performance penalties, even when they're not used. Triggering a rollback in these cases might be preferable for simplicity, but generally speaking you want to avoid relying on rollbacks for normal execution, or performance can become very uneven depending on the code being run on the system.

I've just written a novel, so I think I'll leave it there. Hopefully this is useful to someone. Happy to answer questions and listen to feedback/suggestions.

For illustrative purposes, here's an annotated profile snapshot of Exodus while executing a Mega Drive ROM:
I've annotated the threads that are used for particular tasks of interest in red. You can see the 68000+Z80 thread spends almost all its time going flat out. In this case I'm running without throttling, which gets around 130FPS (slightly over 2x realtime) on my system. The output threads for the YM2612/PSG/VDP run in parallel without blocking it. I could probably optimize all these output threads more aggressively, but there's not much point, as it won't make any meaningful difference to execution speed. On a quad core, I could probably afford to throw another 3-4 VDP devices into the same system and it would make little impact on execution time. With the dependencies that exist in the systems, I believe this model will nicely scale to the requirements of the MegaDrive+MegaCD+32x combo.

On C++2a coroutines ... I am not sure how they might impact my design. It depends on whether or not they're stackless (if they are, then their scope is drastically limited to only the simplest of chips, like PSGs), and whether or not they're asynchronous (if they're not, and it looks as though they are not, they're potentially limited to only easily-enslaved chips running off the same master clocks, like the SNES DSP<>SMP. Though this may be possible to work around.) Either way, I'd have to overhaul the scheduler used by all 24 of my emulators pretty substantially, and I ... have a real hangup around consistency. So it's unlikely I will go this route, but I will benchmark it once I have a compiler that supports them.

Good point about interrupts. I ended up abstracting the logic to determine the Vblank and Hblank pins from the SNES PPU, because it was just much too expensive to synchronize the CPU to the PPU at 10.74MHz precision, rather than simply on PPU register writes (which is probably around 50KHz precision), in order to poll the pins from the PPU directly. It's probably the detail that bugs me the most about my SNES emulation. A funny little detail is that the SNES CPU does a similar thing and keeps its own counters that it resets based on these signals, and it's why SNES IRQ positions aren't affected by weird PPU quirks like long dots and missing dots.

There are lots more parallel processes in these CPUs, too. For example, the SNES has an ALU where you can kick off a multiplication while the CPU continues to run. Many systems have parallel DMA controllers as well.
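A sketch of handling such a parallel unit lazily (the 8-cycle latency is an assumed figure for illustration, and a real 5A22 exposes intermediate shift-add state on early reads, which is simplified away here): record the operands and start time when the multiply is kicked off, and only compute the result when it's actually read back:

```cpp
#include <cstdint>

// Lazy model of a parallel multiply unit: nothing is stepped alongside the
// CPU; the result is derived on demand from elapsed time.
struct MulUnit {
  uint8_t a = 0, b = 0;
  uint64_t startTime = 0;
  static constexpr uint64_t latency = 8;  // assumed cycles to complete

  void start(uint8_t multiplicand, uint8_t multiplier, uint64_t now) {
    a = multiplicand;
    b = multiplier;
    startTime = now;
  }

  bool done(uint64_t now) const { return now - startTime >= latency; }

  // Simplification: "done" vs "not done" only; real hardware returns
  // partial shift-add state if the result registers are read too early.
  uint16_t result(uint64_t now) const {
    return done(now) ? uint16_t(uint16_t(a) * b) : 0;
  }
};
```

The parallel unit never appears on the critical path at all; it's just a pair of timestamps and a deferred calculation.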

Emulating the Mega Drive RAM refreshes turns out to be *vastly* more costly than the SNES RAM refresh, where the CPU *always* pauses for a fixed amount of time, no matter what, at a given point. We are mostly saved there by the CPU being both the thing that gets paused and the thing that pauses during these refreshes, so we can simulate it with simple counters and calculations instead of having to emulate, e.g., the /RAS, /DTACK, etc. pins.
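The counter approach can be sketched like this (the cycles-per-line, refresh position, and stall length are illustrative constants, not verified hardware values): once per scanline, when the CPU's position crosses the refresh point, charge a fixed penalty to its clock:

```cpp
#include <cstdint>

// Fixed-position refresh stall modeled with plain counters: no /RAS or
// /DTACK pins, just arithmetic on the CPU's own clock.
struct RefreshCounter {
  uint64_t cyclesPerLine;  // length of one scanline in CPU clocks
  uint64_t refreshDot;     // position within the line where refresh begins
  uint64_t refreshCycles;  // fixed stall length
  uint64_t lastLine = 0;   // first scanline not yet charged for

  // Returns the extra cycles to add to the CPU clock at this step.
  uint64_t step(uint64_t cpuClock) {
    uint64_t line = cpuClock / cyclesPerLine;
    uint64_t dot = cpuClock % cyclesPerLine;
    if (dot >= refreshDot && line >= lastLine) {
      lastLine = line + 1;  // charge at most once per scanline
      return refreshCycles; // CPU stalls for the fixed duration
    }
    return 0;
  }
};
```

Because the CPU is both the thing paused and the thing doing the pausing, this never has to synchronize with any other device.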

But you know ... as much as I like to say that I stick to 16-bit and earlier emulation because it lets me design my emulators the way I want, modeled closely to hardware ... this isn't actually true. The Game Boy's wave RAM/trigger acknowledgement differences on writes further poked holes into that. higan is full of necessary evil workarounds to keep above 60fps.

FPGAs are such an easy way out from all of this mess. Such a shame they are not ubiquitous.

I think this discussion lacks clarifying objectives. Is "cycle accuracy" what you are aiming for? Is it speed? Perhaps a mix of both?

And then, what is "cycle accuracy" really? As its name says, it should mean that at the end of every cycle the chip's status is the same as the emulator's status, but I bet no emulator that truly works accomplishes that 100%.

It's like all of a sudden finding out that there is a blue and grey object on top of your head [!], and that makes you understand that in order to have a road map, you first need to make your objectives clear. Otherwise it's just comparing rocks with spaceships.

Those are good questions. Let me try and answer. When I personally use the term "cycle accuracy", what I'm referring to is getting the interactions between devices at precisely the correct relative timing, so that devices within the system interact with the same timing as real hardware. Essentially, it means that a device communicates externally on the same "cycle", IE, clock cycle, as it would have on the real system, and other devices become aware of these communications at the same cycle they would have seen them on the real system. Importantly, this doesn't imply that internally within a device, steps of execution occur in the exact same way or with the exact same internal timing as they would on the real device; rather, the device is viewed as a "black box" from this perspective, and we only care about the externally visible events that affect other devices.

As for what we're aiming for, well, again I can only speak for myself, but I'm aiming for a series of things, with priorities between them. I place timing accuracy (IE, cycle accuracy) above performance, meaning I won't do anything to make the timing inaccurate for the sake of performance. I do place some things above accuracy though, in particular aspects of architecture and design. I won't do cheap "hacks" which violate the design model I've created, even if they make it much harder to achieve timing accuracy, or they hurt performance. In total, I defined 8 objectives that my emulator aims to achieve, and I ranked them against each other to give an understanding of the overall "design philosophy" I used. If you want to read more about it, you can see those objectives here: https://www.exodusemulator.com/about/design-philosophy
That'll give you a better idea of how I approach issues of balancing accuracy and performance, as well as other goals. Other emulator authors will have different philosophies and approaches.