All of them eat 100% of a CPU core and cannot run at full speed on the old low-powered netbook CPU. They give only 30-40 FPS without frameskipping. (Real performance of the Atom N550 is about that of a good Pentium 3 at ~1000MHz.)

Nestopia's result is only 40-45% CPU load, and it runs at full speed, 60 FPS! FCEUX with the old inaccurate scanline-based PPU renderer + low sound quality has the same performance.

I wonder how you managed such _heavy_ optimization of your cycle-accurate emulator!

A:

Marty wrote:

Thanks Eugene. Nice to hear from you again, hope you are well. Doing code optimizations without sacrificing accuracy can be real fun and I'm happy to see it paid off.

As for the various optimizations I did to Nestopia at the time, I heavily used the Intel VTune and AMD CodeAnalyst profilers to find hotspots in the code and also let the compiled IA-32 assembly code guide me through it.

I also made heavy use of (or abused, if you will) C++ template-style programming, or concept-oriented programming as I'd like to call it, to let the compiler do as much work for me as possible and to avoid repeating myself in code.
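To illustrate the general idea of letting the compiler do the work, here is a minimal sketch. It is not taken from Nestopia; the names `spriteRow` and `spriteRowDispatch` are hypothetical. The sprite height is a template parameter, so each instantiation is compiled with the height as a constant and the runtime choice happens once, outside the hot path.

```cpp
#include <cassert>

// Hypothetical example, not Nestopia code: the sprite height (8 or 16) is a
// compile-time constant, so the compiler emits one specialised function per
// height without the source being written twice.
template <unsigned SpriteHeight>
inline unsigned spriteRow(unsigned scanline, unsigned spriteY, bool flipV)
{
    static_assert(SpriteHeight == 8 || SpriteHeight == 16,
                  "NES sprites are 8 or 16 pixels tall");
    assert(scanline >= spriteY && scanline - spriteY < SpriteHeight);
    const unsigned row = scanline - spriteY;
    // With vertical flip, rows are read bottom-up.
    return flipV ? (SpriteHeight - 1 - row) : row;
}

// The runtime decision is made once here, then the specialised code runs.
inline unsigned spriteRowDispatch(bool tallSprites, unsigned scanline,
                                  unsigned spriteY, bool flipV)
{
    return tallSprites ? spriteRow<16>(scanline, spriteY, flipV)
                       : spriteRow<8>(scanline, spriteY, flipV);
}
```

For example, `spriteRow<8>(10, 8, true)` yields row 5, while `spriteRow<16>(10, 8, true)` yields row 13.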

Using the Intel C++ Compiler and Microsoft Visual Studio at the time, I also fine-tuned many parts of the code through compiler directives to give hints to the compiler on what to optimize for speed and what to optimize for size.

As a programmer, having knowledge of low-level stuff such as branch prediction, cache lines and other things helped a lot during development. Even if you're developing something in a high-level language such as Java, C# or Python, I believe you can still influence performance a great deal by the way you structure and arrange your code.

For reference, and maybe not surprisingly, the most performance-critical method in the whole Nestopia codebase, as I remember it, was Ppu::renderPixel(). That one I remember optimizing to be ~20FPS faster just by re-arranging some statements. That was surely a branch-condition killer, but allowing the CPU to not stall and to do other work in parallel made it almost free.
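As a rough illustration of what removing branch conditions from a per-pixel routine can look like, here is a minimal sketch. It is not the actual Ppu::renderPixel() code; `muxBranchy` and `muxBranchless` are hypothetical names, and the assumption is the usual NES rule that a pixel is transparent when its low two bits are zero. Whether the second form is really faster depends on the compiler and CPU, so it would need profiling.

```cpp
#include <cstdint>

// Hypothetical sketch, not Nestopia code. Chooses between a background
// and a sprite pixel with up to three data-dependent branches.
inline std::uint8_t muxBranchy(std::uint8_t bg, std::uint8_t spr, bool sprBehind)
{
    if ((spr & 3u) == 0) return bg;   // transparent sprite pixel
    if ((bg & 3u) == 0) return spr;   // transparent background pixel
    return sprBehind ? bg : spr;      // both opaque: priority bit decides
}

// Same result, expressed as arithmetic on 0/1 flags and a bit mask, so the
// CPU has no hard-to-predict jump to guess.
inline std::uint8_t muxBranchless(std::uint8_t bg, std::uint8_t spr, bool sprBehind)
{
    const unsigned sprOpaque     = (spr & 3u) != 0;
    const unsigned bgTransparent = (bg & 3u) == 0;
    const unsigned inFront       = !sprBehind;
    const unsigned useSprite     = sprOpaque & (bgTransparent | inFront);
    const unsigned mask          = 0u - useSprite;  // all ones or all zeros
    return static_cast<std::uint8_t>((spr & mask) | (bg & ~mask));
}
```

Both functions return the same pixel for every input, so the branch-free one can be swapped in and measured.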

Just to give my 2 cents on the Mesen part of things:

Mesen is more or less optimized to run on at least 3 different threads (emulation, frame decoding/filtering, rendering), so running it on any dual-core will result in sub-par performance, especially since I abuse spin-locks due to their low latency - but spin locks only work well so long as you actually have free cores to run them on without slowing down the other thread you are waiting on.

On the upside, this design means Mesen can run HDNes' HD packs with very little FPS drop on a quad-core machine (e.g. Super Mario Bros goes from ~250fps to maybe ~190fps on my machine).
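For anyone unfamiliar with the trade-off, here is a minimal sketch of a spin lock built on `std::atomic_flag`. It is not Mesen's actual code; `SpinLock` and `countWithTwoThreads` are hypothetical names. The waiting thread never sleeps, which keeps latency low but burns a whole core while it waits, which is why it hurts when there are more busy threads than cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch, not Mesen code: a basic test-and-set spin lock.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // Busy-wait until the flag was previously clear. No sleeping, so
        // the waiting thread occupies its core the whole time.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Two threads increment a shared counter under the lock.
inline long countWithTwoThreads(int perThread)
{
    assert(perThread >= 0);
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < perThread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

On some toolchains this needs the threading library linked in (for example `-pthread` with GCC or Clang).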

Also, a lot of features result in small performance losses - e.g. debugger, cheats, unlimited sprites, support for HDNes' HD pack format, etc. I try to optimize where I can (using VS' profiler mostly) - but I'm not going to start trying to optimize cache misses in an era where most low-end computers are already able to run Mesen at 2-3x normal speed. This made a lot of sense in 2005, but not so much in 2017 (stuff like Raspberry Pis aside).

And, this is a matter of taste of course (I'm sure some people might say the same about Mesen's code), but Nestopia's code can be very hard to process. In particular, stuff like this drives me insane:
https://github.com/rdanbrook/nestopia/b ... .cpp#L1435
It might result in slightly faster code, but in my opinion makes the code so much harder to read. This kind of thing also leads to Nestopia's PPU code being 3.4k lines, against Mesen's 1k lines.

P.S.: I'm not trying to hate on Nestopia or anything - it's a great emulator, and I've used it as a reference countless times!

You're right, Nestopia's core code is very hard to maintain. It took FHorse 2 days of debugging to understand and fix a bug in NstPpu, and it was very difficult!

Quote:

Looking at the routine Ppu::Run, you can easily see that the VBLANK flag and the NMI are handled at cycles.hClock 681 (HCLOCK_VBLANK_0), 682 (HCLOCK_VBLANK_1) and 684 (HCLOCK_VBLANK_2), that is, virtually one scanline after the VACTIVE (240) scanlines. This is fine for PPU_RP2C02 (NTSC) and PPU_RP2C07 (PAL) but not for PPU_DENDY, which needs another 50 sleep scanlines. All I did was add these 50 scanlines before HCLOCK_VBLANK_0; they are executed only when (ssleep >= 0), which is true only for PPU_DENDY. This way I left the routine's logic for NTSC and PAL intact, intervening only in Dendy mode, because ssleep is always -1 for PPU_RP2C02 and PPU_RP2C07. I hope I was able to explain it well.
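The effect of the change described in the quote can be sketched roughly as follows. This is not the actual NstPpu code; `PpuModel`, `sleepScanlines` and `vblankScanline` are hypothetical names, and the sketch assumes 0-based scanline numbering with VBLANK normally raised at scanline 241.

```cpp
// Hypothetical sketch, not Nestopia code. Mirrors the idea from the quote:
// a sleep counter that is -1 for NTSC/PAL and 50 for Dendy.
enum class PpuModel { RP2C02, RP2C07, Dendy };

constexpr int sleepScanlines(PpuModel model)
{
    return model == PpuModel::Dendy ? 50 : -1;
}

// Assumption: with 0-based scanlines, NTSC and PAL raise VBLANK at
// scanline 241. Dendy inserts its sleep scanlines before that point.
constexpr int vblankScanline(PpuModel model)
{
    return 241 + (sleepScanlines(model) >= 0 ? sleepScanlines(model) : 0);
}
```

Under these assumptions NTSC and PAL keep their VBLANK at scanline 241 and only Dendy moves, to scanline 291.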

---
By the way, even "easy" things, like minor improvements to the NSF player, FDS, and region selector, are difficult too. Feos tried to fix some other minor bugs, but could only solve the FDS one. The NSF player and region selector still need fixing. The current patches by feos are broken.

I just released Mesen 0.8.1, which contains a fair amount of speed optimizations. I know that it's still nowhere near as fast as Nestopia (and most likely never will be) - but I'm curious how much of an impact the changes had on your 1.5GHz Atom CPU.

Is there any chance you could compare 0.8.0 and 0.8.1 with a few games and let me know how much of a speed improvement you get?

On my i5 750 I get +22%, on an i3 I get +23%, and on a very old AMD Opteron (dual-core 2.0GHz) I get +26% performance. With a bit of luck, the Atom might get a +25-30% performance boost, too.
