Very productive week, and, overall, quite a bit more satisfying than last week. The week was split straight down the middle, 50/50, as I will explain.

The first few days were dedicated, as expected, to continuing my analysis of LuaJIT performance, getting profiling tools in place, and performing profile-guided optimization to get a better feel for where the milliseconds were going. I'm pleased to report that this all went quite well!

LuaJIT's built-in profiler is indeed really nice. It didn't take much effort at all to set up, and the results helped me get a much better perspective on the existing code's performance. Unfortunately, the results of several profiling runs of the PAX demo weren't exactly what we like to see in performance optimization. Frequently, something like the '80/20' rule will apply in profiling and optimization: it's common to find that a small minority of the code is consuming a majority of the CPU time (hence the oft-abused quote "premature optimization is the root of all evil"). This is actually what we want to see, since it means concentrating effort on that small minority can yield big gains. Alas, it wasn't the case with the demo. The profiler reported a lot of very small percentages across the entirety of the code, with the best 'bottlenecks' peaking at maybe ~5%. The meaning of such a result is basically: there's not much we can do to reap big performance gains; at least, not much we can do easily.

Still, the profiling was helpful enough for me to weed out the peaks that did exist (mostly by transferring them to the C library), resulting in a decent gain of somewhere between 25% and 40% perf. At the end of it all, my heavy-duty machine was able to run a 1000-ship battle at 60 fps with a millisecond or so to spare. Remember, of course, that not all logic is implemented, so this isn't representative of what you'll be able to do in LT, although it's encouraging. The goal for me has always been to allow ~100-ship-battles at decent framerates, and at this rate, I believe we'll be able to do that.

That result, among others, has convinced me that we're safe to move forward with LJ. I'm not able to hop off the fence and plant my feet as firmly as I'd like to in the "this will work" grass, but for now, what I've seen has allowed practical me to conclude that, with continued profiling and a careful eye on the milliseconds, we will be able to proceed with LJ. This is also taking into consideration that, if things do start to get too heavy, I'll likely be able to make enough little scale cuts here and there to push out a still-high-quality LT 1.0. In other words, I really just wanted to remind you all that I'm keeping practicality in mind and am open to, for example, having to slightly lower the scale of system economies or out-of-system simulations, etc., if it means the difference between 60 and 10 fps.

Quite exciting! I'd be using more smileys and squirrels if I weren't so scared of being excited about solutions.

---

Now, in complete contrast to the first half, the latter half of my week was spent hedging my bets. Several people have asked the question "what happens if LuaJIT falls through?" The answer is that I do have quite a few options remaining in my mental priority queue. I explained in my 'State of Limit Theory' post that, at that point, I was splitting my effort roughly 50/50 between LJ and a different solution involving code generation from script. Since then, the former has escalated to taking most of my time, what with PAX and the excitement thereafter of having something I can play and iterate on. Now that I've decided to move forward cautiously with LuaJIT, I do intend to resume my 'hedging' efforts. Although I'll continue allocating the majority of my time to 'LT in LJ' in hopes that it'll pull through for us, I'll still be giving ~20-30% to R&D for the next potential solution in the priority queue. This week, in an attempt to spin up that next solution and give it some momentum, I gave it a one-time boost of about half my time.

So, currently sitting at second place in the queue is what I think of as a 'nuclear' option -- a sort of fail-safe, brute-force option. Indeed, it may be a bit scary that a nuclear option is next in line behind LJ, but Practical Josh is thinking of it this way: "I'm getting really tired of solutions failing; had I gone with a nuclear option in the first place, I'd be done by now." Flawed logic, because I didn't have the know-how to pursue this two years ago (again, the knowledge gained from those failed attempts is nontrivial!) But, now that the option is accessible to me, I intend to give it a low-priority JoshOS thread as my hedge against LJ failing, at least until something really promising displaces it in the queue.

Over the past few days I've worked hard to spin up this solution, and I already have some solid results to show for it. I built a working x86 assembler/linker (not full-featured yet, of course, but working as in 'capable of generating in-memory programs & functions using a restricted instruction set'). Just a few hours ago I had my first successful test run, in which I used it to create some simple math functions at run-time. They worked! It was an intense few days of reading hundreds of pages of Intel's software developer guide to better understand CPU architecture and, more importantly, to understand enough to translate assembly to machine code (that's what an assembler does, among other things). Let me reiterate, since some of you are probably scared that I've lost my mind now: this is for plan Z, in the event that LJ fails (and fails hard enough for me to walk away from that substantial time investment) and no other solution manifests in the meantime. We should hope that it ends up being nothing more than a nice learning experience for me. Which, by the way, is very refreshing to have every now and then. My brain feels like it got to go on vacation during the second half of the week thanks to this work (yes, I love learning enough that reading Intel manuals and looking at opcodes in hex feels like a vacation). I was surprised at the relative ease of doing it: I initially assumed we were looking at months until any results at all; it turned out to be a few days.

This so-called nuclear option allows something along the lines of 'compiled LTSL,' i.e., LTSL running at nearly the same speed as pre-compiled code. Some people have asked about the feasibility of such a solution before, since it seems like a somewhat-straightforward way of solving the original problem: we ran into one kind of limit with C++ and another kind with LTSL. I honestly didn't have the knowledge to do that kind of thing before. But, having spent so much time with intermediate representations (thanks, failed Python solution), JITs & asm (thanks, LJ + failed Python solution), and codegen (thanks, C++ codegen, along with the 32 other metacompilers I've written in my life...), I now do. Best of all, I have a number of attack vectors for doing so (TCC, LLVM's IR/code generator), but the most appealing to me (for plan Z) is the one involving the minimal number of intermediate pieces that could fail: my own direct-to-machine-code, in-memory compiler/assembler/linker. I already wrote an LTSL 'compiler' that takes it as far as an expression tree (essentially an executable AST), and now I've got the humble yet promising beginnings of the latter parts.

It might sound like a monumental task, but the reality is that I only have to implement a tiny fraction of what 'general-purpose' tools do. In other words, when I first created LTSL and its compiler, I didn't worry about making it a feature-complete language; I worried about making it super easy to write ship generation algorithms, UI, and gameplay code in. Similarly, for the assembler/linker, I'm not concerned with outputting executables or shared libs, nor with PLTs, GOTs, or many of the other complexities. I'm concerned with writing a relatively-small subset of ops to memory and executing them from a program that's already running (the LT core) in order to quickly evaluate things (like LTSL expression trees). When you cut the problem down to the core, it's a lot less scary. Still not easy, of course, but totally within the realm of feasibility.

(I know that I'll still catch flak from some people on this, despite stressing that it's plan Z and not by any means the focus of the coming weeks/months of development...but hey, I said I was going to be honest about the good, bad, and ugly.)

---

Finally, the other 10% of my week was spent creating the beginnings of a benchmarking utility to help me be more precise about my observations and decisions when it comes to all these different solutions and their respective performance. This was, in part, motivated by a simple test that I performed pitting C, unoptimized C (i.e. no compiler optimization), D, Lua (the standard interpreter), LuaJIT, and Python against one another in a small bit of code. In doing so (and finding a few interesting oddities along the way), I realized that, for someone who spends so much time thinking about a very hard problem related to perf, I've got a startling deficit of objective, quantitative data to back me on my calls. It's been a good week for quantitative measurement, what with the profiling runs and all. I decided that I need more of that in my life.

It's a very simple little utility (written in Python!), but with a few more hours of work it'll help me record concrete information about relative performance in the face of many variables (language/solution, piece of code, machine, OS, CPU architecture, GPU, etc.). As I continue development, I plan to toss new benchmarking tests into the mix to help me stay abreast of FPLT in a precise way. Hopefully I'll be able to quote precise figures in the future instead of just "too slow," "close but not quite," "good enough," or "really fast." I'll also have a way to be less on-the-fence about things like LJ, since I'll have hard data to say what is and isn't working.

---

This coming week, I'm excited to say that we resume 'development as usual,' at least in some sense. The majority of my time is to be devoted to implementing more LT in LJ rather than scrutinizing the existing code for performance, which should prove to be a relief from walking on eggshells with a profiler in hand! Of course, the minority is to be devoted to 'assembling' a nuclear warhead, so that's a rather fun contrast.

In all, I'd estimate a 100% chance of fun!

PS ~ I'm still working on the whole brevity thing when it comes to my logs. They look a lot smaller in full-screen vim and somehow become startlingly long when I paste them into the forum's post editor.

b8 39 05 00 00 5d c3

“Whether you think you can, or you think you can't--you're right.” ~ Henry Ford

*is confused about the difficulties in writing an assembly->machine code translator*
Isn't that just reading a table and filling in numbers?
(I literally did that by hand in school for a while; it was a pain. But you aren't a real programmer until you sit there with paper, a biro, and a hex entry field for your microcontroller.)

Cornflakes_91 wrote:*is confused about the difficulties in writing an assembly->machine code translator*
Isn't that just reading a table and filling in numbers?
(I literally did that by hand in school for a while; it was a pain. But you aren't a real programmer until you sit there with paper, a biro, and a hex entry field for your microcontroller.)

In this case you would have to build a general case to cover more than one CPU, wouldn't you?
You would need to find/build tables for every system you want it to possibly run on.

Cornflakes_91 wrote:*is confused about the difficulties in writing an assembly->machine code translator*
Isn't that just reading a table and filling in numbers?
(I literally did that by hand in school for a while; it was a pain. But you aren't a real programmer until you sit there with paper, a biro, and a hex entry field for your microcontroller.)

In this case you would have to build a general case to cover more than one CPU, wouldn't you?
You would need to find/build tables for every system you want it to possibly run on.

That's what the x86 / AMD64 instruction set comes from: unified assembly code.
You can have extensions to that, but the base set is always there.

(Which is what screwed me over when I acquired a copy of No Man's Sky and couldn't run it, because of special instruction set extensions.)

I'm Computer Science dumb... Am I reading that right in that Plan Z is to create a whole Operating System for LT, an equivalent to Linux or Windows?

Challenging your assumptions is good for your health, good for your business, and good for your future. Stay skeptical, but never undervalue the importance of a new and unfamiliar perspective. ~ Imagination Fertilizer
Beauty may not save the world, but it's the only thing that can

I still vote for the nuclear option, if only because I would love to write generic non-C and get C-like performance. :V

Good to see we are now in DEFCON 2, and are preparing the code warheads for an all out assault on one-point-oh.

Whew, I'm glad to see this kind of reaction. Even though it's still the back-up/nuclear option, I thought I was going to get railed for it... glad you guys see the point.

Cornflakes_91 wrote:*is confused about the difficulties in writing an assembly->machine code translator*
Isn't that just reading a table and filling in numbers?
(I literally did that by hand in school for a while; it was a pain. But you aren't a real programmer until you sit there with paper, a biro, and a hex entry field for your microcontroller.)

*Enjoys that you are confused about it*; it definitely takes a smart person to be confused about why it's not trivial.

It's more than that, but still not a huge deal. Assembly (at least for CISC architectures like x86 and x64) is still an abstraction in the sense that it abstracts the various opcodes into equivalent asm. Some opcodes are very simple (ret = 0xC3). Some have a few forms that vary based on whether the operands are registers, immediates, or memory values (most binary ops, e.g. add/adc, sub/sbb, (i)mul, (i)div, etc.). Often these are specialized to different opcodes in the case of an 8-bit immediate. Then some are just plain difficult due to the number of possibilities and having to encode those possibilities in the opcode bitfields (mov). Some have specializations when eax is the first operand.

The most difficult part for me was encoding memory locations like [esp + 4 * ecx + 0xC], for example, for mov. There are a huge number of possible combinations, and generating the correct opcodes requires understanding the so-called ModR/M and SIB bytes for memory addressing, which was a lot of what I needed the Intel docs for. Especially annoying (but necessary) is that even the ModR/M + SIB scheme has exceptions, like esp as a base, or ebp as a base requiring an 8-bit displacement (IIRC). None of this is visible at the assembly level. Again, it's not rocket science, but it takes time and a thorough reading of instruction set references.

Then there's linking, which requires building a relocation table during codegen & patching addresses into the binary for control transfers like jmp, jcc, and call. Not difficult, but yet again it takes some time.

But yeah, I suppose cases + tables within tables + math.

SSE/SSE2/SSE3/SSE4 will be more difficult but that bridge is pretty far away.

Cornflakes_91 wrote:That's what the x86 / AMD64 instruction set comes from: unified assembly code.
You can have extensions to that, but the base set is always there.

(Which is what screwed me over when I acquired a copy of No Man's Sky and couldn't run it, because of special instruction set extensions.)

Yeah, there are some differences. The instruction set isn't unified, but the processor micro-architecture is built in such a way that x86 can be executed in compatibility mode very easily on x64 architectures. Native x64 assembly is a bit different: 16 general registers (rax, rcx, rbx, ..., rsp, rbp, etc. instead of eax, ecx, ebx, ..., esp, ebp), a different calling convention (quite a bit better than cdecl, actually!), no real segment register usage (good), and some other stuff, the main complication being knowing when you need to use REX prefix bytes to promote an operand to 64 bits rather than 32 (and generally keeping track of operand sizes).

Overall I don't think it will take much more effort to extend the assembler to x64, but I wanted to get the simpler x86 working first.

And yeah, as for extensions, it sucks that NMS didn't bother to use feature detection before trying special ops. What processor are you running? They're either using some really fancy ops (like AVX / AVX2, maybe; it wouldn't surprise me, considering they targeted a version of OpenGL that most drivers didn't fully support at the time...) or you've got a really old CPU.

Hyperion wrote:I'm Computer Science dumb... Am I reading that right in that Plan Z is to create a whole Operating System for LT, an equivalent to Linux or Windows?

Heh no no...not THAT nuclear... What Silverware said.

LT: Ships with custom, LT-Inside LT386 processor and 1080LTX graphics card with 64LTs of RAM! Only supports LTOS and LTSL. LTSoft Paint and FreeCelLT not included.

Talvieno wrote:Great post, Josh! I'll get it moved to the RSS/News feed thread.

Thanks sir


It's a pretty old AMD CPU (1090T) by now, but age isn't the problem; it's the even older, Intel-only (at the time of release for my processor) instruction set extension.
My processor can do SSE4.1(?), and 4.2 was required, or something along those lines.

It's a pretty old AMD CPU (1090T) by now, but age isn't the problem; it's the even older, Intel-only (at the time of release for my processor) instruction set extension.
My processor can do SSE4.1(?), and 4.2 was required, or something along those lines.

Ah, that's interesting, thanks for the info. I was planning to assume up to SSE2, and to utilize up to SSE4.1, but only after checking for support.

Also, that's ridiculous: SSE4.2 adds literally 5 instructions (and I don't see them being critical to 3D/game logic in any way?), whereas SSE4 added like 50+. I get requiring SSE4 (kind of)...but 4.2... Shame, shame.


It's a pretty old AMD CPU (1090T) by now, but age isn't the problem; it's the even older, Intel-only (at the time of release for my processor) instruction set extension.
My processor can do SSE4.1(?), and 4.2 was required, or something along those lines.

Ah, that's interesting, thanks for the info. I was planning to assume up to SSE2, and to utilize up to SSE4.1, but only after checking for support.

Also, that's ridiculous: SSE4.2 adds literally 5 instructions (and I don't see them being critical to 3D/game logic in any way?), whereas SSE4 added like 50+. I get requiring SSE4 (kind of)...but 4.2... Shame, shame.

Read up on it again: my CPU can do an AMD variation on SSE3 called SSE4a, and NMS utilised SSE4.1 up to some point (it was apparently patched, but I couldn't be assed to get a patched version).

Have you tried reaching out to devs of similar games to see how they've done things in the past? NMS, X3 & X Rebirth, Elite, Rebel Galaxy, Space Engineers, Empyrion, etc. These are all games with strong similarities to LT, and in fact LT is often mentioned alongside them in articles discussing these kinds of games.

Your name and your game have been out there for years, I'm sure they'd be willing to talk to you for an hour or so.