Here's something interesting I've been investigating for the last few days... All of Koei's NES games except for Mahjong Taikai contain a standard bytecode interpreter, and the games consist almost entirely of bytecode except for the interpreter itself, the code that runs in NMI (PPU transfers and music/sound), and some "syscall" routines that perform tasks like bank switching and PPU register setting.

There are two versions of the bytecode interpreter, one used by the MMC1 games (Nobunaga's Ambition, Genghis Khan, Romance of the Three Kingdoms, and Famicom Top Management) and one used by the MMC5 games. The only differences between the versions are that the MMC5 version uses the mapper's multiplication registers, and has an extra routine that seems to be an optimized version of the syscall dispatcher for syscalls that only take one argument for calling syscalls from native 6502 code (the standard dispatcher copies arguments from the bytecode stack to a set of static addresses in zero page; the fast dispatcher expects the caller to have done this itself)

The bytecode is essentially an ideal C machine: it's mostly stack-based with a frame pointer and two 16-bit/32-bit registers, one "left" (destination for arithmetic ops) and the other "right" (source for arithmetic ops). It's integer-only but supports all the standard C operators including data and function pointers, both signed and unsigned division/modulus, signed and unsigned bit shifts, and bitfield extraction (with optional sign extension) and insertion. Almost all operations have 16-bit and signed 32-bit versions. 8-bit data are either zero-extended or sign-extended to 16 bits when read into a register, and the stack is 16-bit aligned.

Each bytecode function begins with a JSR to the interpreter, followed by the stack frame size as a negative 16-bit word ($FC $FF for a routine with two local variables, etc.) followed by the bytecode. The interpreter is reentrant, and the same bytecode instruction is used to call either a bytecode function or an assembly language one. Function arguments are pushed on the stack (with the caller responsible for stack cleanup--there's a "call xxxx and then unwind stack by n" instruction) and return values appear to be returned in the "left" register (the bytecode "return" instruction clobbers "right" but preserves "left").

The bytecode is so much an ideal C machine that I'm quite sure that the games were written in C (strings in the ROMs also have C-style formatting) which was compiled to bytecode rather than native 6502 code to save ROM space. I only wonder whether Koei wrote the compiler and interpreter themselves, or whether it's a product they licensed that may have been used by other developers as well.

Last edited by AWJ on Wed Aug 02, 2017 7:35 pm, edited 1 time in total.

I would be interested in digging into the SNES games to see if they use the same byte code system.

The fact that a couple of their games (ie Gemfire, Uncharted Waters) were released on both the NES and SNES, might support that idea. Porting would be a whole lot easier if both are running the same (or almost-the-same) interpreter.

I would be interested in digging into the SNES games to see if they use the same byte code system.

The fact that a couple of their games (ie Gemfire, Uncharted Waters) were released on both the NES and SNES, might support that idea. Porting would be a whole lot easier if both are running the same (or almost-the-same) interpreter.

Koei's early games (1980s to early 1990s) were released on numerous computer and console platforms including Z80 (PC88), x86 (PC98, IBM PC), 68K (X68K, Amiga, Megadrive) and I think one or two 6809 machines. It makes sense that they would keep as much of the game code as possible in platform-neutral C. Apparently the Megadrive versions of Genghis Khan 2, etc. are much much snappier than the respective SNES versions, which would be explained if the former were compiled to native 68000 and the latter use bytecode.

ETA: I'm still chewing through the interpreter to figure out all the instructions. I've just found two instructions that implement C switch statements, one for non-contiguous cases and the other for contiguous cases.

Making a KOEI player akin to say SCUMMVM might be nice as well. HD uncharted waters HIT ME

I think you misunderstand. This bytecode isn't like SCUMM at all. SCUMM is a domain-specific language for third-person point-and-click adventure games. This bytecode isn't domain-specific at all; it's a machine language for a virtual 16/32-bit CPU suitable to run an integer-only subset of C. The only bytecode instructions that are any higher-level than a single 8086 or 68000 instruction are the two switchtable instructions. The "syscalls" aren't part of the bytecode language per se; they get called from bytecode using the regular call instruction. The syscall dispatcher is just to copy arguments and return values between the bytecode stack and fixed zero-page addresses. It's part of the interpreter because it has to use the interpreter's stack pointer to find the arguments.

Its a VM, be it like the CLR or SCUMM its still a VM. The important part is, Does it runs lots of 6502, then drop into the VM for the maths?OrIt mostly runs in the VM and has a syscall for set CHR-BANK, Call Script, SetSprite,ShowDialog,ScrollScreen etc ?

If its the first probably not so useful for VMing but it probably gets you the simulation. If the second then you run the VM, trap the syscalls and make your code do what they need and boom you have the game. Basically like SCUMMVM, although the syscalls are probably game specific?

Its a VM, be it like the CLR or SCUMM its still a VM. The important part is, Does it runs lots of 6502, then drop into the VM for the maths?OrIt mostly runs in the VM and has a syscall for set CHR-BANK, Call Script, SetSprite,ShowDialog,ScrollScreen etc ?

If its the first probably not so useful for VMing but it probably gets you the simulation. If the second then you run the VM, trap the syscalls and make your code do what they need and boom you have the game. Basically like SCUMMVM, although the syscalls are probably game specific?

The "syscalls" are also very low-level. Don't think "display character x's portait" or "print string at coordinates x, y", think "copy x bytes to VRAM upload buffer and wait for the NMI thread to upload it". Even if you trapped the syscalls you'd need to emulate every part of a NES except the 6502 to run the games, and it wouldn't get you anywhere in terms of improving their graphics. With SCUMMVM, because the language is high-level, you can add higher resolution/color depth graphics to a game or modernize its interface just by replacing some resources, but with the Koei games you'd need very deep game-specific hooks. It wouldn't really be any easier than doing the same to a NES game written in straight 6502.

What knowledge of the bytecode would help with is reverse-engineering the simulation algorithms and modifying the games.

In addition, zero page address $00 is used throughout the interpreter as a 16-bit temporary and $10 as a 32-bit temporary.

The stack starts at $05FF (in all the games I've checked) and grows downward, and the stack pointer points to the LSB of the last item pushed (a descending full stack). Some games (e.g. Bandit Kings of Ancient China) check the stack pointer in their vblank handler and if it is below $0200 (i.e. the bytecode stack has started to overwrite the 6502 stack) they display "OVERFLOW" using sprites and lock up.

Bytecode can freely use the stack as temporary storage, but the frame pointer is controlled by the interpreter and only changes on function entry and exit. Local variables and function arguments are addressed relative to the frame pointer.

The left and right registers are so (unofficially) named because all arithmetic and logical instructions operate on them as such, storing the result in left. Division divides left by right and stores the quotient in left, bit shifts shift left by right places, etc. There are no arithmetic flags; instead, there are instructions corresponding to signed and unsigned versions of every comparison operator, all of which store a boolean result (0 or 1) in left, and conditional jumps consist of "jump if left != 0" and "jump if left == 0".

The virtual CPU supports 16-bit and 32-bit operations, with the result always the same size as the operands (e.g. 16-bit multiplication produces a possibly truncated 16-bit result) All 32-bit instructions begin with a $B7 prefix byte. 16-bit operations only use and only update the lower 16 bits of the left and right registers. 8-bit data in memory is always extended to 16 bits when loaded into a register, and the stack also only supports pushing and popping 16-bit or 32-bit sized data.

Last edited by AWJ on Tue May 16, 2017 10:48 am, edited 1 time in total.

Bytecode is organized into functions, every one of which begins with a machine language JSR to the interpreter, followed by the frame size needed for the function's local variables (as a 16-bit negative number), followed by the actual bytecode. The first thing the interpreter does is pop two return addresses from the 6502 stack. The first one is the return address of the JSR to the interpreter, which is used to find the stack frame size and the bytecode. The second is the return address of the caller of the bytecode function, which is pushed onto the bytecode stack along with the bytecode stack pointer, frame pointer and instruction pointer. The frame pointer is then set to the current stack pointer, and the stack pointer to the new frame pointer minus the requested frame size (actually, since the size is stored as a negative number, it's technically the frame pointer plus the frame size)

The bytecode return instruction does the reverse of the interpreter prologue: it restores the stack pointer, frame pointer and instruction pointer from the stack frame, and jumps to the stored return address + 1.

In this way, bytecode functions and machine language routines can call each other freely, and bytecode function calls can be nested to any depth without growing the 6502 stack, only the bytecode stack. When a bytecode function calls a bytecode function, the return address that gets transferred from the 6502 stack to the bytecode stack is an address inside the interpreter itself.

There are a couple of apparent redundancies: As described above, the interpreter prologue stores the previous instruction pointer on the stack, but the bytecode call instruction also pushes the instruction pointer onto the stack. I have not discovered what the "return address" pushed by the call instruction is used for, if anything. Also, the interpreter prologue copies 8 bytes onto the stack (6502 return address, stack pointer, frame pointer, instruction pointer) but it subtracts 9 from the stack pointer rather than 8, leaving a one-byte "hole" which I haven't yet found the purpose of.

The following is a diagram of a stack frame at the point when the first bytecode instruction in a function is executed. Everything above the line was pushed by the calling bytecode function, while everything below the line was pushed/allocated by the interpreter prologue.

Most instructions implicitly operate on left and right and store the result in left. The only instructions that have addressing modes for data are "load", "store", "push" and "add". The "store" instruction does not take immediate arguments (obviously), and the "add" instruction only takes immediate arguments (in addition to the implicit register-register mode which every arithmetic instruction has). So there are exactly four instructions for every allowed combination of addressing mode and data size: "load to left", "load to right", "push", and either "store left" or "add to left".

Quick frame-relativeThe frame-pointer-relative address of one of 12 local variables or one of 4 function arguments is encoded in the lower 4 bits of the instruction, as follows:

Note that the interpreter doesn't look up these offsets from a lookup table; instead, to improve execution speed (since these are the most commonly used instructions) there are actually 16 versions of each instruction each with a hardcoded offset. This mode only applies to 16-bit data.

Quick immediateAn immediate value from 0 to 15 is encoded in the lower 4 bits of the instruction and zero-extended to 16 bits.

Near frame-relativeA signed offset to the frame pointer from -128 to +127 is stored in the byte following the instruction. This mode only applies to 16-bit data.

Far frame-relativeA signed offset to the frame pointer from -32768 to +32767 is stored in the 2 bytes following the instruction. This mode applies to 8-bit, 16-bit or 32-bit data; 8-bit data is zero-extended to 16 bits.

ImmediateAn immediate value is stored in the 1, 2 or 4 bytes following the instruction. A byte-sized immediate value is sign-extended to 16 bits (unlike 8-bit data read from memory, which is always zero-extended). The exception is the stack cleanup instruction, which takes an unsigned 8-bit immediate argument. There are no 32-bit immediate push or add instructions.

AbsoluteAn absolute address is stored in the 2 bytes following the instruction. This mode applies to 8-bit, 16-bit or 32-bit data; 8-bit data is zero-extended to 16 bits. Jump and call instructions also take an absolute address.

IP relativeIP relative addressing is only used by branch instructions. There are two versions of each branch instruction: one which interprets the byte after the instruction as a positive number from 0 to 255 (branch ahead), and one which interprets it as a negative number from -256 to -1 (branch back).

Last edited by AWJ on Tue May 16, 2017 6:43 pm, edited 1 time in total.

Who is online

You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot post attachments in this forum